readium / webpub-manifest

📜 A JSON based Web Publication Manifest format used at the core of the Readium project
BSD 3-Clause "New" or "Revised" License

JSON Schema - metadata.source? #14

Open danielweck opened 5 years ago

danielweck commented 5 years ago

metadata.source

Currently parsed from EPUB (OPF XML):

metadata.source = OPF /package/metadata/dc:source/text()

danielweck commented 5 years ago

Seems to be missing from https://github.com/readium/webpub-manifest/blob/master/schema/metadata.schema.json

danielweck commented 5 years ago

Related issue: https://github.com/readium/webpub-manifest/issues/8

HadrienGardeur commented 5 years ago

I have mixed feelings about this one. While we declare it in the JSON-LD context document, we don't really talk about it elsewhere.

It's currently mapped to http://schema.org/isBasedOn but given the rather "flexible" nature of dc:source (and DublinCore in general), it's very likely that we won't get a URL out of it. I would rather remove it and provide support as part of our extensibility model by referencing the DublinCore element directly instead.

HadrienGardeur commented 5 years ago

My advice would be to map this to schema.org and use the inherent extensibility of the model in various implementations.

This could be covered in the parsing doc in the future cc @JayPanoz

danielweck commented 5 years ago

Same with dc:rights

metadata.rights = OPF /package/metadata/dc:rights/text()?

HadrienGardeur commented 5 years ago

I see two potential ways of dealing with this:

I would only follow the first approach for metadata that we consider important enough, everything else should simply use our extensibility.

mickael-menu-mantano commented 5 years ago

I'm trying to add the remaining <meta> properties into otherMetadata. But I don't know what to do with those namespace prefixes (eg. dc in dc:source). Should we remove it from the key? Or expand it like Hadrien mentioned (http://purl.org/dc/elements/1.1/source)?

I didn't find any way to get the namespace URL for a given prefix in our Swift XML lib (or access the xmlns:* attributes for that matter), so I'm not sure I can expand them for unknown prefixes or custom ones (eg. xmlns:dc-alias="http://purl.org/dc/elements/1.1/"). It might be something that can be fixed in the dependency if the feature is really needed, because it's built on top of libxml2.

Edit: This is wrong, see https://github.com/readium/webpub-manifest/issues/14#issuecomment-498610168

HadrienGardeur commented 5 years ago

Expanding them to a full URL is probably the right way to handle this for now.

mickael-menu-mantano commented 5 years ago

Okay, I think I found something in libxml2 that I could expose to Swift (node->nsDef) to get the xmlns:x attributes.

mickael-menu-mantano commented 5 years ago

I have three related questions:

  1. How do we handle multiple values (eg. several <meta> with same property or name)? The entries in the otherMetadata could be either String or [String] if multiple values are found.

  2. What do we do with refines properties?

  3. What about those kinds of tags: <dc:rights>Public Domain</dc:rights>? Do we look for any tag with dc namespace in metadata and use the tag as the key?
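
For question 1, here is a minimal sketch of the String-or-[String] idea, assuming the parsed entries arrive as (key, value) pairs in document order. `collapse` and its parameter are hypothetical names, not part of the streamer:

```swift
// Groups metadata entries by key. A key seen once keeps a plain String,
// a key seen several times becomes a [String] in document order.
// `collapse` is a hypothetical helper used only for illustration.
func collapse(_ entries: [(key: String, value: String)]) -> [String: Any] {
    var grouped: [String: [String]] = [:]
    for entry in entries {
        grouped[entry.key, default: []].append(entry.value)
    }
    return grouped.mapValues { values -> Any in
        if values.count == 1 { return values[0] }
        return values
    }
}

let collapsed = collapse([
    (key: "http://purl.org/dc/elements/1.1/source", value: "Feedbooks"),
    (key: "http://purl.org/dc/elements/1.1/source", value: "Web"),
    (key: "http://purl.org/dc/elements/1.1/rights", value: "Public Domain")
])
```

The downside of this shape is that consumers have to check for both types on every key.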

mickael-menu-mantano commented 5 years ago

I just found out that the prefixes used in meta[@property] are not XML namespace prefixes, but actually declared in package[@prefix] with a list of default prefixes (http://www.idpf.org/epub/301/spec/epub-publications.html#sec-metadata-reserved-vocabs). Great news, I won't have to fiddle with an external dependency.

Reading Systems must resolve all reserved prefixes used in Package Documents using their pre-defined URIs. Reserved prefixes should not be overridden in the prefix attribute, but Reading Systems must use such local overrides when encountered.
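
A rough sketch of that resolution, assuming the `prefix` attribute is a whitespace-separated list of `name: URI` pairs. `parsePrefixes`, `resolve` and `reservedPrefixes` are hypothetical names, the reserved table is only a subset, and sending unknown prefixes to the default vocabulary is an assumption rather than something the spec mandates:

```swift
// Subset of the reserved prefixes and their pre-defined URIs.
let reservedPrefixes: [String: String] = [
    "a11y": "http://www.idpf.org/epub/vocab/package/a11y/#",
    "dcterms": "http://purl.org/dc/terms/",
    "media": "http://www.idpf.org/epub/vocab/overlays/#",
    "rendition": "http://www.idpf.org/vocab/rendition/#"
]

// Assumed default vocabulary for properties without a known prefix.
let defaultVocabulary = "http://idpf.org/epub/vocab/package/#"

// Parses the value of package[@prefix] into a prefix -> URI map.
func parsePrefixes(_ attribute: String) -> [String: String] {
    let tokens = attribute.split(whereSeparator: { $0.isWhitespace }).map(String.init)
    var result: [String: String] = [:]
    var index = 0
    while index + 1 < tokens.count {
        let name = tokens[index]
        if name.hasSuffix(":") {
            // "name:" followed by its URI
            result[String(name.dropLast())] = tokens[index + 1]
            index += 2
        } else {
            index += 1
        }
    }
    return result
}

// Expands "prefix:name" to a full URL. Local declarations win over the
// reserved table, as the spec requires for overrides.
func resolve(_ property: String, declared: [String: String]) -> String {
    guard let colon = property.firstIndex(of: ":") else {
        return defaultVocabulary + property
    }
    let prefix = String(property[..<colon])
    let name = String(property[property.index(after: colon)...])
    if let uri = declared[prefix] ?? reservedPrefixes[prefix] {
        return uri + name
    }
    // Unknown prefix: do not fail, fall back to the default vocabulary (assumption).
    return defaultVocabulary + name
}

let declared = parsePrefixes("rend-alias: http://www.idpf.org/vocab/rendition/#\n  myPrefix: http://my.url/#")
let expanded = resolve("rend-alias:orientation", declared: declared)
```

Looking up `declared` before `reservedPrefixes` is what makes a local override of a reserved prefix take effect.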

JayPanoz commented 5 years ago

Ah yeah, just to reiterate that you should feel free to complete the parsing doc.

For starters, it’s not complete by any means, so having at least a documented reference will help discuss it and fine-tune handling.

Then there is metadata I’m honestly not familiar with, and don’t know how to handle – I legit don’t know what authors expect for some metadata for instance, but having a written reference makes it (more) easily sharable/reviewable by others.

Finally I’m confident others are more knowledgeable than me when it comes to some metadata I’ve never used as an author.

mickael-menu-mantano commented 5 years ago

For reference, I came up with this solution on Swift: https://github.com/readium/r2-streamer-swift/commit/037de4d2c15f2697176f7eb26e315dd0b01ea236?diff=unified#diff-b8124a64cd2aa700aa444e5f9a7d7232

This generates the otherMetadata JSON but is also used to access metadata from their name and associated vocabulary. This is safer than directly querying for properties like rendition:layout in case the author uses a different prefix for the rendition vocabulary.

(@HadrienGardeur My three questions are still relevant though: https://github.com/readium/webpub-manifest/issues/14#issuecomment-498591021)

To author the additional metadata, I'm using this list of "known" properties to ignore, maybe it should be something documented and shared between platforms (or better, if we share a unit test file that covers those cases):

// List of properties that should not be added to `otherMetadata` because they
// are already consumed by the RWPM model.
private let rwpmProperties: [OPFVocabulary: [String]] = [
    .defaultMetadata: ["cover"],
    .dc: ["contributor", "creator", "publisher"],
    .dcterms: ["contributor", "creator", "modified", "publisher"],
    .media: ["duration"],
    .rendition: ["flow", "layout", "orientation", "spread"]
]
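
To show how such a list could be applied, here is a self-contained sketch. `Vocabulary` is a hypothetical stand-in for `OPFVocabulary` (whose real definition is not shown here), and `belongsInOtherMetadata` is an invented helper name:

```swift
// Hypothetical stand-in for the streamer's OPFVocabulary type.
enum Vocabulary: Hashable {
    case defaultMetadata, dc, dcterms, media, rendition
    case custom(String) // any other vocabulary, identified by its URI
}

// Same content as the list above, with Sets for constant-time lookups.
let consumedProperties: [Vocabulary: Set<String>] = [
    .defaultMetadata: ["cover"],
    .dc: ["contributor", "creator", "publisher"],
    .dcterms: ["contributor", "creator", "modified", "publisher"],
    .media: ["duration"],
    .rendition: ["flow", "layout", "orientation", "spread"]
]

// Returns true when a property is not already consumed by the RWPM model
// and should therefore be exported as additional metadata.
func belongsInOtherMetadata(_ name: String, in vocabulary: Vocabulary) -> Bool {
    !(consumedProperties[vocabulary]?.contains(name) ?? false)
}

let keepSource = belongsInOtherMetadata("source", in: .dc)           // not consumed
let keepLayout = belongsInOtherMetadata("layout", in: .rendition)    // consumed
```

Keying on the vocabulary rather than on the prefix string keeps the check valid when an author aliases a vocabulary under another prefix.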

Finally, here's an example of JSON produced for publication.metadata:

"metadata": {
    "http://www.idpf.org/epub/vocab/package/a11y/#certifiedBy": "EDRLab",
    "http://purl.org/dc/elements/1.1/source": ["Feedbooks", "Web", "Internet"],
    "http://purl.org/dc/elements/1.1/rights": "Public Domain",
    "http://idpf.org/epub/vocab/package/#type": "article",
    "http://my.url/#customProperty": "Custom property",
    "rendition": {
        "spread": "both",
        "overflow": "scrolled",
        "orientation": "landscape",
        "layout": "fixed"
    }
}

from

<package prefix="
  rend-alias: http://www.idpf.org/vocab/rendition/#
  myPrefix: http://my.url/#">
  <metadata>
    <dc:source>Feedbooks</dc:source> 
    <meta property="dc:source">Web</meta> 
    <meta name="dc:source" content="Internet"/> 
    <dc:rights>Public Domain</dc:rights> 
    <meta property="rendition:layout">pre-paginated</meta>
    <meta property="rend-alias:orientation">landscape</meta>
    <meta property="rendition:flow">scrolled-doc</meta>
    <meta property="rendition:spread">both</meta>
    <meta property="og:type">article</meta>
    <meta property="a11y:certifiedBy">EDRLab</meta>
    <meta property="myPrefix:customProperty">Custom property</meta>
  </metadata>
</package>

danielweck commented 5 years ago

EPUB "reserved" prefixes: in practice they are probably never overridden (why would content creators want to take that risk) ... but they can! (even though they are "reserved")

Reserved prefixes SHOULD NOT be overridden in the prefix attribute, but Reading Systems MUST use such local overrides when encountered. As changes to the reserved prefixes and updates to Reading Systems are not always going to happen in synchrony, Reading Systems MUST NOT fail when encountering unrecognized prefixes (i.e., not reserved and not declared using the prefix attribute).

https://w3c.github.io/publ-epub-revision/epub32/spec/epub-packages.html#sec-metadata-reserved-prefixes

...so strictly-speaking, Mickael's approach makes sense :)

HadrienGardeur commented 5 years ago

How do we handle multiple values (eg. several <meta> with same property or name)? The entries in the otherMetadata could be either String or [String] if multiple values are found.

I think they could be either:

What do we do with refines properties?

This is the use case where we could use an object:

"http://my.url/#customProperty": {
  "@value": "Main value",
  "http://my.url/#customPropertyUsedInRefine": "Refine value"
}
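
A small sketch of how a parser might build that shape, assuming the refines have already been resolved to full URLs. `metadataValue` is a hypothetical helper, and keeping a plain String when nothing refines the property is an assumption:

```swift
// Builds the value for one metadata entry. Without refines the value stays
// a plain String; with refines it becomes an object whose "@value" key
// holds the main value next to one key per refining property.
func metadataValue(main: String, refines: [String: String]) -> Any {
    guard !refines.isEmpty else { return main }
    var object: [String: String] = ["@value": main]
    for (key, value) in refines {
        object[key] = value
    }
    return object
}

let refined = metadataValue(
    main: "Main value",
    refines: ["http://my.url/#customPropertyUsedInRefine": "Refine value"]
)
let plain = metadataValue(main: "Custom property", refines: [:])
```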

What about those kinds of tags: <dc:rights>Public Domain</dc:rights>? Do we look for any tag with dc namespace in metadata and use the tag as the key?

You've covered this properly in your examples IMO.

danielweck commented 4 years ago

Related issue: https://github.com/readium/webpub-manifest/issues/66