w3c / pub-manifest

W3C Publication Manifest
https://w3c.github.io/pub-manifest

How Metadata Works in the Publishing World #57

Open dauwhe opened 5 years ago

dauwhe commented 5 years ago

In much of the EPUB world, the metadata that matters is not inside the EPUB, but outside (in the form of ONIX). The metadata inside EPUBs is often wrong, is difficult to change, and there is very little incentive to make it accurate since it's mostly unused.

In the web world, page metadata directly affects search ranking, Google rich snippets, etc. There is no out-of-band transmission of metadata. There is strong incentive to make it accurate.

How do we avoid the situation with EPUB, where we've spent decades worrying about metadata, continually changing how it's expressed, without really benefiting users?

mattgarrish commented 5 years ago

This is generally why it's a bad idea for publication specifications to dive so deeply into metadata vocabularies.

We wanted to provide a framework for metadata expressions with EPUB 3, but got sucked into the metadata vortex of despair by introducing some "essential" metadata that didn't seem to exist but that looked like the non-ONIX folks would need. That's led to EPUB being looked at as having to define the essential metadata, when metadata expressions really should be figured out at the publishing/publication level.

I opened https://github.com/w3c/wpub/issues/429 in part because I see a lot of the same happening here. The more we layer in, the more it looks like what we exclude doesn't matter, and that leads to more requests to include things. Plus, the more we recommend for certain areas of publishing, the more annoying we make metadata for others.

The manifest is somewhat unencumbered now that most metadata is optional, but we still define a whole lot of concepts that really aren't essential to user agents (dates, etc.).

During 3.1, we started to look at defining prescriptive metadata guidelines for publishers using alternative means, like best practices documents. Call me old fashioned, but it still strikes me as the best balance. Let each community define what it wants to express and how it wants to express it outside the specification.

iherman commented 5 years ago

Well... we should be careful. The presence of, e.g., ONIX is clearly important for trade publishing. But we also know from our discussions that, at this moment, Publication Manifests will not be considered by trade publishing for some time, as it will keep to EPUB 3.x.

What is the situation in other areas? For example, from the little I know about scholarly publishing, the "publication" content (at least for journals) is not dominated by packaging, which puts things in a very different perspective.

mattgarrish commented 5 years ago

The presence of, e.g., ONIX is clearly important for trade publishing. But we also know from our discussions that, at this moment, Publication Manifests will not be considered by trade publishing for some time, as it will keep to EPUB 3.x.

Right, I'm not suggesting we pick a side in that debate.

I just look at the publication manifest and I see a "format" that is itself also one big metadata expression language so we're already in an ideal scenario. We don't need to turn the specification into a rehash of schema.org or dcterms or any other scheme, as these are already accommodated.

I just feel we're better off staying out of the metadata sphere as much as we can. If we can't pin a property to something the user agent needs for some specific purpose, then the property probably doesn't belong in the specification.

We need to find a way to empower the publishing communities (hint, hint) to work out the details of what metadata belongs in the manifest when an ONIX record isn't the primary source of that descriptive detail, and for these communities to publish notes or guides for each relevant publishing realm.

llemeurfr commented 5 years ago

It has struck me, ever since I came into the publishing domain, that EPUB metadata lacks a clear use case: who is this metadata made for? And the question is now the same for publication manifest metadata.

IMO they should be made for end users, helping users classify and find publications they have "acquired" and are present on their large "bookshelf" or "personal library". This is not discovery / commercial data to be used by booksellers (ONIX is made for that). This is not classification data to be used by academic libraries (MARC is good for that).

Once the use case is clear, we can decide which metadata are useful and which are not so.

gregoriopellegrino commented 4 years ago

As a note, for possible new versions of the specifications: comparing the metadata available in EPUB and those available in the pub-manifest, I noticed the lack of some information, which is used in real use cases. These are:

llemeurfr commented 4 years ago

@gregoriopellegrino the only metadata in your list whose use for end users (readers) I don't see is the rights information: if a user has a publication in hand, what use is rights information to them? Would it contain things like "you, reader, have the right to do this, but do not have the right to do that, with the publication you have acquired"?

gregoriopellegrino commented 4 years ago

In EPUBs it was used for copyright information, e.g. "© 2019 Publisher".

llemeurfr commented 4 years ago

@gregoriopellegrino if a consensus is found around a copyright notice (I would support it), then a "copyrightNotice" property would be more interesting than a "rights" property.

We can have a look at the news industry, where a copyrightNotice property is defined (https://www.iptc.org/std/NewsML-G2/guidelines/#copyright-notice) as a child of a bigger "rightsInfo" structure (https://www.iptc.org/std/NewsML-G2/guidelines/#rights-metadata).

schema.org has another way: copyrightHolder + copyrightYear. In case of consensus around the concept, we'll have to choose our way.
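To make the schema.org route concrete, here is a rough sketch of how a reading system might derive a display notice such as "© 2019 Publisher" from copyrightHolder + copyrightYear. The manifest fragment and its values are made up for illustration, and the `copyright_notice` helper is hypothetical; nothing here is defined by the specification.

```python
from typing import Optional

# Illustrative manifest fragment: the title, holder, and year are made-up values.
manifest = {
    "@context": ["https://schema.org", "https://www.w3.org/ns/pub-context"],
    "type": "Book",
    "name": "Example Title",
    "copyrightHolder": {"type": "Organization", "name": "Publisher"},
    "copyrightYear": 2019,
}


def copyright_notice(m: dict) -> Optional[str]:
    """Hypothetical helper: build a notice from copyrightHolder + copyrightYear."""
    holder = m.get("copyrightHolder")
    year = m.get("copyrightYear")
    if holder is None or year is None:
        return None
    # The holder may be given as a plain string or as an object with a "name".
    name = holder if isinstance(holder, str) else holder.get("name")
    if not name:
        return None
    return f"© {year} {name}"


print(copyright_notice(manifest))
```

A single "copyrightNotice" string would skip the derivation step, at the cost of the holder and year no longer being separately machine-readable.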

iherman commented 4 years ago

This issue was discussed in a meeting.

mrjj commented 4 years ago

schema.org has another way: copyrightHolder + copyrightYear. In case of consensus around the concept, we'll have to choose our way.

FYI, we have a practical case where something like copyrightEndYear can matter. For example, a publisher purchases a non-exclusive sub-license for a classic work such as "The Hobbit, or There and Back Again". Such a sub-license is usually time-limited to 3-10 years, because otherwise the huge capital expenditure would outweigh any possible discount. In this case it is important to preserve the known revocation date of the publisher's sub-license across the rest of the distribution chain, in case the publisher stops controlling the legal status of distribution for any reason. I usually see this date defined in the agreement papers held by the legal department and in the agreements between publisher and distributor, but there is no single place for a time mark such as "restrict download of this bundle after YYYY-MM-DD".

Schema.org is very unclear about the start year, which it defines as "The year during which the claimed copyright for the CreativeWork was first asserted": without information about the authority that asserted the IPR transfer, contract, or other agreement, it is just a four-digit number. Following the practical case: as an end user, I may want to see the publisher's copyright notice to make sure that I am not witnessing an infringement and do not incur a civil responsibility to report it. If there is a notice, that is enough for me to be sure I am not the responsible party. But if I want to check this agreement as a government worker, I first want to know which authority to contact besides the publisher. If that authority assigns identifiers to the rights transfer, that is fine, and it is clear where any identifier of this kind would go. In both cases, the year of the first copyright assertion by the holder seems to be a minor detail, especially without detailed information about all the related paperwork.

The EPUB license manifest template stored on a license server managed by the Readium platform covers this case: the status can be checked on the server, and the template also defines a time cap for any bundled EUL manifest file.

Internally, to describe license status resolution, we use a model very close to the IPR transfer model of the indecs framework. It is not a fit for DPub/EPUB 3.x, but it is a good tool for modeling different license statuses. According to this model, copyrightHolder appears to be a very unspecific note.

But the Schema.org TransferAction concept seems to be a good and compatible equivalent. It may be worth considering as a recommended description practice, and it may answer the question of where to put all these parties, dates, rights, and so on. It is really hard to define a perfect couple of fields for these needs, and CreativeWork itself is not that good for this. The legal side of business knowledge is usually about linkages that, taken together with their temporal bindings, form the legal status of all the entities involved, not about one thing with limitless properties. Personally, I see adding still more aspects to CreativeWork as a crime against sanity. So I see some sense in using the existing part of the model that covers business activity linked to the work, as intended, and reserving a place only for the EULA and for information about the responsible parties or authorities (usable in practice, e.g. a registration number or contact info) for the case of transmission or bundling.

The other thing that has turned out to be very important for our digital publishing activity is the public domain status of a work, and the (upcoming or past) date of the transition to that status. But that question is as complex as this whole topic.
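As a minimal sketch of the kind of time mark described above, assuming a distribution record carried the sub-license end date in a single field: the `sublicenseEndDate` key and the `download_allowed` helper are hypothetical names chosen for illustration, not properties of any existing vocabulary.

```python
from datetime import date


def download_allowed(record: dict, today: date) -> bool:
    """Return False once the (hypothetical) sub-license end date has passed."""
    end = record.get("sublicenseEndDate")  # hypothetical key, ISO 8601 date string
    if end is None:
        # No time limit recorded, so nothing to enforce here.
        return True
    # "Restrict download after YYYY-MM-DD": the end date itself is still allowed.
    return today <= date.fromisoformat(end)


# Illustrative record; the date is a made-up value.
record = {"name": "The Hobbit, or There and Back Again", "sublicenseEndDate": "2025-12-31"}
print(download_allowed(record, date(2025, 6, 1)))
print(download_allowed(record, date(2026, 1, 1)))
```

The point of the sketch is only that a distributor further down the chain could enforce the cut-off without access to the original agreement papers.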