Manifest files need their own MIME Media Type (because canonicalization)

There is additional processing (beyond JSON, beyond JSON-LD) required for generating a canonical manifest. Consequently, a new MIME media type is needed to signal that this unique processing is required.

Additionally, the processing steps required for canonicalization alter the meaning of the original JSON-LD. The result is a different graph containing new statements from those in the authored manifest, so a simple JSON-LD "profile" would be insufficient (as profiles can't alter the encoded document).

So, something like application/wpub+json would be best as it signals the foundational format (JSON) as well as the short name of the W3C spec which describes the required processing steps.

The WPUB spec could also declare the default @context value for the media type as done by ActivityStreams in their IANA submission.

This also has the benefit of opening the door to profiling the manifest format more clearly via the profile parameter with a value pointing to an extended context or new definition document.

This would create significant problems. The JSON-LD spec requires the application/ld+json media type in the case of an embedded manifest. Schema.org follows this rule (rightfully so) and, for example, the Structured Data Testing Tool refuses to accept a script element with any MIME type other than application/ld+json.

If we do this, we cut ourselves off from the JSON-LD world.

Schema.org does not currently define an additional processing model beyond JSON-LD, so they don't need a separate media type.

If we do this, we cut ourselves off from the JSON-LD world.

ActivityStreams introduced the application/activity+json media type because there were additional processing steps required to deal with non-JSON-LD compatible content such as the in-lining of GeoJSON lists-of-lists.

The Web of Things WG is also defining a separate media type for their JSON format because they have a required transformation step to make it valid JSON-LD.

In both these cases, there is a foundational JSON format that requires a processing step in addition to or prior to consuming the content as "pure" JSON-LD. However, both specifications state that using their custom media type comes with the requirement of an implicit JSON-LD @context value.
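The implicit-context behaviour mentioned above can be sketched in a few lines of Python. This is only an illustration: the media type and context URL are the published ActivityStreams ones, but the `IMPLICIT_CONTEXTS` table and the `with_implicit_context` helper are hypothetical names, and the rule is simplified (it only covers documents with no @context at all).

```python
import json

# Media type -> context assumed when a document does not declare one.
# Hypothetical lookup table, for illustration only.
IMPLICIT_CONTEXTS = {
    "application/activity+json": "https://www.w3.org/ns/activitystreams",
}


def with_implicit_context(media_type, body):
    """Parse body as JSON and add the media type's implicit @context if absent."""
    doc = json.loads(body)
    # Ignore media type parameters such as "; charset=utf-8".
    essence = media_type.split(";")[0].strip().lower()
    default = IMPLICIT_CONTEXTS.get(essence)
    if default is not None and "@context" not in doc:
        doc = {"@context": default, **doc}
    return doc
```

A consumer that only sees application/ld+json gets the document unchanged; the extra step is triggered solely by the custom media type.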

The Verifiable Claims WG is likely to end up here also if there are processing requirements beyond the foundation of "pure" JSON-LD processing.

Media types are what signal how the contents of a response should be processed. Therefore, because the WPUB specification includes a "processing the manifest" algorithm, which goes beyond simply consuming the JSON as JSON or JSON-LD, the WPUB manifest needs to have its own media type.

I believe that both the ActivityStreams and the WoT examples are different. What we define as the 'authored manifest' is bona fide JSON-LD. It is also JSON-LD that abides by schema.org (except where schema.org has no suitable terms, in which case we have to add our own).

What I meant by

If we do this, we cut ourselves off from the JSON-LD world.

is that if we use a different media type, that means that our idiom for, e.g., the embedded manifest:

<script id="example_manifest" type="application/ld+json">
{
    …
}
</script>

has to use the new media type instead of application/ld+json, in which case schema.org processors will not extract the relevant information. I.e., we could not rely on schema.org for search, which was the reason to use JSON-LD in the first place. We could just as well drop JSON-LD altogether.
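To make the concern concrete, here is a minimal Python sketch of a consumer that selects embedded data by the script element's type attribute. It is not how any particular schema.org processor is implemented; `ScriptExtractor` and `extract_json` are hypothetical names, and application/wpub+json is only the media type proposed in this issue.

```python
import json
from html.parser import HTMLParser


class ScriptExtractor(HTMLParser):
    """Collect the text of <script> elements whose type equals wanted_type."""

    def __init__(self, wanted_type):
        super().__init__()
        self.wanted_type = wanted_type
        self._capturing = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == self.wanted_type:
            self._capturing = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._capturing:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._capturing = False


def extract_json(html, wanted_type="application/ld+json"):
    """Return the parsed JSON of every matching script element."""
    parser = ScriptExtractor(wanted_type)
    parser.feed(html)
    parser.close()
    return [json.loads(block) for block in parser.blocks]
```

A consumer written this way finds nothing once the type attribute changes, even though the JSON inside the element is identical.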

The 'canonical manifest' is also a bona fide JSON-LD; furthermore, a canonical manifest can also be used as an authored manifest (i.e., any 'canonical manifest' is a valid 'authored manifest'). The canonical manifest is more of an implementation specification tool, a way of specifying what exactly a WPUB processor must do to correctly interpret the data expressed in the authored manifest. It is not really exposed to the outside world except if the author decides to use a fully expanded canonical manifest in their authored manifest (which is valid).

I do not see what a separate media type would bring us.

P.S. If the media type spec allowed something like application/wpub+ld+json, then we could use that. Alas, this is not allowed...

Actually, the canonicalization algorithm defines more precisely what, I would think, schema.org processors do behind the scenes, too. There are a number of terms in schema.org, for example, where the value can be a single value or an array of values and the processors use an array at the end of the day.
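The "single value or array" normalization can be sketched as follows. This is a hedged Python illustration, not the WPUB algorithm itself: `as_list`, `normalize_arrays`, and the choice of which terms to treat as arrays are assumptions made for the example.

```python
def as_list(value):
    """Wrap a single value in a list; leave lists as they are."""
    if value is None:
        return []
    return list(value) if isinstance(value, list) else [value]


def normalize_arrays(manifest, array_terms):
    """Return a copy of manifest in which every listed term holds a list."""
    out = dict(manifest)
    for term in array_terms:
        if term in out:
            out[term] = as_list(out[term])
    return out
```

For example, `normalize_arrays({"author": "Herman Melville"}, ["author"])` yields `{"author": ["Herman Melville"]}`, and running the function again on that result changes nothing, which is the property that lets an already-normalized manifest pass through untouched.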

One possibility would be to look at whether the canonicalization algorithm could be expressed through some clever tricks in contexts, relying on the output of the JSON-LD expansion and maybe framing algorithms. That may cover most of the canonicalization steps, although there may be some features that could not be expressed that way and would therefore be pushed back into the definition of the authored manifest (making it a bit more complex for users). Although we never explored this in full detail, such an approach was rejected in the past: the implementation feedback was that reading systems would not incorporate full-blown JSON-LD processing (which is far more complicated), and that it was better to spell out those portions that are relevant for this specification.

The 'canonical manifest' is also a bona fide JSON-LD; furthermore, a canonical manifest can also be used as an authored manifest (i.e., any 'canonical manifest' is a valid 'authored manifest').

It's true they are both "bona fide JSON-LD." However, the issue is that they are not the same JSON-LD--so you end up with potentially two distinct graph structures...and the canonical one has "injected" statements/triples.

Actually, the canonicalization algorithm defines more precisely what, I would think, schema.org processors do behind the scenes, too.

The Google Structured Data Testing Tool (at least) does modify the incoming graph and introduce non-JSON-LD expressed assumptions about what was expressed--i.e. everything becomes "type": "Thing", some properties/statements are "required," etc. All of that is beyond JSON-LD...and it's not (to my knowledge) actually defined or expressed anywhere officially...

The WPUB canonicalization process is indeed the same. It takes a valid JSON-LD input and turns it into different JSON-LD for internal processing (and possible output). So, while I agree that JSON-LD is a valid mime type for either of these "instantiations" of the manifest, I do believe there needs to be some way to signal that additional processing will be (and I believe in some cases MUST be) done before it's an actual WPUB manifest. All of that seems a bit beyond just "profiling" also...

One possibility would be to look at whether the canonicalization algorithm could be expressed through some clever tricks in contexts, relying on the output of the JSON-LD expansion and maybe framing algorithms.

If framing is indeed an option and if it can be somehow expressed along with the media type, and in a way that schema.org processors won't choke on or ignore, then we may have found a solution. 😃

the implementation feedback was that reading systems would not incorporate full-blown JSON-LD processing (which is far more complicated),

An existing processing system is far less complicated than one that hasn't yet been written. 😁 So if a processing/altering/canonicalizing step is still required, we're in the same "require processing" boat...but now have a new set of processing that must take place (i.e. canonicalization) before that data can be considered fully valid/hydrated. So...again...we end up with two manifests...which don't (typically) match.

Lastly, your point about Schema.org and SEO is valid. However, since the manifest may not even be in the page, the SEO merits of it are suspect (at least if not embedded)--see also https://github.com/w3c/wpub/issues/327#issuecomment-473424746

Underlying concerns include:

feeding partial, incomplete graphs to SEO and "canonicalized"/differing data to Reading Systems
populating publishing knowledge graphs with two differing sets of statements/graphs for the same publication
implying that "authored manifests" are somehow equivalent to the post-processed/"canonicalized" variants

So, the summary then is that if the canonicalization process is required to properly interpret a manifest (i.e. the graph/data model output would otherwise be incorrect or insufficient), then authored manifests (at least) MUST have their own media type because they require additional, non-JSON-LD related processing.

Consequently, canonicalization begins to sound like something done by a tool pre-publication and not at consumption/run-time.

@BigBlueHat I would prefer we try to find some time in Cambridge (at the F2F) to discuss this; I think it would be more fruitful (it is a complex issue). @TzviyaSiegman @wareid @GarthConboy: can we do that?

Only one comment on your remark:

Lastly, your point about Schema.org and SEO is valid. However, since the manifest may not even be in the page, the SEO merits of it are suspect (at least if not embedded)--see also #327 (comment)

Note that there is a resolution of #327, which is reflected in section 3.3.3 of the current draft:

It is RECOMMENDED to embed the manifest in the primary entry page.

You are right that this is not a MUST, but it is almost one. That means the SEO issue must be taken extremely seriously.

An area far from my expertise, but a slot on the F2F agenda SGTM... maybe I'll get smart.

This issue was discussed in a meeting.

RESOLVED: the rel="publication" discovery mechanism will be what signals the need for canonicalization/processing
Manifest files need their own MIME Media Type
Wendy Reid: https://github.com/w3c/wpub/issues/409
: be discussed today or tomorrow.
Benjamin Young: https://http.cat/409
Benjamin Young: Create a mime type for manifest files
… have operational set of actions
… convert from authored manifest to canonical manifest
… user needs
… beyond json.parse
… beyond graph representation
… 2 expressed formats
… operationally different
… so if people implement canonicalization process
… we need a new media type
… wpub+json or some such
… as activity streams people did
… beyond json-ld
… needed their own media type
… we should do the same for both authored and canonical
Ivan Herman: This is the issue about which we say “specification purity less important than good of community”
… the authored manifest; if not using ld+json media type
… then will be ignored by schema.org processors
… killing its raison d'être
… should not touch MT
… could add profile
… for whatever reason
… we could decide to give a different MT to canonical manifest
… but CM can be used as AM
… same format
… same data
… so should not be different MT
… strictly speaking CM and AM have different RDF representations
… but that is specification purity
… backfire on practicality
… schema.org processors say something is a URI or stream
… we accept the lack of purity
… we should not touch
Benjamin Young: profile = does not solve the issue
… for schema.org; it is ignored
… jsonld.js going into Chrome Lighthouse
… so they use json-ld going forward
… if we don’t go through some process
… they are equivalent in doc; but not really
… so pub has different states of meaning
… authored v consumed
… If Wiley takes Moby Dick as authored get one result
… through canonicalization has different meaning
… could do what schema.org does
… but how does an implementor know?
Tzviya Siegman: Is there a way to end the stalemate?
Ivan Herman: Say the authored manifest must use schema.org creative work
… or a subtype thereof
… could define a separate type and demand that it is added to the manifest
… we signal it is not just a creative work
… also a web pub
… needs canonicalization to get web pub features
… the type is an array of types
… AB, VBs
… this works and answers concerns
Laurent Le Meur: I fear it would be an abuse of the mechanism of context types in schema.org
… used to indicate properties within a structure
Ivan Herman: It’s an RDF type… no more
Tim Cole: schema.org defines additional type property
… can be used for this
… make sure schema.org understands
… could do an extension
… as long as not primary
Ivan Herman: A subtype of creative work?
Tim Cole: An extension
… creative type by inheritance
… external vocabulary
Ivan Herman: It is a schema.org syntactic hack
Benjamin Young: This came from canonicalization
… not to express more
… VC has a processing model
… an intended use for data models
… equivalent to using JSON-LD parser
… but we have two types: AM and CM
… a consumer does not know what you have
… you are left wondering
… it may be a question of who runs canonicalization
… publisher does not want messy author thing
… we want a canonicalized thing
… developer won’t know
… will consume messy thing wrong
Ivan Herman: Your solution works in an ideal world
… too high for publishers
… need to lower the bar
… (except Wiley)
… there are self-publishers, etc.
… we want a simple manifest
… requiring canonical manifest not realistic
Wendy Reid: I don’t hear the conclusion
… can the participants work it out?
Benjamin Young: Developers can be smaller than Wiley
… but the technology does not say when to use CM
… signal what processing to do
… today; nothing that distinguishes
… no clarity about process
… different from structured data testing tool
… need to signal when to execute
George Kerscher: Does a wpub check resolve this problem?
Benjamin Young: “The tools will save us”
Matt Garrish: You always run the canonicalization
… but maybe nothing to do to AM if everything is already there
… don’t bypass
Ivan Herman: Can clarify doc to say “when reading system turns AM into abstracted web idl, in that process it canonicalizes the manifest and converts to JS classes”
… if AM is complete, then canonicalization is the empty set
Benjamin Young: The way you know to run that is link rel
… SEO bots will get something different
… the AM output
… which will differ and may not be found
Romain Deltour: To George’s question
… epub checking very different
… on web, content is not validated
… don’t require valid content
… so future web pub checker cannot be used this way
… just a lint
… user agents won’t request content
George Kerscher: You can require consistency
Romain Deltour: But you can have content fail; OK for the web
Wendy Reid: Do not see consensus
… need working the issue + referee
Benjamin Young: Ivan and Matt have pointed out that canonicalization only targets wpub processors
… they are looking for the rel relationship and abstracting it
… so we are ok
… seo bots and post processors will be confused, but that’s ok
… we can close the issue
… do not need a media type
Wendy Reid: Can you formalize that proposal
… (Issue #44 is for tomorrow)
Proposed resolution: the rel="publication" discovery mechanism will be what signals the need for canonicalization/processing (Benjamin Young)
Ivan Herman: +1
Wendy Reid: +1
Tim Cole: +1
Matt Garrish: +1
Marisa DeMeglio: +1
David Stroup: +1
Tzviya Siegman: +1
Deborah Kaplan: +1
Benjamin Young: +1
Nellie McKesson: +1
Dave Cramer: just make it stop!
Rachel Comerford: +1
Romain Deltour: +1
Resolution #1: the rel="publication" discovery mechanism will be what signals the need for canonicalization/processing
Wendy Reid: So resolved
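
As an illustration of the resolution above, a WPUB-aware consumer could decide whether to run canonicalization by looking for the rel="publication" link instead of a media type. The Python sketch below is an assumption-laden example, not text from the specification: `PublicationLinkFinder`, `find_manifest_reference`, and `needs_canonicalization` are hypothetical names, and the fragment href pointing at an embedded manifest is assumed for the example.

```python
from html.parser import HTMLParser


class PublicationLinkFinder(HTMLParser):
    """Record the href of the first <link> whose rel contains "publication"."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # rel is a space-separated token list; compare case-insensitively.
        rels = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "publication" in rels and self.href is None:
            self.href = attributes.get("href")


def find_manifest_reference(html):
    """Return the manifest reference from the page, or None if there is none."""
    finder = PublicationLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.href


def needs_canonicalization(html):
    """A page signals WPUB processing only through the rel="publication" link."""
    return find_manifest_reference(html) is not None
```

A generic JSON-LD or schema.org consumer never looks at this link, so it keeps treating the embedded data as plain application/ld+json, which is the trade-off accepted in the discussion.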

w3c / wpub

Manifest files need their own MIME Media Type (because canonicalization) #409