w3c / wpub

W3C Web Publications
https://w3c.github.io/wpub/

Manifest files need their own MIME Media Type (because canonicalization) #409

Closed BigBlueHat closed 5 years ago

BigBlueHat commented 5 years ago

There is additional processing (beyond JSON, beyond JSON-LD) required for generating a canonical manifest. Consequently, a new MIME media type is needed to signal that this unique processing is required.

Additionally, the processing steps required for canonicalization alter the meaning of the original JSON-LD. The result is a different graph containing new statements from those in the authored manifest, so a simple JSON-LD "profile" would be insufficient (as profiles can't alter the encoded document).

So, something like application/wpub+json would be best as it signals the foundational format (JSON) as well as the short name of the W3C spec which describes the required processing steps.

The WPUB spec could also declare the default @context value for the media type as done by ActivityStreams in their IANA submission.

This also has the benefit of opening the door to profiling the manifest format more clearly via the profile parameter with a value pointing to an extended context or new definition document.
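As a rough sketch of how a consumer might read such a media type and its profile parameter, using only the Python standard library. Note that `application/wpub+json` is the name proposed in this comment, not a registered type, and the profile URL is a made-up placeholder.

```python
from email.message import Message
from typing import Dict, Optional, Tuple


def parse_media_type(header_value: str) -> Tuple[str, Dict[str, str]]:
    """Split a Content-Type header value into (media type, parameters)."""
    msg = Message()
    msg["Content-Type"] = header_value
    media_type = msg.get_content_type()  # lowercased "type/subtype"
    # get_params() returns [(type/subtype, ""), (name, value), ...];
    # drop the first entry and keep only the real parameters.
    params = dict(msg.get_params(failobj=[])[1:])
    return media_type, params


def structured_suffix(media_type: str) -> Optional[str]:
    """Return the structured syntax suffix ("json" for "x/y+json"), if any."""
    subtype = media_type.split("/", 1)[1]
    return subtype.rsplit("+", 1)[1] if "+" in subtype else None


# Hypothetical header value; the profile URL is a placeholder.
media_type, params = parse_media_type(
    'application/wpub+json; profile="https://example.org/my-profile"'
)
```

Here the `+json` suffix tells a generic client that the body is JSON, while the full media type and the optional `profile` value tell a WPUB-aware client which additional processing applies.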

iherman commented 5 years ago

This would create significant problems. For an embedded manifest, the JSON-LD spec requires the application/ld+json media type. Schema.org follows this rule (rightfully so) and, for example, the Structured Data Testing Tool refuses to accept a script element with any MIME type other than application/ld+json.

If we do this, we cut ourselves from the JSON-LD world.

BigBlueHat commented 5 years ago

Schema.org does not currently define an additional processing model beyond JSON-LD, so they don't need a separate media type.

> If we do this, we cut ourselves from the JSON-LD world.

ActivityStreams introduced the application/activity+json media type because there were additional processing steps required to deal with non-JSON-LD compatible content such as the in-lining of GeoJSON lists-of-lists.

The Web of Things WG is also defining a separate media type for their JSON format because they have a required transformation step to make it valid JSON-LD.

In both these cases, there is a foundational JSON format that requires a processing step in addition to or prior to consuming the content as "pure" JSON-LD. However, both specifications state that using their custom media type comes with the requirement of an implicit JSON-LD @context value.

The Verifiable Claims WG is likely to end up here also if there are processing requirements beyond the foundation of "pure" JSON-LD processing.

Media types signal how the contents of a response should be processed. Therefore, because the WPUB specification includes a "processing the manifest" algorithm, which goes beyond simply consuming the JSON as JSON or JSON-LD, the WPUB manifest needs to have its own media type.
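A minimal sketch of that argument: a consumer picks its processing steps from the media type alone. The step names are illustrative labels only, and `application/wpub+json` is the type proposed in this issue, not a registered one.

```python
from typing import List

# Hypothetical pipelines: the step names are labels for illustration only.
PIPELINES = {
    "application/json": ["parse JSON"],
    "application/ld+json": ["parse JSON", "JSON-LD processing"],
    # Proposed in this thread, not a registered media type.
    "application/wpub+json": ["parse JSON", "WPUB manifest processing"],
}


def select_pipeline(media_type: str) -> List[str]:
    """Return the processing steps a consumer would run for a media type."""
    media_type = media_type.strip().lower()
    if media_type in PIPELINES:
        return list(PIPELINES[media_type])
    # Unknown type with a "+json" structured syntax suffix:
    # a generic client can still fall back to plain JSON handling.
    if media_type.endswith("+json"):
        return list(PIPELINES["application/json"])
    raise ValueError(f"no processing model known for {media_type!r}")
```

With `application/ld+json` as the only signal, nothing in the response tells the consumer that a WPUB-specific step exists; that is the gap a dedicated media type would close.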

iherman commented 5 years ago

I believe that both the ActivityStreams and the WoT examples are different. What we define as the 'authored manifest' is bona fide JSON-LD. It is also JSON-LD that abides by schema.org (unless there are no suitable terms, in which case we have to add our own).

What I meant by

> If we do this, we cut ourselves from the JSON-LD world.

is that if we use a different media type, that means that our idiom for, e.g., the embedded manifest:

<script id="example_manifest" type="application/ld+json">
{
    …
}
</script>

has to use the new media type instead of application/ld+json, in which case schema.org processors will not extract the relevant information. I.e., we could not rely on schema.org for search, which was the reason to use JSON-LD in the first place. We could just as well drop JSON-LD altogether.
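To make the concern concrete, here is a sketch of how a generic structured-data extractor might pick up embedded manifests. It assumes the extractor matches only on the script element's type attribute; the class and function names are hypothetical, and real schema.org processors may behave differently.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json">."""

    def __init__(self, accepted_type: str = "application/ld+json"):
        super().__init__()
        self.accepted_type = accepted_type
        self.blocks = []
        self._buffer = None  # list of text chunks while inside a match

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == self.accepted_type:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._buffer = None


def extract_json_ld(html: str) -> list:
    parser = JsonLdScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

An extractor written this way silently skips a script element typed with any other media type, which is why changing the type of the embedded manifest would hide it from such tools.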

The 'canonical manifest' is also a bona fide JSON-LD; furthermore, a canonical manifest can also be used as an authored manifest (i.e., any 'canonical manifest' is a valid 'authored manifest'). The canonical manifest is more of an implementation specification tool, a way of specifying what exactly a WPUB processor must do to correctly interpret the data expressed in the authored manifest. It is not really exposed to the outside world except if the author decides to use a fully expanded canonical manifest in their authored manifest (which is valid).

I do not see what a separate media type would bring us.

P.S. If the media type spec allowed something like application/wpub+ld+json, then we could use that. Alas, this is not allowed...

iherman commented 5 years ago

Actually, the canonicalization algorithm defines more precisely what, I would think, schema.org processors do behind the scenes, too. There are a number of terms in schema.org, for example, where the value can be a single value or an array of values and the processors use an array at the end of the day.
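The single-value-or-array case can be sketched as follows. This is only an illustration of the idea, not the specification's algorithm; the set of array-valued terms passed in is an assumption made for the example.

```python
def as_array(value):
    """Wrap a single value in a list; leave lists as they are."""
    if value is None:
        return []
    return list(value) if isinstance(value, list) else [value]


def normalize_arrays(manifest: dict, array_terms: set) -> dict:
    """Return a copy of the manifest with the given terms forced to arrays."""
    return {
        key: as_array(value) if key in array_terms else value
        for key, value in manifest.items()
    }


# Hypothetical authored manifest and term list, for illustration only.
authored = {"name": "Moby-Dick", "author": "Herman Melville", "inLanguage": ["en"]}
normalized = normalize_arrays(authored, {"author", "inLanguage"})
```

After this step a consumer can always iterate over `author` without first checking whether the author wrote a string or a list.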

A possibility would be to look at whether the canonicalization algorithm could be expressed by some clever tricks in contexts, relying on the output of the JSON-LD expansion and maybe framing algorithms. That may cover most of the canonicalization steps, although there may be some features that could not be expressed that way and would therefore be pushed back into the definition of the authored manifest (making it a bit more complex for users). Although we never explored this in full detail, such an approach was pushed back on in the past: the implementation feedback was that reading systems would not incorporate full-blown JSON-LD processing (which is way more complicated), and that it was better to spell out those portions that are relevant for this specification.

BigBlueHat commented 5 years ago

> The 'canonical manifest' is also a bona fide JSON-LD; furthermore, a canonical manifest can also be used as an authored manifest (i.e., any 'canonical manifest' is a valid 'authored manifest').

It's true they are both "bona fide JSON-LD." However, the issue is that they are not the same JSON-LD--so you end up with potentially two distinct graph structures...and the canonical one has "injected" statements/triples.

> Actually, the canonicalization algorithm defines more precisely what, I would think, schema.org processors do behind the scenes, too.

The Google Structured Data Testing Tool (at least) does modify the incoming graph and introduce non-JSON-LD expressed assumptions about what was expressed--i.e. everything becomes "type": "Thing", some properties/statements are "required," etc. All of that is beyond JSON-LD...and it's not (to my knowledge) actually defined or expressed anywhere officially...

The WPUB canonicalization process is indeed the same. It takes a valid JSON-LD input and turns it into different JSON-LD for internal processing (and possible output). So, while I agree that JSON-LD is a valid mime type for either of these "instantiations" of the manifest, I do believe there needs to be some way to signal that additional processing will be (and I believe in some cases MUST be) done before it's an actual WPUB manifest. All of that seems a bit beyond just "profiling" also...
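A toy illustration of the "two manifests" point. The fallback rule below is hypothetical and is not taken from the specification; it only shows how a canonicalization step can yield a document containing statements the author never wrote.

```python
import copy


def toy_canonicalize(authored: dict, page_title: str) -> dict:
    """Return a canonical copy of the manifest; the input is left untouched."""
    canonical = copy.deepcopy(authored)
    # Hypothetical rule: fall back to the page title when no name is authored.
    canonical.setdefault("name", page_title)
    return canonical


# Placeholder manifest and title, for illustration only.
authored = {"url": "https://example.org/book/"}
canonical = toy_canonicalize(authored, "An Example Book")

# Keys present only in the canonical form are the "injected" statements.
injected = set(canonical) - set(authored)
```

Both dictionaries would serialize as valid JSON, but a client that reads only the authored form never sees the injected `name` statement.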

> A possibility would be to look at whether the canonicalization algorithm could be expressed by some clever tricks in contexts, relying on the output of the JSON-LD expansion and maybe framing algorithms.

If framing is indeed an option, and if it can somehow be expressed along with the media type in a way that schema.org processors won't choke on or ignore, then we may have found a solution. 😃

> the implementation feedback was that reading systems would not incorporate a full-blown JSON-LD processing (which is way more complicated),

An existing processing system is far less complicated than one that hasn't yet been written. 😁 So if a processing/altering/canonicalizing step is still required, we're in the same "require processing" boat...but now have a new set of processing that must take place (i.e. canonicalization) before that data can be considered fully valid/hydrated. So...again...we end up with two manifests...which don't (typically) match.

Lastly, your point about Schema.org and SEO is valid. However, since the manifest may not even be in the page, the SEO merits of it are suspect (at least if not embedded)--see also https://github.com/w3c/wpub/issues/327#issuecomment-473424746

Underlying concerns include:

So, the summary is that if the canonicalization process is required to properly interpret a manifest (i.e. the graph/data model output would otherwise be incorrect or insufficient), then authored manifests (at least) MUST have their own media type because they require additional, non-JSON-LD processing.

Consequently, canonicalization begins to sound like something done by a tool pre-publication and not at consumption/run-time.

iherman commented 5 years ago

@BigBlueHat I would prefer we try to find some time in Cambridge (at the F2F) to discuss this, I think it would be more fruitful (it is a complex issue). @TzviyaSiegman @wareid @GarthConboy: can we do that?

Only one comment on your remark:

> Lastly, your point about Schema.org and SEO is valid. However, since the manifest may not even be in the page, the SEO merits of it are suspect (at least if not embedded)--see also #327 (comment)

Note that there is a resolution of #327, which is reflected in section 3.3.3 of the current draft:

> It is RECOMMENDED to embed the manifest in the primary entry page.

You are right that this is not a MUST, but it is almost one. That means the SEO issue must be taken extremely seriously.

GarthConboy commented 5 years ago

An area far from my expertise, but a slot on the F2F agenda SGTM... maybe I'll get smart.

iherman commented 5 years ago

This issue was discussed in a meeting.