netwerk-digitaal-erfgoed / requirements-datasets

Requirements for datasets
https://netwerk-digitaal-erfgoed.github.io/requirements-datasets/

Describe how to handle compressed distributions #68

Closed coret closed 1 year ago

coret commented 1 year ago

ref: https://twitter.com/markuitheiloo/status/1554838166174965761

When a distribution is gzipped, one could opt for application/gzip as the schema:encodingFormat. But this obfuscates the real content type. HTTP responses can be gzipped when both client and server support it, without needing the application/gzip response type. We advise using a content type that describes the contents of the (compressed) file, such as text/turtle, application/rdf+xml, etc.
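
A minimal sketch of that separation, using Python's standard mimetypes module, which also reports compression as an encoding next to the content type rather than as the type itself. The explicit ".nt" registration is an assumption added for illustration, since not every Python version maps that extension by default.

```python
import mimetypes

# Private registry, so the global mimetypes tables stay untouched.
registry = mimetypes.MimeTypes()
# Assumption: register N-Triples explicitly in case ".nt" is not mapped by default.
registry.add_type("application/n-triples", ".nt")

url = "https://www.openarch.nl/exports/nt/files/gld-20220726.nt.gz"
content_type, compression = registry.guess_type(url)

print(content_type)  # media type of the contents
print(compression)   # compression wrapper, reported separately
```

The content type and the compression come back as two separate values, much like Content-Type and Content-Encoding in an HTTP response.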

Todo: add this advice to the requirements.

EnnoMeijers commented 1 year ago

ref: https://twitter.com/markuitheiloo/status/1554848093593640962

Mark suggested adding dcat:compressFormat. Schema.org mentions the following approach: "For the case of a single file published after Zip compression, the convention of appending '+zip' to the [[encodingFormat]] can be used."

coret commented 1 year ago

I suggest adopting the Schema.org approach and adding the following to the specification of schema:encodingFormat:

When the distribution is compressed, the compression format (e.g. zip, gzip, rar) should be appended to the schema:encodingFormat (e.g. text/turtle+gzip).
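
A small sketch of how a publisher could build such a value. The function name encoding_format is hypothetical, and the sketch simply assumes the "+<compression>" suffix convention proposed here.

```python
from typing import Optional


def encoding_format(media_type: str, compression: Optional[str] = None) -> str:
    """Return the media type, with '+<compression>' appended when compressed."""
    if not compression:
        return media_type
    return f"{media_type}+{compression.lower()}"


print(encoding_format("text/turtle", "gzip"))
print(encoding_format("application/n-triples"))
```

An uncompressed distribution keeps its plain media type, so existing descriptions would not need to change.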

Note: also include in example.

bencomp commented 1 year ago

Doesn't that go against the 'rules' for media types? I believe you should be able to add specificity by inserting xxx+ directly after the /, like application/ld+json to note that something isn't just JSON. Adding +gzip to the end contradicts this. I don't see this suggested at https://schema.org/encodingFormat either.

I would suggest following the DCAT2 spec and using a separate property to indicate the compression format, next to the file format.

coret commented 1 year ago

The +zip suffix (though not +gzip) is suggested in RFC 6839 - Additional Media Type Structured Syntax Suffixes.

Some examples to show possible solutions.

# The format is correct: this is a gzip file. But gzip is only the envelope; we're interested in the N-Triples part.
@prefix schema: <https://schema.org/> .

[] a schema:DataDownload ;
  schema:contentUrl "https://www.openarch.nl/exports/nt/files/gld-20220726.nt.gz" ;
  schema:encodingFormat "application/gzip" .

# Is this valid? Maybe hard for a machine to understand...
[] a schema:DataDownload ;
  schema:contentUrl "https://www.openarch.nl/exports/nt/files/gld-20220726.nt.gz" ;
  schema:encodingFormat "application/gzip", "application/n-triples" .

# Example based on the +gzip addition
[] a schema:DataDownload ;
  schema:contentUrl "https://www.openarch.nl/exports/nt/files/gld-20220726.nt.gz" ;
  schema:encodingFormat "application/n-triples+gzip" .

# Example of using a DCAT property (I'm not a fan of mixing schema.org/Dataset and DCAT)
@prefix dcat: <http://www.w3.org/ns/dcat#> .

[] a schema:DataDownload ;
  schema:contentUrl "https://www.openarch.nl/exports/nt/files/gld-20220726.nt.gz" ;
  schema:encodingFormat "application/n-triples" ;
  dcat:compressFormat "application/gzip" .

bencomp commented 1 year ago

I stand corrected! The RFC indeed specifies such suffixes and the associated Structured Syntax Suffixes registry lists +gzip too.

The third example would be the clearest (now that I know about the standards). The only question I have left is how to encode (g)zipped JSON-LD files? application/ld+json+gzip?
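
For what it is worth, a consumer could split off a trailing compression suffix mechanically, as in the sketch below. The function name split_compression and the set of recognised suffixes are assumptions for illustration. The sketch does not settle whether stacking suffixes like this is valid for media types.

```python
from typing import Optional, Tuple

# Assumption: the compression suffixes this consumer chooses to recognise.
COMPRESSION_SUFFIXES = {"gzip", "zip"}


def split_compression(encoding_format: str) -> Tuple[str, Optional[str]]:
    """Split 'type/subtype+compression' into (media type, compression or None)."""
    base, plus, suffix = encoding_format.rpartition("+")
    if plus and suffix.lower() in COMPRESSION_SUFFIXES:
        return base, suffix.lower()
    # No recognised compression suffix: return the value unchanged.
    return encoding_format, None


print(split_compression("application/ld+json+gzip"))
print(split_compression("application/ld+json"))
```

Only the last suffix is inspected, so the +json in application/ld+json is left alone.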