multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.

Codec proposal: N-Quads (RDF format) #180

Open joeltg opened 4 years ago

joeltg commented 4 years ago

Not sure if this is the right way to bring this up, but I'd like to propose adding a codec for N-Quads files. RDF is the graph data model for the semantic web, and although N-Quads is just one of many RDF serializations, it's commonly regarded as the lowest-level representation with the most regular structure and the least syntactic sugar.

In particular, N-Quads is the output format of the Universal RDF Dataset Normalization Algorithm (URDNA2015) (also brought up in this issue). URDNA2015 is a big deal for the RDF world because it produces a canonical representation: two isomorphic datasets will produce the exact same serialized N-Quads string. That canonical form is required for all the digital signatures work that's starting to happen, and it's a representation that people will commonly want to hash!
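To make the "same dataset, same string, same hash" point concrete, here is a minimal sketch. It is not URDNA2015: it assumes a dataset with no blank nodes, where each statement is already written in canonical N-Quads form, so that sorting and de-duplicating the lines is enough. The helper names and the example IRIs are made up for illustration.

```python
import hashlib


def canonical_nquads_no_bnodes(statements):
    """Return one canonical document for a blank-node-free set of N-Quads lines."""
    unique = {s.strip() for s in statements if s.strip()}
    # Conservative guard: rejects any "_:" at all, even inside a literal.
    if any("_:" in s for s in unique):
        raise ValueError("blank nodes need the full URDNA2015 algorithm")
    # Python compares str values by code point, which is the order assumed here.
    return "".join(s + "\n" for s in sorted(unique))


def sha256_hex(document):
    """Hash the UTF-8 bytes of the canonical document."""
    return hashlib.sha256(document.encode("utf-8")).hexdigest()


A = [
    '<http://example.com/alice> <http://example.com/knows> <http://example.com/bob> .',
    '<http://example.com/bob> <http://example.com/name> "Bob" .',
]
B = list(reversed(A))  # same dataset, different statement order

assert sha256_hex(canonical_nquads_no_bnodes(A)) == sha256_hex(canonical_nquads_no_bnodes(B))
```

Datasets with blank nodes need the real algorithm, since relabelling blank nodes consistently is the hard part that this sketch skips.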

This would also enable a natural interpretation of RDF datasets as IPLD objects, using an IPLD schema for the RDFJS data model with N-Quads as a custom representation.

I see this as a great concrete foundation for bringing the semantic web & decentralized web communities closer together. Is this the kind of codec we're open to adding? Would it be appropriate to open a pull request to table.csv?

joeltg commented 4 years ago

IPLD <-> RDF interop has also been discussed a few times in the past, without concrete results:

rvagg commented 4 years ago

I suspect @mikeal and @vmx will have more mature thoughts about RDF than me, but I'd say that in general multicodec can be used to disambiguate types of objects where any such ambiguity exists. It's not strictly tied to IPLD, although IPLD is a logical consumer of multicodecs. Where something is being transmitted or stored and you want to ensure clarity about what type of thing it is, multicodec should be helpful.

So with that in mind, if you have a use-case where that's applicable, IPLD or not, then an entry in the multicodec table would be a good thing. My preference would be to be adding things where there are concrete examples of them existing in the wild where multicodec could be applied, or at least concrete plans on how they could be applied, but we're taking a fairly relaxed approach to that lately and the idea of explicitly labelling things as "draft" for this purpose is on the cards: https://github.com/multiformats/multicodec/pull/165

Do you see a path to this being used any time soon, or would this be more of a symbolic move for now, saying that multicodec & RDF have potential connectivity?

joeltg commented 4 years ago

I know that I'd use it right away! For the Underlay we're currently storing and referencing lots of N-Quads files as raw objects - including linking to N-Quads files from other N-Quads files using a dweb:/ipld/ URI format (all identifiers in RDF are URIs). One use case we'd really like to pull off is using CAR archives (or something similar) to collect and package all transitively linked files, so we want to be able to tell whether a CID is an N-Quads file, and we want IPLD to know how to traverse its links.
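As a rough illustration of "telling whether a CID is an N-Quads file", the sketch below builds a binary CIDv1 (version varint, codec varint, then a sha2-256 multihash) and reads the codec back out. The `0x55` (raw) and `0x12` (sha2-256) codes are from the multicodec table; `NQUADS_HYPOTHETICAL` is a placeholder picked for this example, not an assigned code, and the function names are invented here.

```python
import hashlib

SHA2_256 = 0x12                  # multihash code for sha2-256
RAW = 0x55                       # multicodec code for raw bytes
NQUADS_HYPOTHETICAL = 0x300001   # placeholder only, NOT an assigned code


def encode_varint(n):
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    if n < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, offset=0):
    """Return (value, next_offset) for the varint starting at offset."""
    value = 0
    shift = 0
    for i in range(offset, len(data)):
        value |= (data[i] & 0x7F) << shift
        if not data[i] & 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")


def make_cid_v1(codec, content):
    """Build binary CIDv1 bytes: version, codec, multihash(sha2-256)."""
    digest = hashlib.sha256(content).digest()
    multihash = encode_varint(SHA2_256) + encode_varint(len(digest)) + digest
    return encode_varint(1) + encode_varint(codec) + multihash


def cid_codec(cid):
    """Read the content codec out of binary CIDv1 bytes."""
    version, pos = decode_varint(cid)
    if version != 1:
        raise ValueError("only CIDv1 carries an explicit codec")
    codec, _ = decode_varint(cid, pos)
    return codec


cid = make_cid_v1(NQUADS_HYPOTHETICAL, b'<http://example.com/s> <http://example.com/p> "o" .\n')
assert cid_codec(cid) == NQUADS_HYPOTHETICAL
```

With everything stored as raw, the same check would return `0x55` for every file, so a packaging tool could not tell N-Quads files apart from any other bytes.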

mikeal commented 4 years ago

Is there utility you’d get out of an IPLD representation beyond raw though? My understanding is that links in this format are not addressed by hash, so there’s no way to represent them as links in IPLD, so you’re never actually going to get a graph for this format even if there’s a codec.

The only thing a codec would give you is a Data Model representation of the file format (for this it would just be JSON types). But you’d have to ensure the serialized representation stays below the block size limit (1 MB), which is going to be hard: because the format doesn’t link by hash, you don’t have a way to link between blocks in IPLD to handle N-Quads files that are larger than the limit.

That said, if you can get some utility out of it there’s no real barrier to adding the codec as long as we document these constraints, I’d just caution against using it if you’re going to be encoding large data structures this way.

joeltg commented 4 years ago

> Is there utility you’d get out of an IPLD representation beyond raw though?

Yes! It would give us a way of referencing individual quads in a dataset (using integer index paths), which we want to do for tracing provenance. There's no widely accepted method for doing this in the RDF world right now.
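A minimal sketch of what integer index paths could look like, assuming the file is decoded into a flat list of quads and a path has the form `<index>/<position>`. The path syntax, the function names, and the simplified regex tokenizer are all assumptions made for this example; a real codec would use a complete N-Quads parser.

```python
import re

# Simplified term matcher: IRI, blank node, or literal with optional datatype/language.
TERM = re.compile(r'<[^>]*>|_:[^\s]+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?')
POSITIONS = ("subject", "predicate", "object", "graph")


def parse_nquads(text):
    """Decode N-Quads text into a flat list of quads (dicts of term strings)."""
    quads = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        terms = TERM.findall(line)
        if len(terms) not in (3, 4):
            raise ValueError("not a valid statement: " + line)
        terms += [None] * (4 - len(terms))  # default graph -> None
        quads.append(dict(zip(POSITIONS, terms)))
    return quads


def resolve(quads, path):
    """Resolve '<index>' to a whole quad, or '<index>/<position>' to one term."""
    index, _, position = path.partition("/")
    quad = quads[int(index)]
    return quad[position] if position else quad


DOC = """<http://example.com/a> <http://example.com/knows> <http://example.com/b> <http://example.com/g> .
<http://example.com/a> <http://example.com/name> "Alice"@en .
"""

quads = parse_nquads(DOC)
assert resolve(quads, "1/object") == '"Alice"@en'
assert resolve(quads, "0/graph") == "<http://example.com/g>"
```

Indexes are only stable if the statement order is fixed, which is one more reason to store the canonicalized form.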

You're right that the graph structure (what nodes are connected by what edges) won't be directly represented in IPLD, but it couldn't be even if we tried, since RDF is a directed labelled multigraph (i.e. possibly containing cycles).

I understand that codecs are a different abstraction level than the IPLD data model, and that there would have to be different representation strategies for datasets over 1 MB, but I still see this as having real utility as a building block for people working to decentralize RDF.

jonnycrunch commented 4 years ago

@joeltg I went down the RDF-over-IPLD path and ran into the fact that RDF graphs contain cycles and thus wouldn't be a good fit for IPLD.

joeltg commented 4 years ago

@jonnycrunch the IPLD data model representation of an N-Quads file wouldn't represent the dataset "directly" by having nodes be maps and edges be keys like in JSON-LD, it would represent the dataset at the lower-level RDFJS Data Model, as a flat array of quads.

IPLD data model stuff could be its own conversation; this issue is just about getting an N-Quads multicodec.

vmx commented 4 years ago

Multicodecs describe a lot. We started to put them into categories. One of them is "ipld" to describe codecs that make sense within the IPLD ecosystem. I don't think it's written down anywhere, but I think formats in that category need to support at least Links. Obviously that's not the case for N-Quads.

So we could put it into another category. Then it would be just an identifier of how things are encoded. I think it would be OK to add such a code, but it won't add much value to IPLD. IPLD might link to an N-Quads file, but that would always be the end of the traversal (a sink), just like the raw codec.

OR13 commented 4 years ago

This is very interesting... I did some related CBOR work here:

https://github.com/transmute-industries/decentralized-cbor

in particular, I represent ZLIB_Compressed_NQuads as CBOR... providing compressed representation for JSON-LD with bi-directional transformation between CBOR and JSON-LD....

There is also work in progress of CBOR-LD as well.... (and obviously DAG_CBOR which powers IPLD).

I agree with vmx, N-Quads are the end of pure IPLD, but there is nothing stopping you from leaving IPLD and following them further... for example, across DIDs or URIs in the N-Quads...

IPLD1 -> IPLD2 -> NQuads  -> did:sov:123
                          -> did:ethr:456
                          -> https://public.oracle.example.com/credentials/123
                          -> https://ipfs.io/CID
                          -> IPLD3       

Some DIDs rely on multicodec already like did:key, and obviously any IRI in an N-Quad might rely on multicodec as well.
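A rough sketch of the "leave IPLD and keep following" idea from the diagram above: pull the IRIs out of an N-Quads document and group them by scheme, so that a traversal tool could decide which ones it knows how to follow. The function name, the regexes, and the example data (including the `EXAMPLE_CID` placeholder) are assumptions for illustration only.

```python
import re
from collections import defaultdict

# Quoted literal bodies are blanked first so "<...>" inside text is not treated as an IRI.
LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')
# IRIs in angle brackets, skipping datatype IRIs that follow "^^".
IRI = re.compile(r'(?<!\^\^)<([^>]*)>')


def outbound_links(nquads_text):
    """Group every subject/predicate/object/graph IRI by its URI scheme."""
    groups = defaultdict(set)
    for line in nquads_text.splitlines():
        stripped = LITERAL.sub('""', line)
        for iri in IRI.findall(stripped):
            scheme = iri.split(":", 1)[0].lower()
            groups[scheme].add(iri)
    return {scheme: sorted(iris) for scheme, iris in groups.items()}


DOC = """<did:sov:123> <http://example.com/issued> <https://public.oracle.example.com/credentials/123> .
<did:ethr:456> <http://example.com/cites> <dweb:/ipld/EXAMPLE_CID> .
<did:sov:123> <http://example.com/note> "see <not-a-link>"^^<http://www.w3.org/2001/XMLSchema#string> .
"""

links = outbound_links(DOC)
assert links["did"] == ["did:ethr:456", "did:sov:123"]
assert links["dweb"] == ["dweb:/ipld/EXAMPLE_CID"]
```

Only the `dweb:/ipld/` entries would lead back into IPLD; resolving the `did:` and `https:` entries is up to whatever resolver the application brings along.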