multiformats / multicodec

Compact self-describing codecs. Save space by using predefined multicodec tables.
MIT License
334 stars 200 forks source link

Add JSON form of the table #305

Open rvagg opened 1 year ago

rvagg commented 1 year ago

I can't find the issue(s) where we discussed this but there's been a proposal on the table for a while that we represent the multicodec table as JSON for easier consumption and more flexibility. I'd like to switch it so that the CSV is generated from the JSON and people contribute to the JSON. A JSON table would let us add more items, like https://github.com/multiformats/multicodec/issues/297, and longer descriptions, and overall much more flexibility for entries and ease of downstream consumption.

This needs a PR to propose the format. Does it need to be line-delimited JSON? Can it be fully pretty-printed? What does linting look like? What does CSV generation look like?

Anyone want to give this a go?

zzo38 commented 1 year ago

I prefer to use CSV, and think it is simpler for many cases. (I do use the CSV myself.) However, it is also true that in some programming langauges, JSON will be better, so having both can be helpful.

One problem with using JSON for contribution is the lack of a trailing comma, meaning that the records do not all have the same format. Another note about JSON is that JSON does not properly have 64-bit integers; although JavaScript has a integer type now, JSON predates the integer type in JavaScript.

If JSON is implemented, ensure that the CSV is generated in the same format it is now (including spacing, although a column may be made wider if necessary), possibly adding additional columns at the end if such a thing is necessary (which it might or might not be). The CSV should be plain ASCII (do not use any non-ASCII characters), and should not have any quotation marks; if you convert JSON to CSV, ensure that this is the case, even though the JSON will definitely include quotation marks, and possibly non-ASCII characters. (If it includes long descriptions, they will need to somehow be truncated, or maybe it is better to have a separate field for a short description.)

If you do want to switch to JSON like that, you will first need to convert CSV to JSON first, so that it will all be JSON, and then you can convert JSON to CSV. (Then, you can check that the output matches the original, that the converter does not have a bug. Fortunately, you have a version control system, so you can revert it if something goes wrong.)

The program will have to check that it is valid. If the JSON includes the varint code (as suggested in #297), then it will have to check that the varint code matches the numeric code. This should be easily enough to implement.

(Some programs might use the numeric codes directly, such as the hash.c and hash.h files in Free Hero Mesh, where they are used for defining constants with names such as HASH_SHA3_256 (defined as 0x16) and passed as arguments to function calls. Passing numbers will be more efficient than passing strings to identify hash algorithms where they need to specified (in a function call or something else), and using multicodec numbers will allow the same numbers to be used in many programs, in case that is helpful sometimes. So, both numeric codes and varint codes will be in use.)

BigLep commented 1 year ago

2023-01-03 IPLD triage conversation: @mriise : just curious if this would be something you'd be interested in picking up?

rvagg commented 1 year ago

Here's a rough outline that in my head from the various evolved conversations about this:

[
  { "name": "identity", "tag": "multihash", "code": "0x00", "varint": "0", "status": "permanent", "description": "raw binary" },
  { "name": "cidv1",    "tag": "cid",       "code": "0x01", "varint": "1", "status": "permanent", "description": "CIDv1",     "ref": [ "https://github.com/multiformats/cid" ] },
  { "name": "cidv2",    "tag": "cid",       "code": "0x02", "varint": "2", "status": "draft",     "description": "CIDv2",     "ref": [ "https://github.com/multiformats/cid" ] },
  { "name": "cidv3",    "tag": "cid",       "code": "0x03", "varint": "3", "status": "draft",     "description": "CIDv3",     "ref": [ "https://github.com/multiformats/cid" ] },
  { "name": "ip4",      "tag": "multiaddr", "code": "0x04", "varint": "4", "status": "permanent" }
]
zzo38 commented 1 year ago

That looks like to me that it can work, that it seems good enough (the lack of a trailing comma is a bit messy because now the records do not all have the same format, although I suppose there is not a satisfactory way to avoid that). The reference seems a good idea, too.

However, please ensure that the CSV file contains no quotation marks, no commas within the data of a field (commas are only between fields), no non-ASCII characters. (If the JSON contains strings with commas, quotation marks, and/or non-ASCII characters, which would appear in the CSV file, then the conversion would need to somehow change them, when it is being converted to CSV.)

Also, how are those fields encoded? The varint field just stores a single digit number, even though it seems that varint encoding should be encoded just as a sequence of hex codes, e.g. 808001 for the number 16384 (according to the examples in the varint specification)).

RangerMauve commented 1 year ago

One thing to note, might be useful to use ndjsonfor streaming parsing.

https://www.json-to-ndjson.app/

rvagg commented 1 year ago

maybe? ndjson is nice, but makes it pretty inconvenient for just loading the whole lot

I quite like streaming parsing and it might be neat if this gets huge but perhaps consumers would get annoyed they can't just dump it through a standard JSON parser as it is.

One nice thing ndjson would give us is strict enforcement of one-entry-per-line, which is what I'd like to see.

rvagg commented 1 year ago

Initial proposal for some feedback: https://github.com/multiformats/multicodec/pull/311