Open rvagg opened 1 year ago
I prefer to use CSV, and think it is simpler for many cases. (I do use the CSV myself.) However, it is also true that in some programming languages, JSON will be better, so having both can be helpful.
One problem with using JSON for contribution is the lack of a trailing comma, meaning that the records do not all have the same format. Another note about JSON is that JSON does not properly have 64-bit integers; although JavaScript has an integer type now, JSON predates the integer type in JavaScript.
If JSON is implemented, ensure that the CSV is generated in the same format it is now (including spacing, although a column may be made wider if necessary), possibly adding additional columns at the end if such a thing is necessary (which it might or might not be). The CSV should be plain ASCII (do not use any non-ASCII characters), and should not have any quotation marks; if you convert JSON to CSV, ensure that this is the case, even though the JSON will definitely include quotation marks, and possibly non-ASCII characters. (If it includes long descriptions, they will need to somehow be truncated, or maybe it is better to have a separate field for a short description.)
If you do want to switch to JSON like that, you will first need to convert the CSV to JSON, so that it will all be JSON, and then you can convert the JSON back to CSV. (Then, you can check that the output matches the original, to confirm that the converter does not have a bug. Fortunately, you have a version control system, so you can revert it if something goes wrong.)
The program will have to check that it is valid. If the JSON includes the varint code (as suggested in #297), then it will have to check that the varint code matches the numeric code. This should be easy enough to implement.
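A minimal sketch of such a check, assuming the "varint" field holds the unsigned-varint bytes written as a lowercase hex string (the thread has not settled on that representation) and that "code" is a hex string such as "0x16". The names `encode_varint` and `check_record` are hypothetical, not part of any existing tooling.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint:
    7 bits per byte, least significant group first, high bit set
    on every byte except the last."""
    if n < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte
            return bytes(out)


def check_record(record: dict) -> list:
    """Return a list of error messages for one table entry (empty if fine).

    Assumption: record["varint"] is the hex form of the varint bytes.
    """
    errors = []
    code = int(record["code"], 16)  # int() accepts the "0x" prefix with base 16
    expected = encode_varint(code).hex()
    if record.get("varint") != expected:
        errors.append(
            f"{record.get('name')}: varint {record.get('varint')!r} "
            f"does not match code {record['code']} (expected {expected!r})"
        )
    return errors
```

A linter could run `check_record` over every entry and fail if any list of errors is non-empty.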
(Some programs might use the numeric codes directly, such as the hash.c and hash.h files in Free Hero Mesh, where they are used for defining constants with names such as HASH_SHA3_256 (defined as 0x16) and passed as arguments to function calls. Passing numbers will be more efficient than passing strings to identify hash algorithms where they need to be specified (in a function call or something else), and using multicodec numbers will allow the same numbers to be used in many programs, in case that is helpful sometimes. So, both numeric codes and varint codes will be in use.)
2023-01-03 IPLD triage conversation: @mriise : just curious if this would be something you'd be interested in picking up?
Here's a rough outline that's in my head from the various evolved conversations about this:
"ref" field for links to specs):
[
{ "name": "identity", "tag": "multihash", "code": "0x00", "varint": "0", "status": "permanent", "description": "raw binary" },
{ "name": "cidv1", "tag": "cid", "code": "0x01", "varint": "1", "status": "permanent", "description": "CIDv1", "ref": [ "https://github.com/multiformats/cid" ] },
{ "name": "cidv2", "tag": "cid", "code": "0x02", "varint": "2", "status": "draft", "description": "CIDv2", "ref": [ "https://github.com/multiformats/cid" ] },
{ "name": "cidv3", "tag": "cid", "code": "0x03", "varint": "3", "status": "draft", "description": "CIDv3", "ref": [ "https://github.com/multiformats/cid" ] },
{ "name": "ip4", "tag": "multiaddr", "code": "0x04", "varint": "4", "status": "permanent" }
]
That looks to me like it can work and seems good enough (the lack of a trailing comma is a bit messy, because the records then do not all have the same format, although I suppose there is no satisfactory way to avoid that). The reference field seems like a good idea, too.
However, please ensure that the CSV file contains no quotation marks, no commas within the data of a field (commas are only between fields), and no non-ASCII characters. (If the JSON contains strings with commas, quotation marks, and/or non-ASCII characters, which would otherwise appear in the CSV file, then the conversion would need to somehow change them when it is being converted to CSV.)
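One way the conversion could enforce these constraints is sketched below. The replacement policy (drop quotation marks, turn commas into semicolons, replace non-ASCII characters with "?") is only an assumption for illustration; a real converter might instead reject such entries or use a separate short-description field. The column list and the names `sanitize` and `to_csv` are likewise hypothetical.

```python
FIELDS = ["name", "tag", "code", "status", "description"]  # assumed column order


def sanitize(value: str) -> str:
    """Make a JSON string safe for a quote-free, ASCII-only CSV field."""
    value = value.replace('"', "").replace(",", ";")
    # Any character outside ASCII becomes '?'.
    return value.encode("ascii", "replace").decode("ascii")


def to_csv(records: list) -> str:
    """Render records as comma-separated, space-padded columns."""
    rows = [FIELDS] + [[sanitize(r.get(f, "")) for f in FIELDS] for r in records]
    # Width of every column except the last, counting the trailing comma.
    widths = [max(len(row[i]) for row in rows) + 1 for i in range(len(FIELDS) - 1)]
    lines = []
    for row in rows:
        cells = [(cell + ",").ljust(width) for cell, width in zip(row, widths)]
        lines.append(" ".join(cells + [row[-1]]).rstrip())
    return "\n".join(lines) + "\n"
```

Because commas are stripped from the data before joining, every output line has exactly one comma between adjacent fields, so downstream consumers can keep splitting on commas.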
Also, how are those fields encoded? The varint field just stores a single-digit number, even though it seems that a varint should be written as a sequence of hex bytes, e.g. 808001 for the number 16384 (according to the examples in the varint specification).
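To make the 16384 example concrete, here is a small sketch of decoding an unsigned varint from its hex form. It assumes the usual unsigned-varint layout (7 data bits per byte, least significant group first, high bit marking continuation); `decode_varint` is a name chosen for illustration.

```python
def decode_varint(data: bytes) -> int:
    """Decode a single unsigned varint that occupies all of `data`."""
    value = 0
    shift = 0
    for index, byte in enumerate(data):
        value |= (byte & 0x7F) << shift  # low 7 bits carry data
        shift += 7
        if not byte & 0x80:              # high bit clear: last byte
            if index != len(data) - 1:
                raise ValueError("trailing bytes after varint")
            return value
    raise ValueError("truncated varint")


print(decode_varint(bytes.fromhex("808001")))  # 16384
print(decode_varint(bytes.fromhex("04")))      # 4
```

For any code below 0x80 the varint is a single byte equal to the code itself, which may be why the values in the outline look like plain numbers; the two representations only diverge from 0x80 upward.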
One thing to note: it might be useful to use ndjson for streaming parsing.
maybe? ndjson is nice, but makes it pretty inconvenient for just loading the whole lot
I quite like streaming parsing and it might be neat if this gets huge but perhaps consumers would get annoyed they can't just dump it through a standard JSON parser as it is.
One nice thing ndjson would give us is strict enforcement of one-entry-per-line, which is what I'd like to see.
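As a rough illustration of that trade-off, the sketch below loads a line-delimited file in a few lines of Python while rejecting any entry that spans more than one line. The function name `load_ndjson` and the sample data are assumptions for the example only.

```python
import json


def load_ndjson(text: str) -> list:
    """Parse line-delimited JSON, requiring exactly one object per line."""
    records = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            # A pretty-printed entry split over several lines fails here.
            raise ValueError(f"line {lineno}: not a complete JSON value") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        records.append(record)
    return records


sample = (
    '{"name": "identity", "tag": "multihash", "code": "0x00"}\n'
    '{"name": "ip4", "tag": "multiaddr", "code": "0x04"}\n'
)
print(len(load_ndjson(sample)))  # 2
```

Consumers lose the ability to pass the file straight to a standard JSON parser, but the per-line loop is short, and the same loop works on a stream without holding the whole table in memory.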
Initial proposal for some feedback: https://github.com/multiformats/multicodec/pull/311
I can't find the issue(s) where we discussed this but there's been a proposal on the table for a while that we represent the multicodec table as JSON for easier consumption and more flexibility. I'd like to switch it so that the CSV is generated from the JSON and people contribute to the JSON. A JSON table would let us add more items, like https://github.com/multiformats/multicodec/issues/297, and longer descriptions, and overall much more flexibility for entries and ease of downstream consumption.
This needs a PR to propose the format. Does it need to be line-delimited JSON? Can it be fully pretty-printed? What does linting look like? What does CSV generation look like?
Anyone want to give this a go?