oduwsdl / ORS

Object Resource Stream and CDXJ Drafts

Mapping CDXJ <-> NDJSON #3


ikreymer commented 9 years ago

Since there are already parsers for newline-delimited JSON (NDJSON), it may be useful to map CDXJ to this format and vice versa. Specifically, it would be useful to have a well-defined one-to-one mapping between CDXJ and JSON lines.

For example, the pywb cdx server already supports an output=json option (and soon output=cdxj), which can return the same data as JSON.

A CDXJ line looks like this:

com,google)/ 20150125034709 {"url": "http://www.google.com/", "digest": "S3K4ZKZALJ4DB4RL2IQ6D233IW7XXLVO", "length": "7674", "offset": "725344446", "filename": "common-crawl/crawl-data/CC-MAIN-2015-06/segments/1422118059355.87/warc/CC-MAIN-20150124164739-00123-ip-10-180-212-252.ec2.internal.warc.gz"}

while an equivalent JSON line currently looks like this:

{"urlkey": "com,google)/", "timestamp": "20150125034709", "url": "http://www.google.com/", "length": "7674", "filename": "common-crawl/crawl-data/CC-MAIN-2015-06/segments/1422118059355.87/warc/CC-MAIN-20150124164739-00123-ip-10-180-212-252.ec2.internal.warc.gz", "digest": "S3K4ZKZALJ4DB4RL2IQ6D233IW7XXLVO", "offset": "725344446"}

This can be done with the following restrictions:

Perhaps instead of urlkey, it should be @urlkey or @key or some other clearly defined name...

This would then allow for unambiguously converting from JSON back to CDXJ, if needed.
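
For illustration, a minimal Python sketch of such a round trip (assuming the reserved field names urlkey and timestamp from the example above; the function names are just illustrative):

import json

def cdxj_to_json_line(cdxj_line):
    # Split off the two prefix fields; the remainder is the JSON block.
    urlkey, timestamp, block = cdxj_line.rstrip('\n').split(' ', 2)
    record = json.loads(block)
    record['urlkey'] = urlkey
    record['timestamp'] = timestamp
    return json.dumps(record)

def json_line_to_cdxj(json_line):
    record = json.loads(json_line)
    # Pull the designated key fields back out into the sorted prefix.
    return '{0} {1} {2}'.format(record.pop('urlkey'),
                                record.pop('timestamp'),
                                json.dumps(record))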

ibnesayeed commented 9 years ago

Although this is an application-specific issue and has nothing to do with the format and its semantics, we can still discuss it here, since it might help us shape the format better. I would say that if the knowledge of the key fields is available to the application as out-of-band context, then there is no issue in transforming them back and forth. Here are a few points in this regard:

So, the above example can very well be written as:

@keys ["urlkey", "timestamp"]
com,google)/ 20150125034709 {"urlkey": "com,google)/", "timestamp": "20150125034709", "url": "http://www.google.com/", "length": "7674", "filename": "common-crawl/crawl-data/CC-MAIN-2015-06/segments/1422118059355.87/warc/CC-MAIN-20150124164739-00123-ip-10-180-212-252.ec2.internal.warc.gz", "digest": "S3K4ZKZALJ4DB4RL2IQ6D233IW7XXLVO", "offset": "725344446"}
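
For illustration, a rough Python sketch of a reader that uses that out-of-band context (the @keys header form is the one from the example above; the function name is illustrative):

import json

def read_cdxj(lines):
    keys = []
    for line in lines:
        if line.startswith('@keys '):
            # Out-of-band context: remember which fields form the prefix.
            keys = json.loads(line.split(' ', 1)[1])
            continue
        parts = line.rstrip('\n').split(' ', len(keys))
        record = json.loads(parts[-1])
        # Merge the prefix values back in; here this is a no-op, since
        # the example keeps copies of the key fields in the JSON block.
        record.update(zip(keys, parts[:-1]))
        yield record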
ikreymer commented 9 years ago

Yes, you're right that there is no requirement to remove those entries from the JSON dict, but it would very likely be desirable for any application to do so, to save space.

And you're right that it's very application specific. I think it may be useful to think of CDXJ as a transformation of NDJSON with a designated sort key.

For example, given {"surt": "com,example)/", "status": "200", "timestamp": "2015"}, one could create any of these CDXJ lines:

@ {"keys": ["surt", "timestamp"]}
com,example)/ 2015 {"status": "200"}
@ {"keys": ["status"]}
200 {"surt": "com,example)/", "timestamp": "2015"}
@ {"keys": ["timestamp", "status"]}
2015 200 {"surt": "com,example)/"}

and so on...

Hmm, this makes me think that there should be an enforcement of no duplication, e.g. if "timestamp" is used as a key, it can't also appear as a value, since JSON does not allow duplicate keys in a dict...
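
A minimal sketch of such a transformation, where popping the designated keys out of the dict enforces the no-duplication rule (names are illustrative):

import json

def to_cdxj_line(record, keys):
    rec = dict(record)
    # Popping enforces the no-duplication rule: a field used in the
    # prefix cannot also remain as a value inside the JSON block.
    prefix = [str(rec.pop(k)) for k in keys]
    return ' '.join(prefix) + ' ' + json.dumps(rec)

rec = {"surt": "com,example)/", "status": "200", "timestamp": "2015"}
print(to_cdxj_line(rec, ["surt", "timestamp"]))
# com,example)/ 2015 {"status": "200"}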

ikreymer commented 9 years ago

Yeah, the more I think about it, the more it makes sense to think of this format as a transformation on NDJSON. To that effect, backwards compatibility with NDJSON may be desirable, as much as it's possible.

Instead of @, what if there was just leading whitespace to indicate the 'meta' keys?

 {"keys": ["surt", "timestamp"]}
 {"values": ["status"]}
 {}
com,example)/ 2015 {"status": "200"}
com,example)/ 2016 {"status": "404", "other": "data"}

This way, a regular NDJSON parser can read the first two lines. The {} line can be used as an optional separator to indicate the end of the header and the beginning of the data. (Not sure if this is really needed.)

This data has a clear and unambiguous equivalent in NDJSON form:

{"keys": ["surt", "timestamp"]}
{"values": ["status"]}
{"surt": "com,example)/", "timestamp": "2015", "status": "200"}
{"surt": "com,example)/", "timestamp": "2016", "status": "404", "other": "data"}

The original CDX format already has a precedent of using a space to indicate a format header, so this is not too surprising.

ibnesayeed commented 9 years ago

I would strongly oppose using leading spaces to signify the special meta blocks, as boundary spaces are prone to damage and should not be relied upon. Also, making just the meta portion NDJSON-compatible does not buy us anything, because the rest of the document will not be compatible anyway. Finally, having an empty object as a separator is not useful, for two reasons: 1) we don't really need a separating line if we have a way to identify meta lines, whether using @-keys or leading spaces as you propose here; and 2) sorting the file may misplace the separator, or group it with similar entries from the data portion if there are any empty blocks with a leading space (which is perfectly valid data).

The currently used @-keys don't seem to have the issue that we are trying to fix. They offer predictable sorting and provide a mechanism to arbitrarily split and merge the metadata, which may be very useful when data from different sources is combined and metadata needs to be merged. Here is the relevant section of the blog post that talks about it:

The @meta entries describe the aboutness of the resource and other metadata. Multiple entries of the same special keys (that start with an @ sign) should be merged at the time of consuming the document. Splitting them in multiple lines increases the readability and eases the process of updates. This means the two @meta lines can be combined in a single line or split into three different lines each holding "name", "year", and "updated_at" separately. The policy to resolve the conflicts in names when merging such entries should be defined per key basis as suitable. These policies could be "skip", "overwrite", "append" (specially for the values that are arrays), or some other function to derive new value(s).

And here are lines from the example it refers to:

@meta {"name": "Internet Archive", "year": 1996}
@meta {"updated_at": "2015-09-03T13:27:52Z"}
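
To make that merge behavior concrete, here is a hedged Python sketch using the policy names from the quoted passage ("skip", "overwrite", "append"); defaulting to "overwrite" and the dispatch mechanism are assumptions, not part of the spec:

import json

def merge_meta(meta_lines, policies=None):
    policies = policies or {}
    merged = {}
    for line in meta_lines:
        entry = json.loads(line.split(' ', 1)[1])
        for name, value in entry.items():
            policy = policies.get(name, 'overwrite')  # assumed default
            if name not in merged or policy == 'overwrite':
                merged[name] = value
            elif policy == 'append':
                # Accumulate conflicting values into a list.
                if not isinstance(merged[name], list):
                    merged[name] = [merged[name]]
                merged[name].append(value)
            # 'skip' keeps the first value seen
    return merged

lines = ['@meta {"name": "Internet Archive", "year": 1996}',
         '@meta {"updated_at": "2015-09-03T13:27:52Z"}']
print(merge_meta(lines))
# {'name': 'Internet Archive', 'year': 1996, 'updated_at': '2015-09-03T13:27:52Z'}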
ikreymer commented 9 years ago

I am thinking of ways to reduce the spec to the very minimum, separating data from metadata. I see your point about the empty data field, though.

How about just @meta as the only special field? Everything else is data. For linked data stuff, you can use:

@meta {"context": "...", "id": "..."}

Also, I think that the JSON value data should only be a dict {}, so only @meta {} and no @meta [].

If the value is a list, then it breaks the NDJSON equivalency I mentioned above, which I'd like to maintain. A JSON list can be represented as a single value in a dict anyway.

Alternatively, perhaps any field that starts with @ is a metadata field, and the rest is application-dependent.

That way, you can have support for @context and @id in CDXJ-LD, which could be a subset of CDXJ.

ibnesayeed commented 9 years ago

Alternatively, perhaps any field that starts with @ is a metadata field, and the rest is application-dependent.

This is exactly what is intended and described in the blog post. ORS reserves the @id and @context special keys (though they are not mandatory in each file), but any key that starts with @ and is not quoted is considered a special key. Multiple entries of the same special key should be merged at consumption time. ORS derivatives can use this feature to encode many restrictions, extensions, and semantics. An example is presented by the CDXJ extension, which defines special semantics for the @keys entry. We can discuss which special keys are generic enough to be escalated to ORS.

ibnesayeed commented 9 years ago

If the value is a list, then it breaks the NDJSON equivalency I mentioned above, which I'd like to maintain.

I don't see a compelling reason why we would struggle to keep some sort of NDJSON equivalency. NDJSON does not have a metadata provision, and our data portion is way more flexible than NDJSON.

ikreymer commented 9 years ago

I don't see a compelling reason why we would struggle to keep some sort of NDJSON equivalency. NDJSON does not have a metadata provision, and our data portion is way more flexible than NDJSON.

I think the main compelling reason for this format (and the reason someone would use it over NDJSON) is the prefix-sorting capability. If sorting is not needed, it is much better to use an existing format like NDJSON. There are several existing tools that work with JSON data, including NDJSON: https://en.wikipedia.org/wiki/Line_Delimited_JSON#Software_that_supports_Line_Delimited_JSON

jq in particular (https://stedolan.github.io/jq/) is well established and provides various unix-like tools for JSON data, including newline-delimited JSON.

If anyone is going to be doing any custom processing with CDXJ, the easiest solution is to convert it back to NDJSON and then pass it to an existing tool.

The metadata fields can be filtered out as needed by such tools as well.

ibnesayeed commented 9 years ago

I think the main compelling reason for this format (and the reason someone would use it over NDJSON) is the prefix-sorting capability. If sorting is not needed, it is much better to use an existing format like NDJSON. There are several existing tools that work with JSON data, including NDJSON: https://en.wikipedia.org/wiki/Line_Delimited_JSON#Software_that_supports_Line_Delimited_JSON

Sorting is not the only compelling reason; filtering, grouping, and distributing data (such as with MapReduce) are some other examples of what can be done more efficiently in ORS than by making each line a valid JSON object. Pushing the prefix keys back inside the JSON block kills the purpose of this format. NDJSON may be good and "well" supported, as its authors advertise, but it has its limitations where it just can't be used. On the other hand, there are many tools that use ORS-like formats both for generation and consumption. Logentries has something called KVP (Key-Value Pair) that is very similar, except it does not enforce the single-line aspect (but it will happily parse the tighter version). Fluentd, for example, collects logs from various sources and consolidates them by default in an ORS-like format (strict single-line entries), but can be configured for other formats as well. It can then send that data to many other tools for visualization, event notification, and other log analysis activities. I have already mentioned that Docker generates ORS-like logs by default. I don't think there is a pressing need to look for opportunistic NDJSON similarity; in that case, why would we put effort into standardizing yet another format?

jq in particular (https://stedolan.github.io/jq/) is well established and provides various unix-like tools for JSON data, including newline-delimited JSON.

It is a good tool, but the performance is not free. It may be a good tool to search a small file, but when dealing with data streams at scale, tools like this will fail to perform well. Essentially, they require loading the whole file (at once or in stream mode) in order to perform the lookup for every single query. You are missing the point of CDXJ, for example, where we want the lookup keys to be placed outside the JSON block so that we can perform plain-text processing instead of parsing and loading individual objects, which would be a performance nightmare.
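
As an illustration of that kind of plain-text processing (not pywb's actual implementation), here is a sketch that binary-searches a sorted CDXJ file by comparing raw bytes and parses the JSON block only for lines that match the prefix:

import json

def cdxj_lookup(path, prefix):
    key = prefix.encode('utf-8')
    with open(path, 'rb') as f:
        lo, hi = 0, f.seek(0, 2)
        # Bisect on byte offsets; only raw line bytes are compared.
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            if mid > 0:
                f.readline()          # skip the partial line
            line = f.readline()
            if line and line < key:
                lo = mid + 1
            else:
                hi = mid
        f.seek(lo)
        if lo > 0:
            f.readline()              # realign to a line boundary
        for raw in f:
            if not raw.startswith(key):
                break
            # Parse JSON only for matching lines; assumes the first
            # ' {' on the line starts the JSON block.
            fields, block = raw.decode('utf-8').split(' {', 1)
            yield fields.split(' '), json.loads('{' + block)

# e.g. for fields, obj in cdxj_lookup('index.cdxj', 'com,google)/'): ...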

If anyone is going to be doing any custom processing with CDXJ, the easiest solution is to convert it back to NDJSON and then pass it to an existing tool.

I don't think converting back to NDJSON will be the easiest solution. If the values used in the lookup prefixes are desired to be present in the object when loaded, then keep them in the JSON object and surface a copy of their values in the prefix. Now use the prefixes for lookup, and pass the value JSON block to a widely supported natural JSON parser (no need to bring NDJSON into the mix). However, if the tool knows what the key prefix fields are, then it can use those values without requiring duplicate data in the value object. Alternatively, the key values can be injected (merged) into the object created by parsing the value block, as opposed to injecting the key fields back into the marshaled JSON and then parsing it.

ikreymer commented 9 years ago

I think perhaps this is where we disagree... My original intention for this format is to support sorted, line-oriented data (CDX) with a variable number of fields.

Here's how my use case works: when writing out in this format, the internal representation is a stream of JSON dicts (NDJSON), but the prefix is pulled out and written first, for sorting. Same for reading: the prefix is read, then the JSON dict, and the prefix is merged back in. (I'm using a specific prefix, but this can be generalized for generic NDJSON->CDXJ conversion.) If sorting were not an issue, line-delimited JSON would have been used instead, and that would be the end of it!

This is a hybrid format specifically optimized for sorting; using it for anything else will likely be error prone, because multiple formats are mixed on the same line (space-delimited keys and a JSON value).

I certainly do not have the interest (or time) to create custom processing tools for this format! jq and other tools are specifically designed for processing structured JSON data, and standard *nix tools work much better with CSV-style data.

Instead, I would focus on creating well-defined conversions to existing formats:

cat file.cdxj | cdxj-convert --to-ndjson | jq ...
cat file.cdxj | cdxj-convert --to-csv | <*nix tools>
cat file.cdxj | cdxj-convert --to-mrjob-json ... < processing for mrjob input >

If fast tabular processing is needed, you can convert to a CSV-like format with specific columns, then pass it to other tools (though this will not be "lossless", as CDXJ is not tabular, so some data may be dropped).

If JSON structure parsing is needed on the value, then converting to a full JSON object and passing to jq or other JSON processing tools is the best approach.

If key-value processing for Hadoop is needed, then converting to an existing format designed for this, such as the MRJob format, is the best approach.

The conversion options could also specify which value fields to include, and whether just the key, just the value, or both should be used. The conversion tools should be made as flexible as possible, to address all the use cases and to convert to any existing format for additional processing.

This is a domain-specific format designed to solve a specific problem: sorting and merging line-oriented data with a varied number of fields. Other formats are better suited for other use cases and have much better tooling. Thus, we should make the conversion process easy and well defined.

When sorting line-oriented data, it is sometimes useful to ensure certain lines always sort first, and that is the reason for specifying that such lines should start with @; no additional significance should be given to this at the format level, in my opinion.

As for Docker, Fluentd, etc., as mentioned before, I do not know their use cases, and the fact that this format happens to be a superset of their log formats is not a compelling enough reason for a new format. Unless those communities are involved in building another standard format, and tools around that format, I would be very cautious about putting any weight on this argument. If the primary use case is merging and sorting line-oriented log files, then that is already covered. :)

ikreymer commented 9 years ago

If you disagree, please provide some examples, not related to sorting, where processing CDXJ directly has an advantage over converting to an existing format :)

johnerikhalse commented 9 years ago

What about something like this for NDJSON representation?

{"@keys": ["surt", "timestamp"]}
{"_key": ["com,google)/", "20150125034709"], "timestamp": "20150125034709", "url": "http://www.google.com/", "length": "7674", "filename": "file1.warc.gz", "digest": "S3K4ZKZALJ4DB4RL2IQ6D233IW7XXLVO", "offset": "725344446"}

The requirement is that every data line starts with a well-defined JSON key like '_key', which is not allowed as a JSON key for the rest of the data. The value pointed to by '_key' is an array corresponding to the definition in the metadata line. This will be sortable with unix tools and readable by NDJSON tools. As the reserved JSON key '_key' starts with an underscore, it sorts after the metadata lines. One more requirement is that every line must follow the same pattern for whitespace; otherwise sorting will be broken.
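
A rough sketch of such a converter (assuming the @keys header form used earlier in this thread; note that sort_keys together with json.dumps' default separators keeps the whitespace pattern consistent across lines, per the last requirement):

import json

def cdxj_to_key_ndjson(lines):
    keys = []
    for line in lines:
        if line.startswith('@keys '):
            keys = json.loads(line.split(' ', 1)[1])
            # Emit the metadata as its own JSON line; '@' sorts before '_'.
            yield json.dumps({'@keys': keys}, sort_keys=True)
            continue
        parts = line.rstrip('\n').split(' ', len(keys))
        record = {'_key': parts[:-1]}
        record.update(json.loads(parts[-1]))
        # sort_keys puts '_key' before lowercase field names ('_' < 'a').
        yield json.dumps(record, sort_keys=True)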

ikreymer commented 9 years ago

Interesting, yeah, I think this could make a lot of sense. I am less concerned about sorting order in the NDJSON representation than about an unambiguous mapping that allows easy, well-defined conversion.

Unfortunately, there's not really a way to guarantee field order or spacing consistency across different JSON serializers. But that is not essential, as sorting should be done before converting to NDJSON, or, if needed, after converting back to CDXJ following filtering.

This provides such an unambiguous mapping with the _key field, so users could get an NDJSON representation of CDXJ, perform filtering operations using NDJSON tools, and even convert back to CDXJ and sort again (if needed):

cdx-server ... --to-ndjson | jq <filters, process NDJSON> | to-cdxj | sort > filtered.cdxj