Open ikreymer opened 9 years ago
This escaping issue has already been covered briefly in the blog post as follows:
Since the opening square and curly brackets indicate the start of the JSON block, hence it is necessary to escape them (as well as the escape and double quote characters) if they appear in the keys, and optionally their closing pairs should also be escaped.
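To make the quoted rule concrete, here is a minimal sketch of a key-escaping helper. The quote does not name the escape character, so the backslash default is an assumption, and `escape_key` and its `closing` flag are hypothetical names, not part of any ORS or CDXJ tooling.

```python
def escape_key(key, escape="\\", closing=False):
    """Escape characters in a key that could be mistaken for the start of the JSON block.

    Assumption: the escape character is a backslash (not stated in the quote).
    """
    # Characters that must be escaped: the escape itself, the double quote,
    # and the two opening brackets.
    specials = {escape, '"', "{", "["}
    if closing:
        # Escaping the closing brackets is optional per the quoted rule.
        specials |= {"}", "]"}
    return "".join(escape + ch if ch in specials else ch for ch in key)


print(escape_key("a{b"))                # opening brace gets escaped
print(escape_key("x]y", closing=True))  # closing bracket escaped only on request
```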
Let me reiterate: neither ORS nor CDXJ supports objects as keys or multiple objects as values. The prefix key is optional and, if present, can be one or more string tokens, quoted or unquoted. The value portion is exactly one JSON block per line. The value block can be JSON in object or array format with arbitrary nesting depth. The value block can be empty JSON (an empty object or array), but cannot be blank/nil. I hope this resolves all the concerns raised here.
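As a rough illustration of these rules, a line can be split at the first unescaped opening bracket, with everything from there on required to be a single JSON block. This is only a sketch: `split_ors_line` is a hypothetical name, the backslash escape character is an assumption, and quoted keys get no special handling beyond the escaping rule.

```python
import json


def split_ors_line(line, escape="\\"):
    """Split a line into (key, value) at the first unescaped '{' or '['.

    Assumption: backslash is the escape character inside keys.
    """
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == escape:
            i += 2  # skip the escape character and the character it protects
            continue
        if ch in "{[":
            key = line[:i].strip()
            # json.loads raises ValueError unless the rest is exactly one JSON block
            value = json.loads(line[i:])
            return key, value
        i += 1
    # No JSON block at all: the value is blank, which is not allowed.
    raise ValueError("no JSON block found in line")


print(split_ors_line('com,example)/ 20150101 {"status": [200, 301]}'))
print(split_ors_line("[]"))  # no key, empty JSON value
```

A line with only a key and no value block raises `ValueError`, which matches the "cannot be blank/nil" rule.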
Yes, I think so. I bring this up with the default MRJob tab-delimited {...}\t{...} format as a consideration.
At first glance, it would appear to be a compatible subset of ORS, with the key also being a JSON dict, but if escaping { is part of the requirement, it would not be, since the first JSON dict is not escaped in the MRJob format.
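A small sketch of the incompatibility, using a made-up MRJob-style line (the field names are illustrative only). Read the MRJob way, the line splits on the tab into two JSON values. Read under the escaping rule, the first unescaped { starts the value block, so the whole line would have to be one JSON block, which it is not.

```python
import json

# Hypothetical sample line in the {...}\t{...} shape discussed above.
mrjob_line = '{"word": "foo"}\t{"count": 3}'

# MRJob-style reading: split on the first tab, decode each side as JSON.
raw_key, raw_value = mrjob_line.split("\t", 1)
key, value = json.loads(raw_key), json.loads(raw_value)


def is_single_json_block(text):
    """Return True if text is exactly one JSON value with nothing after it."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


# Under the escaping rule the key is empty and the value starts at column 0,
# so the entire line must parse as a single JSON block.
ors_compatible = is_single_json_block(mrjob_line)
print(key, value, ors_compatible)
```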
Additionally, if a data key begins with an @ sign, the key should be quoted.
I am not sure about the reason why MRJob has an object for the key instead of a basic data type in the tuple, but in the current format it is not compatible with ORS. I have expressed my thoughts around it in the email.
For parsing CDXJ/ORS, we need to ensure there is no ambiguity about where the key ends.
Ambiguities can occur if there is a { anywhere in the key. For CDXJ, this is usually avoided, as keys are usually URL-encoded and there are no spaces in URLs. But should this be a requirement? Or should spaces and { be escaped?
For ORS, there is of course the general case of multiple JSON dicts, with other nested JSON dicts.
{"foo": "bar"} {"boo": "baz", "foo2": {"a": {"c": "d"}} {"key": "value", "key2": {"a": "b"}}
Since the value must be a valid JSON dict, it would have to be:
value - {"key": "value", "key2": {"a": "b"}}
key - {"foo": "bar"} {"boo": "baz", "foo2": {"a": {"c": "d"}}
Could get tricky if this is to be supported with a more generic key, though I guess escaping enforcement should help...
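For what it's worth, the "value must be a valid JSON dict" reading above can be sketched without any escaping: try each { from left to right and take the first one whose remainder parses as a single JSON dict. This is only an illustration of that reading, not something the spec prescribes, and `split_on_trailing_dict` is a hypothetical name. It is also quadratic in the worst case, which is part of why enforced escaping looks attractive.

```python
import json


def split_on_trailing_dict(line):
    """Split at the leftmost '{' whose remainder is one valid JSON dict."""
    for i, ch in enumerate(line):
        if ch != "{":
            continue
        try:
            value = json.loads(line[i:])
        except ValueError:
            continue  # remainder is not a single JSON block; try the next '{'
        if isinstance(value, dict):
            return line[:i].rstrip(), value
    raise ValueError("line does not end with a valid JSON dict")


line = '{"foo": "bar"} {"boo": "baz", "foo2": {"a": {"c": "d"}} {"key": "value", "key2": {"a": "b"}}'
key, value = split_on_trailing_dict(line)
print("key -", key)
print("value -", value)
```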