Open halaxa opened 2 years ago
On Linux this can already be done using jq: https://stedolan.github.io/jq/
Good point, I didn't know that jq supports stream parsing. The speed will be incomparable, that's clear, but jq's usage with stream parsing seems somewhat unintuitive. While looking at jq's usage, another option came to my mind for `jm`:
```shell
$ wget <big list of users to stdout> | jm --pointer=/results
{"key": 0, "value": {"name": "Frank Sinatra", ...}}
{"key": 1, "value": {"name": "Ray Charles", ...}}
...
```
It is extensible with other fields in the future, such as `position`, `matchedPointer`, and so on...
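For comparison, here is a non-streaming sketch of the same output format using only the PHP stdlib (no JSON Machine, and the whole document is decoded at once, which a real `jm` would avoid):

```php
<?php
// Non-streaming sketch: decode the whole document, then emit one
// {"key": ..., "value": ...} JSON line per item under "results".
$json = '{"results":[{"name":"Frank Sinatra"},{"name":"Ray Charles"}]}';

foreach (json_decode($json, true)['results'] as $key => $value) {
    echo json_encode(['key' => $key, 'value' => $value]), "\n";
}
// {"key":0,"value":{"name":"Frank Sinatra"}}
// {"key":1,"value":{"name":"Ray Charles"}}
```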
You're right, using jq for stream parsing is not intuitive at all. A CLI wrapper for JSON Machine can be easily added:
```php
#!/usr/bin/env php
<?php

use JsonMachine\Items;

if (!is_file(dirname(__DIR__).'/vendor/autoload.php')) {
    throw new LogicException('Composer autoloader missing. Try running "composer install".');
}

require_once dirname(__DIR__).'/vendor/autoload.php';

function usage()
{
    echo sprintf('usage: %s --pointer=""', __FILE__)."\n";
    exit(1);
}

$options = getopt('', ['pointer:']);

if (!isset($options['pointer'])) {
    usage();
}

$iterator = Items::fromFile('php://stdin', $options);

foreach ($iterator as $row) {
    echo json_encode($row)."\n";
}
```
Yes, something along those lines. Using `PassthruDecoder` would eliminate the overhead of decoding and then re-encoding each item. Also, a simple templating system (Mustache, for example) could be used to let the user format the decoded data if that's what they wish. Like:
```shell
$ wget <big list of users to stdout> | jm --item-template="{{name}};{{born}}"
Frank Sinatra;1915
Ray Charles;1930
```
Combined with json pointer it could be quite versatile.
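A naive sketch of such template substitution using only the stdlib (a real implementation would presumably use a proper Mustache library; `renderTemplate` is a hypothetical helper name):

```php
<?php
// Replace each {{field}} placeholder with the matching key from the
// decoded item; unknown fields render as an empty string.
function renderTemplate(string $template, array $item): string
{
    return preg_replace_callback('/\{\{(\w+)\}\}/', function ($m) use ($item) {
        return (string) ($item[$m[1]] ?? '');
    }, $template);
}

echo renderTemplate('{{name}};{{born}}', ['name' => 'Frank Sinatra', 'born' => 1915]), "\n";
// Frank Sinatra;1915
```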
For the uninitiated, jq's streaming parser is usually quite difficult to use, but worse, for the following two essential tasks (described here using standard jq syntax), it is typically very slow (many hours or days) for very big files:
```
.[]
```
that is, "explode" an array into a stream of its top-level items;
```
keys_unsorted[] as $k | {($k): .[$k]}
```
that is, "explode" an object into a stream of corresponding singleton (i.e. single-key) objects.
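The second task, exploding an object into singleton objects, looks like this in plain PHP once the document is decoded (a non-streaming stdlib sketch, not JSON Machine's implementation):

```php
<?php
// Emit each top-level key of a decoded object as its own
// single-key JSON object, one per line.
$obj = json_decode('{"a": 1, "b": [2, 3]}', true);

foreach ($obj as $key => $value) {
    echo json_encode([$key => $value]), "\n";
}
// {"a":1}
// {"b":[2,3]}
```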
To my knowledge, there is currently no CLI-tool for running these two jq queries conveniently, speedily, and losslessly against very large JSON arrays or objects, respectively. (By "lossless" I mean avoiding the loss of precision in handling JSON numbers.)
Being able to use JSON Pointer to fine-tune the point of the "explosion" would be fantastic!
Thank you!
@fwolfsjaeger - Unfortunately your script does not preserve the JSON structure of the items at the specified point(s).
Or at least, I tried it with `'pointer' => '/-'` and with input `[1,2,[10,20],30]`, but the array is in effect completely flattened.
--
Incidentally, after running `composer` successfully, I tried running your script (in the same directory), but the result is a fatal error with the message:

> Composer autoloader missing.

(In fact, both of the files `./vendor/autoload.php` and `./vendor/halaxa/json-machine/src/autoloader.php` are present.)
> Or at least, I tried it with `'pointer' => '/-'` and with input `[1,2,[10,20],30]`, but the array is in effect completely flattened.
That's correct behavior. If you want to iterate over the top level, use the empty-string JSON Pointer (the default). Read more about it in the README to see how exactly a hyphen in a JSON Pointer works. By using `/-` you tell it you want to iterate over each iterable in the top-level array. But most of the values are scalars, so they are not iterated, just passed along as they are. When it hits the only array you have there, that array is itself iterated. So the result is flat.
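A small stdlib model of that behavior (this is an illustration of the explanation above, not JSON Machine's actual code):

```php
<?php
// Model of the "/-" behavior described above: each top-level value that
// is itself iterable gets iterated, while scalars are passed along
// as they are, so nested items end up flattened.
$input = [1, 2, [10, 20], 30];
$out = [];

foreach ($input as $item) {
    foreach (is_array($item) ? $item : [$item] as $value) {
        $out[] = $value;
    }
}

echo json_encode($out), "\n"; // [1,2,10,20,30]
```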
@halaxa - Thank you for your explanation. Please understand that the difficulty I had was precisely because I read the README quite closely, the point being that in JSON, numbers are scalars, not iterables. That is, I would have expected that an attempt to iterate over a number would either result in an error or in nothing at all. (In jq, gojq, and jaq, it results in an error, e.g. `$JQ -n '12 | .[]'`.)
Part of my confusion arose from statements such as the following in an "Overview of JSON Pointer":

> Note that the first character of this String [the JSON Pointer] is a '/' – this is a syntactic requirement.
When I tried using `"/"` as the JSON Pointer, I just got an error, so `"/-"` seemed like the next best bet.
The fact that @fwolfsjaeger's script requires a pointer didn't help my understanding.
Now that I understand how to iterate over an array, I would like to know how to avoid loss of numeric precision, e.g.

```
400000000000000000000000000000000000000000000000000000000123 => 4.0e+59
```
Thank you again.
No problem :)
This sentence

> Note that the first character of this String is a '/' – this is a syntactic requirement.
from here https://www.baeldung.com/json-pointer is incorrect. See https://www.rfc-editor.org/rfc/rfc6901#section-5. The official RFC is also linked from the JSON Machine README https://github.com/halaxa/json-machine#what-is-json-pointer-anyway.
"/"
actually means Iterate over empty string key in root dictionary.
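A minimal RFC 6901 resolver sketch makes the distinction concrete (this is an illustrative stdlib-only function, not JSON Machine's implementation; `resolvePointer` is a hypothetical name):

```php
<?php
// Minimal RFC 6901 resolution sketch: "" addresses the whole document,
// while "/" addresses the value stored under the empty-string key.
function resolvePointer(array $doc, string $pointer)
{
    if ($pointer === '') {
        return $doc; // "" is the whole document
    }
    foreach (explode('/', substr($pointer, 1)) as $token) {
        $token = str_replace(['~1', '~0'], ['/', '~'], $token); // unescape per RFC 6901
        $doc = $doc[$token];
    }
    return $doc;
}

$doc = ['' => 'value under the empty-string key', 'foo' => 'bar'];
echo resolvePointer($doc, '/'), "\n";    // value under the empty-string key
echo resolvePointer($doc, '/foo'), "\n"; // bar
```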
I'll elaborate on the other two points of yours later. Hopefully tomorrow.
@halaxa @fwolfsjaeger - My PHP was never good to begin with and is by now very rusty, but the following script has already proven useful to me and might provide a basis for further improvements. Suggestions would of course be welcome.
[EDIT: The script has been moved to Issue#88 ]
Can you move this last post to a new discussion, please?
@halaxa - The last post is essentially a CLI script, so I thought this would be the best thread?
By the way, many of my colleagues who might benefit from a script such as `jm` would probably be discouraged by the installation hurdles that currently exist, so I was wondering whether you could envision at some point making JSON Machine available via `homebrew`? Or are there other alternatives you can suggest?
It sure is a CLI script. But I understand you want some suggestions, and Discussions would be a better place for that. If you want to contribute code to this repository, please use a pull request. This thread is mainly for ideas and suggestions about how the CLI interface should work. As for other installation channels, I'll let someone else do that for now; it needs its own maintenance time, which I don't have. It's OSS, so anyone can generate any package from any revision. But thank you for your suggestion. Please keep them coming ;)
@pkoppstein

> That is, I would have expected that an attempt to iterate over a number would either result in an error, or nothing at all. (In jq, gojq, and jaq, it results in an error, e.g. `$JQ -n '12 | .[]'`.)
I understand the confusion. The idea is that you can point at either an iterable or a scalar and JSON Machine will always give it to you. Of course, you can run into confusion when using a wildcard JSON Pointer. I have an idea: what about having an option to enable a strict mode? Either you specify `AUTO` (current behavior) or explicitly set `SCALAR_ONLY` or `VECTOR_ONLY`, based on what you want to iterate and which error you want to get otherwise.
@pkoppstein

> Now that I understand how to iterate over an array, I would like to know how to avoid loss of numeric precision, e.g. `400000000000000000000000000000000000000000000000000000000123 => 4.0e+59`
Pass a custom `ExtJsonDecoder` instance to the `decoder` option, configured with `JSON_BIGINT_AS_STRING`.
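The effect of that flag can be demonstrated with the stdlib alone, since `ExtJsonDecoder` forwards it to `json_decode()`:

```php
<?php
// Without JSON_BIGINT_AS_STRING, a too-big integer collapses into an
// imprecise float; with the flag, the exact digits survive as a string.
$json = '{"big": 400000000000000000000000000000000000000000000000000000000123}';

$lossy = json_decode($json, true)['big'];
$exact = json_decode($json, true, 512, JSON_BIGINT_AS_STRING)['big'];

echo gettype($lossy), "\n"; // double (roughly 4.0e+59, precision lost)
echo $exact, "\n";          // 400000000000000000000000000000000000000000000000000000000123
```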
@halaxa - I've created Issue#88 in accordance with your request. Should I now delete the script from the message in this thread?
As mentioned in Issue#87, I'm not sure how JSON_BIGINT_AS_STRING helps, as it converts all "bigints" to strings, and doesn't even attempt to handle big or small decimals.
As for `SCALAR_ONLY`/`VECTOR_ONLY`, I think that the current behavior is fine, and maybe even the "correct" one, depending on how it's documented. In the "help" provided by my version of the `jm` script, I've attempted to clarify the intent by emphasizing the idea of "streaming" rather than "iteration".
> I've created Issue#88 in accordance with your request. Should I now delete the script from the message in this thread?
I agree with deleting it.
> As mentioned in Issue#87, I'm not sure how JSON_BIGINT_AS_STRING helps, as it converts all "bigints" to strings, and doesn't even attempt to handle big or small decimals.
Your example has an integer, so that's why I suggested it. What exact decimal problem are you facing? Oh, I read this before #87. Let's continue there.
It's not clear to me how a CLI script can be implemented to handle a JSON file that contains more than one top-level JSON entity. There are tools for converting such files to JSON Lines format (one JSON entity per line), but it's inconvenient to have to place each of these in a separate file for the sake of JSON Machine. In addition, some of these tools lose numerical precision.
Any suggestions would be appreciated. Thanks.
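Once a file is in JSON Lines form, consuming it is straightforward with the stdlib alone; the hard part is exactly the case above, where multiple top-level entities are concatenated *without* newlines and would need a real streaming parser. A sketch of the easy half:

```php
<?php
// Consume JSON Lines (one JSON entity per line) from a string.
// Concatenated JSON without newline separators is NOT handled here.
$raw = "{\"a\":1}\n[2,3]\n\"four\"\n";
$entities = [];

foreach (explode("\n", trim($raw)) as $line) {
    $entities[] = json_decode($line, true);
}

echo json_encode($entities), "\n"; // [{"a":1},[2,3],"four"]
```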
Please let me know by reactions/voting or comments if a CLI version of JSON Machine would be useful to have. Thanks.
A `jm` command would take a JSON stream from `stdin` and send items one by one to `stdout`, wrapped in a single-item JSON object encoded as `{key: value}`.

Possible usage:
Another idea might be to wrap the item in a JSON list instead of an object, like so:
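The original example is not preserved in this thread, but a hypothetical illustration of what the list-wrapped variant might look like (assuming `[key, value]` pairs, which is my guess rather than anything stated above):

```php
<?php
// Hypothetical list-wrapped output: each item as a two-element
// [key, value] JSON array instead of a {"key": ..., "value": ...} object.
$items = ['Frank Sinatra', 'Ray Charles'];

foreach ($items as $key => $value) {
    echo json_encode([$key, $value]), "\n";
}
// [0,"Frank Sinatra"]
// [1,"Ray Charles"]
```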