halaxa / json-machine

Efficient, easy-to-use, and fast PHP JSON stream parser
Apache License 2.0

Parsing nested values in objects #95

Open kkozlik opened 1 year ago

kkozlik commented 1 year ago

Hello, I am wondering whether it is possible to parse nested values in objects using JSON Machine. The manual describes parsing nested values in arrays using the - in the pointer, but this unfortunately does not work for objects.

I have a JSON in format:

{
    "results": {
        "fruits": {
            "apple": {
                "color": "red"
            },
            "pear":{
                "color": "yellow"
            }
        },
        "vegetable": {
            "carrot": {
                "color": "red"
            }
        }
    }
}

And I do not know the categories (fruits, vegetable,...) in advance. When I use pointer like /results the parser reads the whole category into memory which still could be pretty big.

Is it somehow possible to read only the names of the categories and skip storing their actual content in memory? The PassThruDecoder still loads the content into memory, it just does not decode it. Maybe the parser could somehow be instructed to parse only the keys and not read the content of the { } into $jsonBuffer? Once I have list of categories, I can parse the objects in second round, using pointers for each category.

Or maybe another solution could be to implement the hyphen pointer for objects as well. Then I could parse the file with a pointer like /results/- and read the category names using the getCurrentJsonPointer() function?

halaxa commented 1 year ago

Hi, thanks for participating.

As you said, it is not possible to parse nested items in objects. JSON Pointer is too simple a language for that. We can't just start supporting "-" as a wildcard for object keys because what if a key in an object is "-"?
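To make the ambiguity concrete, here is a minimal sketch in plain PHP. `resolvePointer()` is a hypothetical helper written for this illustration and is not part of json-machine. Under RFC 6901, the token `-` in an object context is simply the key `"-"`, so a pointer such as `/results/-` already has a meaning there:

```php
<?php

/**
 * Resolve an RFC 6901 JSON Pointer against data decoded with json_decode($json, true).
 * Hypothetical helper for illustration only; not part of json-machine.
 */
function resolvePointer(array $data, string $pointer)
{
    if ($pointer === '') {
        return $data; // the empty pointer addresses the whole document
    }
    $current = $data;
    // Drop the leading "/" and split the rest into reference tokens.
    foreach (explode('/', substr($pointer, 1)) as $token) {
        // Unescape "~1" before "~0", as the RFC requires.
        $token = str_replace(['~1', '~0'], ['/', '~'], $token);
        if (!is_array($current) || !array_key_exists($token, $current)) {
            throw new OutOfBoundsException("Nothing found at token '$token'");
        }
        $current = $current[$token];
    }
    return $current;
}

// "-" is a perfectly legal object key, so "/results/-" addresses it literally.
$doc = json_decode('{"results": {"-": {"color": "red"}, "pear": {"color": "yellow"}}}', true);
echo resolvePointer($doc, '/results/-/color'), PHP_EOL; // red
```

If `-` were also treated as a wildcard for object keys, this pointer could no longer address the literal key without some extra escaping rule.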

Parser has to always read everything to get to the object keys. The main use case for the parser is to sequentially read all the items in a specified subtree. That's why it stores every item in memory. It's usually what a programmer wants.

> Once I have list of categories, I can parse the objects in second round, using pointers for each category.

This will only complicate things for you. If you're already there, decode it and use it. The second round will do exactly the same work as the first. If you expect the second round to be somehow more efficient or faster, keep in mind that JSON pointers do not affect parsing time in any way, only memory usage. The parser always has to read everything to get to the desired key. There is no direct access as in hashmaps.

PassThruDecoder is there for such situations. If a single item is too big, do the top-level parsing using it and then parse the produced string via ExtJsonDecoder as shown in the README.

If you're really low on memory try #36. The prototype should work. It should be installable via

composer require halaxa/json-machine:dev-recursive

It might nudge me to finish it :)

If any of this is of no use to you, try for example salsify/jsonstreamingparser.

Does this answer your questions?

kkozlik commented 1 year ago

> We can't just start supporting "-" as a wildcard for object keys because what if a key in an object is "-"?

Yep, true. But I am sure this problem would be solvable with some kind of escaping. And maybe use something else than hyphen...

> Parser has to always read everything to get to the object keys.

I perfectly understand this. I just thought that the value does not need to be held in memory. If /results/fruits has a few thousand records of a few MB each, then even using PassThruDecoder means holding some GBs in memory. However, if the parser did not keep that text in the $jsonBuffer variable and instead just iterated over the keys and threw the data away, I would be able to iterate over the keys using very little memory, wouldn't I? And once I have the keys, I can construct pointers like /results/fruits, /results/vegetable, etc., iterate over the file once again, and hold only a single record in memory at a time. That was just my dumb idea...
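The key-only pass described above can be sketched in plain PHP without json-machine. `scanObjectKeys()` is a hypothetical helper, not an existing API of the library, and it assumes valid JSON and a path that leads through objects only. It still reads every byte of the input, but it buffers only key strings, so memory use stays around one read chunk plus the longest key:

```php
<?php

/**
 * Yield the keys of the object located at $path (a list of object keys,
 * e.g. ['results']) while skipping all values without buffering them.
 * Hypothetical helper for illustration; assumes valid JSON input.
 *
 * @param resource $stream readable stream holding the JSON document
 * @param string[] $path   object keys leading to the target object
 */
function scanObjectKeys($stream, array $path): Generator
{
    $keys = [];         // current key of each open container (null inside arrays)
    $isObject = [];     // parallel stack: true for "{", false for "["
    $expectKey = false; // true when the next string is an object key
    $inString = false;
    $escaped = false;
    $capturing = false; // true while the string being read is a key
    $buffer = '';

    while (($chunk = fread($stream, 8192)) !== false && $chunk !== '') {
        $length = strlen($chunk);
        for ($i = 0; $i < $length; $i++) {
            $char = $chunk[$i];
            if ($inString) {
                if ($escaped) {
                    $escaped = false;
                } elseif ($char === '\\') {
                    $escaped = true;
                } elseif ($char === '"') {
                    $inString = false;
                    if ($capturing) {
                        // Decode JSON escapes in the key, then record it.
                        $key = json_decode('"' . $buffer . '"');
                        $keys[count($keys) - 1] = $key;
                        $expectKey = false;
                        if (array_slice($keys, 0, -1) === $path) {
                            yield $key;
                        }
                    }
                    continue;
                }
                if ($capturing) {
                    $buffer .= $char; // value strings are never buffered
                }
                continue;
            }
            if ($char === '"') {
                $inString = true;
                $capturing = $expectKey;
                $buffer = '';
            } elseif ($char === '{' || $char === '[') {
                $keys[] = null;
                $isObject[] = ($char === '{');
                $expectKey = ($char === '{');
            } elseif ($char === '}' || $char === ']') {
                array_pop($keys);
                array_pop($isObject);
                $expectKey = false;
            } elseif ($char === ',') {
                $expectKey = $isObject !== [] && end($isObject);
            }
        }
    }
}

// Usage: list the categories under "results" from an in-memory stream.
$stream = fopen('php://memory', 'r+');
fwrite($stream, '{"results": {"fruits": {"apple": {"color": "red"}, "pear": {"color": "yellow"}},'
    . ' "vegetable": {"carrot": {"color": "red"}}}}');
rewind($stream);
$categories = iterator_to_array(scanObjectKeys($stream, ['results']), false);
print_r($categories); // the two category names: fruits, vegetable
```

Each collected name could then be turned into a pointer such as /results/fruits for a second pass. As noted earlier in the thread, that second pass reads the whole file again, so this trades time for memory.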

#36 looks like a promising and much more elegant solution, of course. I will try it.

Thanks for pointing me to the salsify parser, I will also have a look at it.

halaxa commented 1 year ago

> Yep, true. But I am sure this problem would be solvable with some kind of escaping. And maybe use something else than hyphen...

The only option via escaping would be to add another escape sequence, for example ~2, which could mean an asterisk (maybe the equivalent of the .* regex), and thus stop being compatible with the JSON Pointer spec. I'm not sure I'm inclined to do this.

https://datatracker.ietf.org/doc/html/rfc6901#section-3
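For reference, the linked section defines exactly two escape sequences: ~0 for "~" and ~1 for "/". The sketch below shows them in plain PHP. The function names are hypothetical and not part of json-machine. The validity check follows the grammar in that section, where a "~" may only be followed by "0" or "1", so a token containing ~2 does not match it:

```php
<?php

/** Escape an object key so it can be used as an RFC 6901 reference token. */
function escapeToken(string $key): string
{
    // "~" is replaced first so the "~" introduced by "~1" is not escaped again.
    return str_replace(['~', '/'], ['~0', '~1'], $key);
}

/** Turn a reference token back into the object key it refers to. */
function unescapeToken(string $token): string
{
    // "~1" is replaced first so that "~01" decodes to "~1", not "/".
    return str_replace(['~1', '~0'], ['/', '~'], $token);
}

/** Check a token against the RFC 6901 grammar: every "~" needs "0" or "1" after it. */
function isValidToken(string $token): bool
{
    return preg_match('/~(?![01])/', $token) === 0;
}

var_dump(escapeToken('a/b~c'));     // string(7) "a~1b~0c"
var_dump(unescapeToken('a~1b~0c')); // string(5) "a/b~c"
var_dump(isValidToken('-'));        // bool(true)
var_dump(isValidToken('~2'));       // bool(false)
```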

XedinUnknown commented 1 year ago

@halaxa, hi! Thanks for the awesome lib! I really hope it will become a fully-comprehensive solution one day, because as of now it seems to be at least the best 🙏

To the point, IMHO: if the spec doesn't say that you must not introduce any other escape sequences, and that everything besides the two defined sequences must be treated literally, I believe you would technically still be compatible with the spec; you'd simply extend it.