ohler55 / ojg

Optimized JSON for Go
MIT License
834 stars 50 forks

Extract data out of large JSON #179

Closed mitar closed 4 weeks ago

mitar commented 1 month ago

I am building a tool that extracts data from potentially large JSON. If the data is NDJSON, it is easy to read it line by line and extract data from each separate object. But if the data is a large JSON array, or even worse, a large JSON array nested under one field of a JSON object (an example of such a file is brandedDownload.json), then it seems I have to load the whole file into memory before I can extract that data using the JSONPath support provided by this package. It would be nice if I could lazily construct the path and then iterate over the nested array, loading into memory only as much as is needed to advance to the next array element.

ohler55 commented 1 month ago

Have you looked at the Tokenizer? It requires a bit more work to deal with the callbacks but it does allow processing without loading the whole JSON into memory.

mitar commented 1 month ago

You mean Tokenize? I could not find any existing TokenHandler implementation; in particular, the jp package does not seem to provide a TokenHandler that could be used in a streaming manner. I would like users of my tool to write the query/path of what to extract in standard JSONPath, but it seems I would then have to implement the conversion from JSONPath to a TokenHandler myself to be able to do this in a streaming manner?

ohler55 commented 1 month ago

The TokenHandler is an interface meant to be implemented by the caller. An example is in the tokenizer_test.go file.

The request as you described it wants to look at effectively arbitrary elements of a JSON document. OjG provides a means to look at all the elements of a JSON document using the Tokenizer. By creating their own TokenHandler the caller can decide which elements to keep and which to discard. A path can be built using the Key method of the TokenHandler.

ohler55 commented 1 month ago

BTW, I do like your idea. If you were not planning on making the handler I might implement it myself. Happy to let you make the offering though.

mitar commented 1 month ago

Yeah, I think I get the idea of how this could work, but sadly I do not currently have time to implement it myself. So feel free to go for it. If I am able to circle back to this, I will let you know here.

ohler55 commented 1 month ago

It rained this morning so I implemented the Match functions along with a jp.MatchHandler. There are tests in the oj, sen, and jp directories if you want to see how they are used. Please give them a try and let me know what you think.

It is all in the "path-handler" branch.

ohler55 commented 1 month ago

I've also updated the oj command (cmd/oj/oj) to handle the path handler with the "dig" option. I'll release tomorrow unless you have some feedback before then.

ohler55 commented 1 month ago

v1.24.0 released with the additional feature.

mitar commented 6 days ago

I never responded here, but this really looks awesome! Thanks!