rubensworks / jsonld-streaming-parser.js

A fast and lightweight streaming JSON-LD parser for JavaScript
https://www.rubensworks.net/blog/2019/03/13/streaming-rdf-parsers/
MIT License

Replace jsonparse dependency #100

Closed Tpt closed 1 year ago

Tpt commented 2 years ago

This is a subtask of #76

The jsonparse library keeps the full JSON content in memory in order to fill the "value" attribute. This leads to large memory consumption, as reported in #65.
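
To illustrate the difference (a toy sketch, not jsonparse's actual code): a parser that hands the caller a complete value must retain the whole document, while an event-based parser only keeps the current partial token.

```javascript
// Buffering approach: to produce a complete `value`, the parser must keep
// every byte of the document until the root closes, so memory grows with
// input size.
function parseBuffered(chunks, onValue) {
  let all = '';
  for (const chunk of chunks) all += chunk;
  onValue(JSON.parse(all));
}

// Event approach (toy example, handles only arrays of strings): emit each
// string as soon as its closing quote is seen; only the current partial
// token survives across chunk boundaries.
function parseStringArrayEvents(chunks, onString) {
  let inString = false;
  let escaped = false;
  let token = '';
  for (const chunk of chunks) {
    for (const ch of chunk) {
      if (inString) {
        if (escaped) { token += ch; escaped = false; }
        else if (ch === '\\') escaped = true;
        else if (ch === '"') { inString = false; onString(token); token = ''; }
        else token += ch;
      } else if (ch === '"') {
        inString = true;
      }
      // '[', ']', ',' and whitespace need no buffering at all
    }
  }
}
```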

Some alternatives are described in rubensworks/test-performance-json-parse-stream.

Some other options:

Possible way forward:

If we want to squeeze out as much performance as possible, the SAX variant of jsonparse seems the way to go, but it means maintaining yet another library. The two other options require either forking or managing to get a change merged and deployed, and that does not seem trivial.

rubensworks commented 2 years ago

the SAX variant of jsonparse

The biggest problem with this one is the fact that there's no license on the gist, which means that we can't just copy-paste it and package it ourselves. But if you think this one is promising, we could ask the author if he agrees with us reusing it under a certain license.

Make stream-json compatible with web browsers.

This also sounds like a good solution if that SAX variant of jsonparse is impossible. We could discuss with the author if he's open to a PR to make it web-compatible.

Tpt commented 2 years ago

the SAX variant of jsonparse

The biggest problem with this one is the fact that there's no license on the gist, which means that we can't just copy-paste it and package it ourselves. But if you think this one is promising, we could ask the author if he agrees with us reusing it under a certain license.

That's a good point. An even easier approach: I could use jsonparse as a base and simplify it to avoid the memory problem. It should not be too hard.

If that's fine with you, I'll move in this direction.

rubensworks commented 2 years ago

I could use jsonparse as a base and simplify it to avoid the memory problem. It should not be too hard.

Not sure that will be very easy to do. But if you think it's possible, this would indeed be the best way forward.

If that's fine with you, I'll move in this direction.

👍

Tpt commented 2 years ago

I have started the fork here: https://github.com/Tpt/json-event-parser.js

I have modernized the codebase by converting it to TypeScript. Next step: avoid keeping all data in memory.

Tpt commented 2 years ago

I now have a working version without in-memory buffering. I ran the JSON performance benchmark again (code here):

| Library | Execution time | Memory |
| --- | --- | --- |
| stream-json | 4.770s | 89MB |
| jsonparse | 3.381s | 410MB |
| SAX variant of jsonparse | 1.973s | 100MB |
| clarinet | 2.425s | 94MB |
| json-event-parser (new) | 2.152s | 94MB |

So json-event-parser is slightly slower than the fastest option, the SAX variant of jsonparse (0.2s slower), and uses slightly more memory than the least memory-hungry option, stream-json (5MB more).

@rubensworks How do you want us to proceed? I might first do a WIP PR on jsonld-streaming-parser using json-event-parser instead of jsonparse.

rubensworks commented 2 years ago

@Tpt Going forward with the new json-event-parser lib sounds perfect to me! Great that you've managed to reduce exec time and memory usage there.

Is it already capable of true streaming (which jsonparse could not)? E.g., if you attach a data listener for a large file, and remove it again after 100 data events, the file is not processed anymore?

Would you agree with moving the json-event-parser repo (not urgent) under the comunica namespace? Just so that we're certain that we can still maintain it in case you would not be available anymore in the future for whatever reason. You can still keep full push access and be listed as the author of course.

Tpt commented 2 years ago

Is it already capable of true streaming (which jsonparse could not)? E.g., if you attach a data listener for a large file, and remove it again after 100 data events, the file is not processed anymore?

Yes, with the restriction that the current json-event-parser implementation cannot be stopped after an arbitrary event. It will keep reading the chunks that were already given to it up to their end. But it's not a big deal if the written chunks are small. I am considering making JsonEventParser implement the Transform stream interface to make its API cleaner and avoid this limitation. However, I am not sure how to wrap a Transform inside another Transform, so I am not sure it would be possible to keep the JsonLdParser API unchanged.
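
The restriction can be sketched as follows (a hypothetical minimal push parser, not the real JsonEventParser): stopping takes effect only on chunk boundaries, because a chunk that is already being processed is always consumed to its end.

```javascript
// Toy push parser: each write(chunk) emits one event per character.
// stop() prevents *future* chunks from being processed, but cannot cut
// short a chunk that write() is already iterating over.
class ChunkEventParser {
  constructor(onEvent) {
    this.onEvent = onEvent;
    this.stopped = false;
  }

  stop() {
    this.stopped = true;
  }

  write(chunk) {
    if (this.stopped) return false; // refuse new chunks once stopped
    for (const ch of chunk) this.onEvent(ch); // current chunk always finishes
    return true;
  }
}
```

This is why small chunks keep the wasted work small: at most one chunk's worth of events is emitted after the consumer asks to stop.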

Would you agree with moving the json-event-parser repo (not urgent) under the comunica namespace?

With pleasure. It would be much better this way.

rubensworks commented 2 years ago

It will keep reading the chunks that were already given to it up to their end. But it's not a big deal if the written chunks are small.

That sounds ok.

However, I am not sure how to wrap a Transform inside another Transform, so I am not sure it would be possible to keep the JsonLdParser API unchanged.

This would be a major version bump anyways, so it's ok if we need to do other breaking changes.

Tpt commented 2 years ago

However, I am not sure how to wrap a Transform inside another Transform, so I am not sure it would be possible to keep the JsonLdParser API unchanged. This would be a major version bump anyways, so it's ok if we need to do other breaking changes.

This would lead API users to either use JsonLdParser.import(myStream) or have to write myStream.pipe(new JsonEventParser()).pipe(new JsonLdParser()). I am not sure that is very convenient. I am looking for a clean way to hide JsonEventParser inside JsonLdParser while keeping backpressure, but it does not seem so easy. If you know how to do it, that would be very welcome.

Edit: I have found stream.compose, which does exactly what I want.

rubensworks commented 1 year ago

We went with another temporary workaround in 448c8fe47fcd8d7f2f9124edbf2631913eb9b460. But the new json-event-parser will still be needed for #76.