rubensworks / jsonld-streaming-parser.js

A fast and lightweight streaming JSON-LD parser for JavaScript
https://www.rubensworks.net/blog/2019/03/13/streaming-rdf-parsers/
MIT License

Ensure backpressure is maintained in streams #71

Open rubensworks opened 3 years ago

rubensworks commented 3 years ago

There may be some places where we are not adhering to Node's backpressure conventions with regard to streams.

Concretely, we seem to be creating new streams (such as Readable, PassThrough and Transform) and push-ing into them from a 'data' handler on another stream. Node signals backpressure via the return value of push, which we are ignoring in this manner.

A better solution would be to simply pipe instead of calling push on each data element.

Related to https://github.com/rubensworks/rdf-parse.js/commit/269c757935c54b388e1bde076dc29c2afc2e8e7b

Related to #66

Tpt commented 2 years ago

I just had a quick look at it. I see two places where there might be significant problems with backpressure:

  1. JsonLdParser.import does not care about backpressure at all. A quick way to add backpressure there would be to use the pipe method on the input stream when available.
  2. If no context is provided by the caller and streaming mode is disabled, the JSON-LD parser buffers all input and only starts parsing once the JSON is fully read. It then pushes all output when the last piece of JSON has been read. I am not sure how bad ignoring backpressure is, given that we have already buffered the complete JSON into memory.

rubensworks commented 2 years ago

If no context is provided by the caller and streaming mode is disabled, the JSON-LD parser buffers all input and only starts parsing once the JSON is fully read. It then pushes all output when the last piece of JSON has been read. I am not sure how bad ignoring backpressure is, given that we have already buffered the complete JSON into memory.

Indeed, unless the streaming profile is enabled, the full JSON-LD document is stored in memory before processing can start.

This is because without the streaming profile (which mandates that certain JSON keys come in specific orders), the following situation could occur:

{
  ... very long JSON-LD document
  "@context": "..."
}

Since the context in the example above only arrives at the end, it could change the meaning of all preceding entries. That is why buffering must take place. With the streaming profile, on the other hand, the @context MUST occur first (or after @type).
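Under the streaming profile, the same document would instead have to put its context up front, so each subsequent key can be interpreted (and quads emitted) as it arrives. Something like:

```json
{
  "@context": "...",
  ... very long JSON-LD document
}
```

The profile is opt-in via a parser option (the exact option name is in the library's README), which is what the next comment refers to.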

Just pinging @wouterbeek and @LaurensRietveld on this as well, as this will have an impact on the bounty's resolution. I.e., if the streaming profile is not explicitly enabled, full buffering will take place, and memory issues can still arise.

LaurensRietveld commented 2 years ago

Hi @rubensworks and @Tpt, having to explicitly enable the streaming profile in order to properly stream through the JSON-LD (without full buffering) is fine from our end.