rdfjs / stream-spec

RDF/JS: Stream interfaces – A specification of a low level interface definition representing RDF data independent of a serialized format in a JavaScript environment.
https://rdf.js.org/stream-spec/
5 stars 2 forks source link

Whatwg Streams interface #12

Open rescribet opened 5 years ago

rescribet commented 5 years ago

Is there any particular reason the current streams API is based on the event-based nodejs model, rather than the promise-based whatwg spec?

The latter has already been deployed in most browsers, and allows any type, so passing Quad instances as chunks would seem to be compliant, allowing full spec deference making this spec a lot simpler and cross-compatible.

The guys over at nodejs also seem to have started on implementing the whatwg spec, so switching might prevent premature obsolescence.

awwright commented 5 years ago

I'd also note Nodejs guarantees that callbacks will be called exactly once, but as far as I can tell, there's no equivalent finalization callback in our spec; the only option is to subscribe to every event.

I think some Promise interface would be beneficial for this.

On Jan 28, 2019 09:55, "Thom van Kalkeren" notifications@github.com wrote:

Is there any particular reason the current streams API is based on the event-based nodejs model, rather than the promise-based whatwg spec https://streams.spec.whatwg.org/?

The latter has already been deployed in most browsers, and allows any type https://streams.spec.whatwg.org/#model, so passing Quad instances as chunks would seem to be compliant, allowing full spec deference making this spec a lot simpler and cross-compatible.

The guys over at nodejs also seem to have started on implementing the whatwg spec https://github.com/nodejs/whatwg-stream, so switching might prevent premature obsolescence.

β€” You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/rdfjs/representation-task-force/issues/149, or mute the thread https://github.com/notifications/unsubscribe-auth/AAatDeGj9Gay0hMCVjGZ01XL4XzjASedks5vHyt9gaJpZM4aWVvo .

RubenVerborgh commented 5 years ago

Is there any particular reason the current streams API is based on the event-based nodejs model

It's not, it's based on the subset of it that is fast πŸ™‚ e.g., no backpressure handling etc.

rather than the promise-based whatwg spec?

It is terribly slow still for the moment.

Source: had to roll my own iterator library for an RDF query engine that uses Terms internally, because Node.js streams (one magnitude) and promise-based streams (two magnitudes) are slower.

rescribet commented 5 years ago

e.g., no backpressure handling etc.

C-code without bounds checking should be significantly faster as well, but there are reasons why the features are there of course. One or two orders is a pretty large hit anyway, but I fear the features left out will find their way back into the spec over time (e.g. @awwright calling guarantees)

promise-based streams (two magnitudes) are slower.

The streams interface is relatively new and will probably gain performance improvements over time, especially when native support has been written into the engine (rather than building something custom). But it's already fast enough for the guys over at facebook to use it to stream video

Any promise based solution would have the overhead of instantiating, scheduling, and calling those promises, doing that for every statement separately would severely degrade performance. The current rdfjs spec emits one quad at-a-time, was the Promise-based implementation you tested also doing that?

RubenVerborgh commented 5 years ago

I fear the features left out will find their way back into the spec over time

The low-level spec provides everything a higher-level spec needs to build in such features as back-pressure handling and promises.

Any promise based solution would have the overhead of instantiating, scheduling, and calling those promises

Exactly.

doing that for every statement separately would severely degrade performance.

But that's the way you do (async) streams.

bergos commented 5 years ago

Besides the Quad streams our RDFJS packages will produce, they need to be feed by streams from other packages and need to write via streams to other packages. If you look at available, used and maintained packages, Node.js streams are the current standard. They are also available as separate package. In the latest version they can be used with Symbol.asyncIterator. As async loops are a JS language feature and not just a spec on top of JS, I think we are on the save side. This was also already discussed in #129.

We also discussed it in one of the calls and we decided to use the subset of Node.js streams. As this topic was already closed some time ago, I will close this issue. If there are new strong arguments for WHATWG streams, we can open the issue anytime.

elf-pavlik commented 5 years ago

Transfered to stream-spec repo of @rdfjs and reopened to clarify few things.

I tried adding parser-n3 based on n3.js to some @solid related project and webpack bundle includes ~70k from readable-stream repo which adds quite a lot to recommended performance budgets: https://medium.com/@addyosmani/the-cost-of-javascript-in-2018-7d8950fbb5d4

Source: had to roll my own iterator library for an RDF query engine that uses Terms internally, because Node.js streams (one magnitude) and promise-based streams (two magnitudes) are slower.

@RubenVerborgh do you refer to https://github.com/RubenVerborgh/AsyncIterator ? N3.js doesn't seem to use it. I see a lot of packages from @comunica, node-quadstore and some other modules in https://www.npmjs.com/browse/depended/asynciterator

I see that @mcollina besides readable-stream also works on https://github.com/nodejs/whatwg-stream/ maybe he would have some input on defining RDF/JS Streams interfaces in a way that they will align with a future direction of node and whatwg streams.

RubenVerborgh commented 5 years ago

Nope, N3.js should work out of the box without any stream dependencies, depending on what exactly you use.

mcollina commented 5 years ago

I think the best way is to add support for async iterators - when they are available. Both streams implementation offer them, they are fast, and they are more ergonomic and intuitive. They do the right think with error handling, which is always tricky.

RubenVerborgh commented 5 years ago

Thanks @mcollina, but It's more nuanced than that:

  1. ES6 AsyncIterator is still at least a magnitude slower. Matters if you commonly have millions of elements.

  2. We only use a subset of the Stream interface; that subset is covered by simpler and more performant libs like https://github.com/RubenVerborgh/AsyncIterator

We are keeping an eye on ES6 AsyncIterator performance (I wish we had live tests in a browser, @joachimvh), but it's not looking good. Logical, because the protocol does more than the subset of stream we use.

elf-pavlik commented 5 years ago

Thanks @mcollina and my bad with splitting conversation between here and twitter: https://twitter.com/elfpavlik/status/1104500178432806912

We had some prior conversation where more than one person expressed :+1: for ES Asynciterator https://github.com/rdfjs/dataset-spec/issues/42#issuecomment-456155536

RubenVerborgh commented 5 years ago

^ On high-level. Do this on low-level, and be prepared to handle streams one magnitude slower πŸ˜‰

elf-pavlik commented 5 years ago

Can we find a way to provide both:

We only use a subset of the Stream interface; that subset is covered by simpler and more performant libs like https://github.com/RubenVerborgh/AsyncIterator

Last two AFAIK have common interface but not the WHATWG ReadableStream.

RubenVerborgh commented 5 years ago

I currently don’t see a way. The different interfaces are key to performance; change them and you change performance.

mcollina commented 5 years ago

Agreed. You can easily add async interator support to your module and achieve better usability. It also does not remove your fast API when there is a need to.

elf-pavlik commented 5 years ago

Sounds like:

mcollina commented 5 years ago

You can provide an implementation of Symbol.asyncIterator without using any stream library.

elf-pavlik commented 5 years ago

I probably need to reformulate the problem. How we could implement those two parsers

So that one could use them in a web browser, parse response.body (a WHATWG ReadableStream), and add quads to the store. Currently both parsing and adding to store relies on Sink#import interface which expects a Stream with node style read() method on it. Besides parsers also quadstore module also implements Sink#import: https://github.com/beautifulinteractions/node-quadstore#rdfstoreprototypeimport

Would all three examples above and other similar modules, provide alternative interface which would expect ES AsyncIterator and return one? Application could choose which interface to import and ES modules could allow granular fetching or tree shaking.

Still, currently @rdfjs modules seem to directly import node stream eg. https://github.com/rubensworks/jsonld-streaming-parser.js/blob/master/lib/JsonLdParser.ts#L5

BTW it also seems that even if application would want to use fast asynciterator module, libraries above will bundle in readable-stream anyways since it gets imported there.

We also have this issue, to distribute those libraries as ES modules and not depend on Common JS and bundling: https://github.com/rdfjs/parser-n3/issues/4

Which also might have benefits in cases where application uses dynamic import() to lazy load functionality.

RubenVerborgh commented 5 years ago
  • if one has processing speed as priority one would use asynciterator module and send extra ~55k to the browser

4k gzipped

  • if one wants to have read() which currently this spec requires on Streams but also has priority to use Symbol.asyncIterator over speed, would use readable-stream module and send extra ~70k to the browser

asynciterator also has read().

  • if one has priority to minimize code send to the browser and wants to use Symbol.asyncIterator provided by WHATWG ReadableStream, out of luck since this spec requires read()?

yes

RubenVerborgh commented 5 years ago

Would all three examples above and other similar modules, provide alternative interface which would expect ES AsyncIterator and return one?

We probably want an opt-in spec for ES6 AsyncIterator. Although it wouldn't be much of a spec; it's basically just ES6 AsyncIterator. We might want to add that to stream-spec itself (but might be good to have it orthogonal).