Node.js Streams vs pull-streams vs ReactiveJS

dignifiedquire commented 8 years ago

Reasons for consideration

In the recent months of developing js-ipfs we had some issues with various parts of the different versions of node-streams. Some of the major ones were

Hard to control error handling, especially as we pipe and wrap streams many times from module to module
Issues with flushing data, write streams don't give guarantees about when data has been written to the underlying data source, for example disk. This resulted in some very problematic race conditions (ref #351)
Behaviour different depending on the module. Given the complicated history it is not always clear when combining streams from core and multiple modules in which mode they are currently and why they are not behaving as expected.

In light of these issues we are exploring different alternatives to replace the stream based code inside js-ipfs

Currently I have identified two possible alternatives

1. Pull Streams

Example code

dignifiedquire commented 8 years ago

The only viable other alternative to node-streams I found so far was https://pull-stream.github.io. They are quite nice and there is an example of secio implemented here: https://github.com/libp2p/js-libp2p-secio/pull/8

The main barrier I hit with them was integration with node-streams, so if we were to move the whole code base to use pull-streams this might solve those issues.

daviddias commented 8 years ago

First, thank you @dignifiedquire for chasing this streams issue and for researching possible solutions by doing experiments with actual implementations, really 5 🌟 👌.

To give a bit more emphasis to the issues we've realised (through the hard way), the Node.js Streams, also known as Streams 2, (problem 1) don't propagate errors, even (specially) when piped (.pipe) together. Adding to this, (problem 2) the fs writable streams don't guarantee that the data was flushed to disk when the 'end' event is emitted, given the chance for some race conditions when we try to read faster than the syscal has time to flush. The last issue (problem 3), doesn't come from the Node.js streams, but more from the streams ecosystem as a whole, where not all libraries that implement streams actually implement the full interface/expectations, giving room for some errors and requiring some monkey patching.

I believe, after having gone through and also talked with several people, that we shouldn't treat these 3 problems as one, and falling into the trap of patching the 3 pain points with a large bandaid, which will bring a lot more than what we need, increasing the complexity and also hitting performance and the js blob size.

Just to put in context, a normal communication between two IPFS peers is something like:

Candidate solutions

pump

pump is a small module that offers correct error propagation and stream destruction. It can be seen as .pipe done right.

pump is part of a collection of utility modules for Node.js Streams, called mississippi.

rxjs

rxjs is a very large project supported by lots of groups and companies, its documentation is in a very advanced state and exists for more languages than Node.js.

It offers a lot of nice APIs to operate streams of data, or 'arrays over time', however, (from what I can understand), doesn't offer any backpressure mechanism, essential for networking protocols or operations over large quantities of data. In some sense, it is similar to Node.js Streams 1 (Streams Classic), but with fancier APIs

pull-streams

pull-streams are super interesting, I really like how the ecosystem is growing and they result from similar frustrations that Node.js developers had in the past.

I was initially confused of how to ensure that pull-streams offer backpressure mechanism, specially compatible with the underlying transports (push streams), but @yoshuawuyts appeared in the IPFS workshop for the rescue and gave me a really good explanation.

Essentially pull-streams leverage the underlying push-streams highWatermark to do the network backpressure for them, as seen in the following image:

Problem 1 - propagation of errors

This is a requirement for js-ipfs, as each part of the code works isolated from the remaining services, but they all mount on libp2p to open the connections. As seen in Figure 1, an error in a network socket will cause several parts of the code to have to handle that error and react to it.

It seems to me that any of the proposed solutions would solve this problem really well. Then it will be up to the developers to understand how to handle the network errors if they open a connection themselves.

Problem 2 - ensure data gets flushed

It doesn't seem to me that any of the proposed solutions offers a fix for this problem, because in the end, we are always using Node.js stream to write to disk.

What we can (and should do) is solve this issue at the ipfs-repo level, making sure that data is flushed before we let the end event be emitted, and while we are at it, add a LRU cache to make reads faster, mitigating also the problem of the false negatives.

Problem 3 - streams ecosystem

This feels a bit like lying to ourselves, it is true that not all stream modules implement Node.js streams as we could expect, but that doesn't mean that whatever other modules we get for pull-streams or rxjs will do everything just right.

This has to be solved by having very comprehensive tests, unit, integration and chaos tests, to make sure we don't miss anything.

Important to note that, modules like spdy, won't have a rxjs or a pull-streams complete version (at least soon).

Other concerns

As we can see in Figure 1, we do a lot of plumbing between the several pieces of the code sometimes through read X bytes, others by directing the communication (.pipe), others by wrapping the stream into a new stream (duplexify). I don't see a clean and easy way to do that either with rxjs or pull-streams, without having to do even more wrapping and patching. This will make the code more complex and therefore, error prone.
rxjs requires some very specific wrapping of the transports to work (see https://github.com/dignifiedquire/rxjs.node/blob/master/src/socket-subject.js and https://github.com/dignifiedquire/rxjs.node/blob/master/src/stream-subject.js), this still doesn't feel right to me. To buy in into this interface, we would have to expose it on the libp2p-{tcp, websockets, webrtc, etc} transports too.
I strongly recommend and vote that we keep exposing natural (language constructs based) API, such as callbacks, promises an streams. Requiring a user to buy into a library to be able to use js-ipfs creates a lot of friction.
Each time we add a library, we add a considerable amount of code that gets bundled into the project and as you may know, currently js-ipfs is close to 4MB of JavaScript. If that library because a dependency of how all the other modules get attached together, than it gets very hard to phase it out one day.

Other thoughts

We have to be cautious when proposing/moving to other ecosystems. Every single one of them has its qualms and moving back and forth will always add noise in the migration. There are plenty of libraries and frameworks that optimize the developers experience, however, since they might require a huge paradigm shift on the code structure, they lack to get adoption. In the end, it is the people interest that will make a good lib/framework, a good framework. One good example is the async module, that although not a language construct, it is a well known module for the Node.js community and its patterns are wildly known.

This takes me to my next point which is, in order to maximize adoption and contribution, language API are typically the best bet.

Summing up

In the end, independently of the solution we decide to go for, we should understand and make sure we don't add more complexity and problems for the long term. Bringing in a framework to solve control flow will certainly add a considerable amount of overhead (the large bandaid).

We probably can get away with pump and the repo that knows how to cache and flush properly.

pull-streams look pretty exciting with their low footprint and convenience, but they only solve issues partially (error propagation) and we would need to expose regular streams anyhow, which they do pretty conveniently, same doesn't happen for rxjs.

Let's make sure to chat Friday and make a decision and move on, please do comment in this thread before hand, so that we make the meeting very efficient.

dignifiedquire commented 8 years ago

Thank you very much @diasdavid for this comprehensive post, below are a couple of notes just for clarification.

As we can see in Figure 1, we do a lot of plumbing between the several pieces of the code sometimes through read X bytes, others by directing the communication (.pipe), others by wrapping the stream into a new stream (duplexify). I don't see a clean and easy way to do that either with rxjs or pull-streams, without having to do even more wrapping and patching. This will make the code more complex and therefore, error prone.

This is very nicely solved in pull-streams. Wrapping and things like reading from streams per byte is super elegant, as can be seen for example here: https://github.com/libp2p/js-libp2p-secio/blob/pull/src/handshake/finish.js#L22-L32

This takes me to my next point which is, in order to maximize adoption and contribution, language API are typically the best bet.

Yes, but I don't think we should sacrifice our productivity just for adoption as a library. We don't want to clash with everything the community does, but this shouldn't stop us from trying to write as efficient as possible the most excellent code that we can.

We probably can get away with pump and the repo that knows how to cache and flush properly.

I'm afraid that we can't get away with this, as the data flushing issues are more serious than just the fs part. The most pressing issue, the one which lead me to investigating other stream solutions was the one I encountered when trying to implement secio. It encounters the issue that wrapping node-streams and trying to properly manage things like pause and resume data gets lost if you do not attach to it quick enough. As far as my understanding goes pump does not solve this issue in any way.

Let's make sure to chat Friday and make a decision and move on, please do comment in this thread before hand, so that we make the meeting very efficient.

I don't know how your schedule looks like but I would appreciate if we could actually have this chat today, given that it is currently blocking 90% of the work I want to get a move on.

daviddias commented 8 years ago

Ok, let's do it today then :) 4pm UTC sounds good? http://www.worldtimebuddy.com/?qm=1&lid=2267057,100,308,12&h=100&date=2016-8-4&sln=16-17

daviddias commented 8 years ago

Pinging @haadcode @nginnever @xicombd @RichardLitt @VictorBjelkholm @shamb0t and anyone else following this, you are welcome to join the chat, don't feel obligated to though :)

yoshuawuyts commented 8 years ago

It doesn't seem to me that any of the proposed solutions offers a fix for this problem, because in the end, we are always using Node.js stream to write to disk.

https://github.com/dominictarr/pull-fs uses fs.open() to create a file descriptor and then uses the cursor to write to disk in chunks. It sidesteps all of Node streams, so it's actually possible to ditch all, if not most of node streams. Probably the only place where that's currently not being done is net / http as they rely on the stdlib - but there's wrappers available that abstract away the internals and only expose the interface (such as pull-http-server). There should be no need to ever have node streams in the browser

Important to note that, modules like spdy, won't have a rxjs or a pull-streams complete version (at least soon)

Yeah, chances are slim that will happen soon. But that's fine because in Node land wrapping the Node streams API isn't the worst. It supports backpressure so any other paradigm can me mapped on it easily, and perf should be comparable. Here's an example pull-streams http wrapper

I strongly recommend and vote that we keep exposing natural (language constructs based) API, such as callbacks, promises an streams. Requiring a user to buy into a library to be able to use js-ipfs creates a lot of friction.

Probably the only native abstraction capable of doing streaming stuff well would be to have an explicit OO API to interface with the streaming state machine, like Node's fs file descriptor methods. All other abstractions could then be built on top of that.

I'd be super wary of exposing multiple interfaces as JS, the language™ is a moving target. E.g. what will the story be for async / await, event emitters, promise cancellables. I reckon versioning and updating interface-specific packages built around a single core API would be the most sane approach, keeping everyone happy.

An example OO API:

const ipfs = require('ipfs')

ipfs('/my-cool/repo-name', (err, node) => {
  if (err) throw err
  const cat = node.files.cat(function (err, data) {
    if (err) return console.log('error: ' + err)
    console.log('chunk: ' + String(data))
  })

  cat.start()
  cat.pause()
  cat.stop()

  node.stop()
})

For native abstractions you could totes create a plugin system to expose abstractions, if anything my playground repo has shown that wrapping the API isn't that hard:

// pull-stream
const ipfsPull = require('ipfs-pull-stream')
const pull = require('pull-stream')
const ipfs = require('ipfs')

ipfs('/my-cool/repo-name', (err, node) => {
  if (err) throw err

  node.use(ipfsPull())

  const cat$ = node.files.cat('<hash>')
  pull(cat$, pull.log())
})

// node streams
const ipfsNodeStreams = require('ipfs-node-streams')
const ipfs = require('ipfs')
const pump = require('pump')

ipfs('/my-cool/repo-name', (err, node) => {
  if (err) throw err

  node.use(ipfsNodeStreams())

  const cat$ = node.files.cat('<hash>')
  pump(cat$, process.stdout)
})

Does this make sense?

whyrusleeping commented 8 years ago

If you guys don't make a solid decision during this meeting i'm going to make the decision for you.

That is a threat I intend to follow through on.

hackergrrl commented 8 years ago

Really awesome work @dignifiedquire in concretely identifying the pain points in dealing with node streams. I can't make the hangout, but here are my thoughts:

For me, pull streams' # 1 superpower is their simplicity (in the Rich Hickey sense of the word): anyone can write a full pull stream implementation from scratch in a few minutes, from first principles. This is not true of node streams in the slightest. Simplicity brings transparency with it, meaning debugging and reasoning about implementation gets easier. I think this is the biggest argument I can make for pull streams. In this sense, I bet js-ipfs would benefit tremendously.

That said, pull streams' lack mainstream awareness. Despite being simple, grokking pull streams is certainly not easy. It took me a while to wrap my head around the big ideas, and I still don't wield them as intuitively as node streams. Pushing IPFS contributors to grok pull streams could hurt the project's approachability.

Summary: deviating from node streams in 2016 seems like a high-risk move re raising the barrier to contribution, despite potential gains. As @diasdavid said, piece-wise solutions to each problem (like pump) is probably the safest approach.

daviddias commented 8 years ago

Thank you everyone for all the input, it was really valuable for all the discussion.

Me and @dignifiedquire talked for a while and given our options, we decided to give pull-streams a spin, however this is not a full buy in, because we still do not know if it will solve the entire problem for libp2p (specially secio) and ipfs-repo.

The plan

Test pull-streams in libp2p

[x] Create a branch with pull-streams for:
- [x] tcp
- [x] spdy
- [x] swarm
- [x] multistream
- [x] secio
- [x] identify
[x] test libp2p with secio and with go-libp2p, making sure the impl is sound.

We hope to get this done no later by next Wednesday

IPFS Repo

[ ] Add a LRU cache to IPFS Repo
[x] Create interface-pull-blob-store
[ ] Implement s3-pull-blob-store
[x] Implement fs-pull-blob-store
[x] Implement idb-pull-blob-store
[x] Make the repo have a streaming interface again and integrate it with these pull-blob-stores

Note: We could avoid migrating these to pull-streams and just do the LRU cache + strong verification at the Repo, but since we are moving libp2p to pull-streams, it makes sense to use the same type of streams everywhere, plus we get all the other conviniences.

Acknowledgements

If pull-streams proves to be the right way, we will make sure to do the diligence of documenting how to use them and how to quickly convert any pull-stream to a regular Node.js stream, so if devs doesn't want to care about a pull-stream, they effectively don't need it, unless of course they want to contribute with patches, but that is where all the guidance comes in.
We want to make sure this discussion doesn't get lot in the sea of issues for future reference.
If pull-streams doesn't solve the issue, we will just have to acknowledge that and try Node.js streams again.

haadcode commented 8 years ago

we decided to give pull-streams a spin, however this is not a full buy in, because we still do not know if it will solve the entire problem for libp2p (specially secio) and ipfs-repo.

Can you give insights to what lead you to this decision? What were the pros and cons for pull-streams vs. RxJS?

daviddias commented 8 years ago

@haadcode The most important reasons are described in my first post, here: https://github.com/ipfs/js-ipfs/issues/362#issuecomment-237530938

In a tl;dr:

Moving back and forward between Node.js streams is complicated (while pull-streams is literally one function)
No notion of backpressure
Larger dependency
Higher cost of entrance. It requires more than learning how a lib works, but actually how the whole reactive programming paradigm should work.

ekroon commented 8 years ago

As an alternative to RX, with backpressure possibilities, there is also the Reactive Streams specification (http://www.reactive-streams.org/). It is still not something that feels very natural compared to streams in nodejs (compare https://github.com/reactive-streams/reactive-streams-jvm/blob/v1.0.0/README.md).

dignifiedquire commented 8 years ago

@ekroon I saw them mentioned in the discussions around backpressure but it seems there is no implementation in JavaScript currently. Implementing something like this would add a lot of work for us without being sure that it's worth it (https://github.com/reactive-streams/reactive-streams-js is empty :()

whyrusleeping commented 8 years ago

reactive streams also looks like its its own network protocol.

ekroon commented 8 years ago

@dignifiedquire @whyrusleeping in the simplest form reactive streams is only a set of interfaces that make it possible to subscribe to a stream and be notified of errors and the end of the stream (which is also in RX). In addition you can effectively switch between pull and push as there is an extra call on the subscription (normally only a cancel or unsubscribe), namely request(n); to request a maximum of new items. If the receiving end is limiting the flow by doing request(1) after a message, it would be effectively pull and when the sending end is limiting via request(bignumber); than it is effectively (unbounded) push.

Looking at node.js streams and pull-streams specifically, it looks to me that you can almost do the same with it. I am not completely sure if you can specify a lot of data and that way be a bit more pushy (I get a bit stuck when reading javascript ;)); and that reactive streams is (still) more a JVM party and not very wide spread, but that the interface itself would be mappable onto a pull stream implementation. (RxJs is pretty popular I thought nowadays for UIs, so there is hope for a reactive streams implementation)

As for a network protocol: it is not, but it is meant to be used within an application and crossing network boundaries. E.g. if I have an photo resize service split up in a resizer + stream chunker (creating JPEG or something) and a sender over the network (over some protocol providing flow control, TCP / SPDY ...), than I would be able to rate limit the sender by requesting only 1 new photo from the stream chunker which in turn could rate limit the TCP connection a little bit if needed.

dominictarr commented 8 years ago

for converting between node streams and pull streams: https://github.com/pull-stream/stream-to-pull-stream https://github.com/pull-stream/pull-stream-to-stream

stream-to-pull-stream has had a lot of battle testing and has been very reliable. in fact, it's worked so well that it's probably slowed down efforts to write lower level pull-streams which go directly to the libuv bindings. but those are popping up.

dignifiedquire commented 8 years ago

Closing as we have merged and released our first version of js-ipfs using pull-streams internally :)

ipfs / js-ipfs