dignifiedquire closed this issue 8 years ago
The only other viable alternative to node-streams I have found so far is pull-streams: https://pull-stream.github.io. They are quite nice, and there is an example of secio implemented with them here: https://github.com/libp2p/js-libp2p-secio/pull/8
The main barrier I hit with them was integration with node-streams, so if we were to move the whole code base to pull-streams, this might solve those issues.
First, thank you @dignifiedquire for chasing this streams issue and for researching possible solutions by doing experiments with actual implementations, really, 5 🌟 👌.
To give a bit more emphasis to the issues we've realised (the hard way): Node.js Streams, also known as Streams 2, (problem 1) don't propagate errors, even (especially) when piped (.pipe) together. Adding to this, (problem 2) the fs writable streams don't guarantee that the data has been flushed to disk when the 'end' event is emitted, opening the door to race conditions when we try to read faster than the syscall has time to flush. The last issue (problem 3) doesn't come from Node.js streams themselves, but from the streams ecosystem as a whole, where not all libraries that implement streams actually implement the full interface/expectations, leaving room for errors and requiring some monkey patching.
I believe, after having gone through this and also talked with several people, that we shouldn't treat these 3 problems as one and fall into the trap of patching the 3 pain points with one large bandaid, which would bring in a lot more than we need, increasing complexity and also hurting performance and the js blob size.
Just to put in context, a normal communication between two IPFS peers is something like:
pump is a small module that offers correct error propagation and stream destruction. It can be seen as .pipe done right. pump is part of mississippi, a collection of utility modules for Node.js Streams.
rxjs is a very large project supported by lots of groups and companies; its documentation is very mature and exists for more languages than just JavaScript. It offers a lot of nice APIs to operate on streams of data, or 'arrays over time'. However (from what I can understand), it doesn't offer any backpressure mechanism, which is essential for networking protocols or operations over large quantities of data. In some sense, it is similar to Node.js Streams 1 (Streams Classic), but with fancier APIs.
pull-streams are super interesting, I really like how the ecosystem is growing, and they result from frustrations similar to those Node.js developers have had in the past.
I was initially confused about how pull-streams provide a backpressure mechanism, especially one compatible with the underlying transports (push streams), but @yoshuawuyts appeared at the IPFS workshop to the rescue and gave me a really good explanation.
Essentially, pull-streams leverage the underlying push-streams' highWaterMark to do the network backpressure for them, as seen in the following image:
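The demand-driven contract behind this can be sketched from first principles (a hypothetical toy, not the pull-stream module itself): a source is just a function `(end, cb)`, and nothing flows until the sink asks, so the reader's pace is the backpressure.

```javascript
// A pull-stream style source: a function (end, cb). It produces a value
// only when called, so the consumer controls the rate.
function values (arr) {
  let i = 0
  return function read (end, cb) {
    if (end) return cb(end)
    if (i >= arr.length) return cb(true) // true signals end-of-stream
    cb(null, arr[i++])
  }
}

// A sink that deliberately pulls one item at a time.
function collect (source, done) {
  const out = []
  function next () {
    source(null, (end, value) => {
      if (end) return done(out)
      out.push(value)
      next() // only now does the sink ask for the next value
    })
  }
  next()
}

collect(values([1, 2, 3]), (out) => {
  console.log(out) // [ 1, 2, 3 ]
})
```

Because each value is requested explicitly, a slow consumer simply calls the source less often; no highWaterMark buffering is needed inside the pull pipeline itself.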
This is a requirement for js-ipfs, as each part of the code works in isolation from the remaining services, but they all mount on libp2p to open the connections. As seen in Figure 1, an error in a network socket will cause several parts of the code to have to handle that error and react to it.
It seems to me that any of the proposed solutions would solve this problem really well. Then it will be up to the developers to understand how to handle the network errors if they open a connection themselves.
It doesn't seem to me that any of the proposed solutions offers a fix for this problem, because in the end we are always using Node.js streams to write to disk.
What we can (and should) do is solve this issue at the ipfs-repo level, making sure the data is flushed before we let the 'end' event be emitted, and, while we are at it, add an LRU cache to make reads faster, also mitigating the problem of false negatives.
This feels a bit like lying to ourselves: it is true that not all stream modules implement Node.js streams as we would expect, but that doesn't mean that whatever other modules we get for pull-streams or rxjs will do everything just right.
This has to be solved by having very comprehensive tests (unit, integration and chaos tests) to make sure we don't miss anything.
It is important to note that modules like spdy won't have a complete rxjs or pull-streams version (at least not soon).
We have to be cautious when proposing/moving to other ecosystems. Every single one of them has its quirks, and moving back and forth will always add noise to the migration. There are plenty of libraries and frameworks that optimize the developer experience; however, since they might require a huge paradigm shift in code structure, they fail to gain adoption. In the end, it is people's interest that makes a good lib/framework. One good example is the async module: although not a language construct, it is well known in the Node.js community and its patterns are widely known.
This takes me to my next point, which is: in order to maximize adoption and contribution, language-native APIs are typically the best bet.
In the end, independently of which solution we decide to go for, we should make sure we don't add more complexity and problems for the long term. Bringing in a framework to solve control flow will certainly add a considerable amount of overhead (the large bandaid).
We can probably get away with pump and a repo that knows how to cache and flush properly.
pull-streams look pretty exciting with their low footprint and convenience, but they only solve our issues partially (error propagation), and we would need to expose regular streams anyhow, which they do pretty conveniently; the same doesn't happen for rxjs.
Let's make sure to chat Friday, make a decision and move on. Please do comment in this thread beforehand, so that we make the meeting very efficient.
Thank you very much @diasdavid for this comprehensive post, below are a couple of notes just for clarification.
As we can see in Figure 1, we do a lot of plumbing between the several pieces of the code sometimes through read X bytes, others by directing the communication (.pipe), others by wrapping the stream into a new stream (duplexify). I don't see a clean and easy way to do that either with rxjs or pull-streams, without having to do even more wrapping and patching. This will make the code more complex and therefore, error prone.
This is very nicely solved in pull-streams. Wrapping, and things like reading from streams byte by byte, is super elegant, as can be seen for example here: https://github.com/libp2p/js-libp2p-secio/blob/pull/src/handshake/finish.js#L22-L32
> This takes me to my next point which is, in order to maximize adoption and contribution, language API are typically the best bet.
Yes, but I don't think we should sacrifice our productivity just for adoption as a library. We don't want to clash with everything the community does, but this shouldn't stop us from trying to write the most excellent and efficient code that we can.
> We probably can get away with pump and the repo that knows how to cache and flush properly.
I'm afraid we can't get away with this, as the data flushing issues are more serious than just the fs part. The most pressing issue, the one which led me to investigating other stream solutions, was the one I encountered when trying to implement secio: when wrapping node-streams and trying to properly manage things like pause and resume, data gets lost if you do not attach to the stream quickly enough. As far as my understanding goes, pump does not solve this issue in any way.
> Let's make sure to chat Friday and make a decision and move on, please do comment in this thread before hand, so that we make the meeting very efficient.
I don't know what your schedule looks like, but I would appreciate it if we could actually have this chat today, given that it is currently blocking 90% of the work I want to get a move on with.
Ok, let's do it today then :) 4pm UTC sounds good? http://www.worldtimebuddy.com/?qm=1&lid=2267057,100,308,12&h=100&date=2016-8-4&sln=16-17
Pinging @haadcode @nginnever @xicombd @RichardLitt @VictorBjelkholm @shamb0t and anyone else following this, you are welcome to join the chat, don't feel obligated to though :)
> It doesn't seem to me that any of the proposed solutions offers a fix for this problem, because in the end, we are always using Node.js stream to write to disk.
https://github.com/dominictarr/pull-fs uses fs.open() to create a file descriptor and then uses the cursor to write to disk in chunks. It sidesteps all of Node streams, so it's actually possible to ditch all, if not most, of Node streams. Probably the only place where that's currently not being done is net / http, as they rely on the stdlib, but there are wrappers available that abstract away the internals and only expose the interface (such as pull-http-server). There should be no need to ever have Node streams in the browser.
> Important to note that, modules like spdy, won't have a rxjs or a pull-streams complete version (at least soon)
Yeah, chances are slim that will happen soon. But that's fine, because in Node land wrapping the Node streams API isn't the worst: it supports backpressure, so any other paradigm can be mapped onto it easily, and perf should be comparable. Here's an example pull-streams http wrapper.
I strongly recommend and vote that we keep exposing natural (language-construct based) APIs, such as callbacks, promises and streams. Requiring a user to buy into a library to be able to use js-ipfs creates a lot of friction.
Probably the only native abstraction capable of doing streaming stuff well would be an explicit OO API to interface with the streaming state machine, like Node's fs file descriptor methods. All other abstractions could then be built on top of that.
I'd be super wary of exposing multiple interfaces, as JS, the language™, is a moving target. E.g. what will the story be for async / await, event emitters, promise cancellables? I reckon versioning and updating interface-specific packages built around a single core API would be the most sane approach, keeping everyone happy.
An example OO API:
```js
const ipfs = require('ipfs')

ipfs('/my-cool/repo-name', (err, node) => {
  if (err) throw err

  const cat = node.files.cat(function (err, data) {
    if (err) return console.log('error: ' + err)
    console.log('chunk: ' + String(data))
  })

  cat.start()
  cat.pause()
  cat.stop()

  node.stop()
})
```
For native abstractions you could totes create a plugin system to expose abstractions, if anything my playground repo has shown that wrapping the API isn't that hard:
```js
// pull-stream
const ipfsPull = require('ipfs-pull-stream')
const pull = require('pull-stream')
const ipfs = require('ipfs')

ipfs('/my-cool/repo-name', (err, node) => {
  if (err) throw err
  node.use(ipfsPull())
  const cat$ = node.files.cat('<hash>')
  pull(cat$, pull.log())
})
```
```js
// node streams
const ipfsNodeStreams = require('ipfs-node-streams')
const ipfs = require('ipfs')
const pump = require('pump')

ipfs('/my-cool/repo-name', (err, node) => {
  if (err) throw err
  node.use(ipfsNodeStreams())
  const cat$ = node.files.cat('<hash>')
  pump(cat$, process.stdout)
})
```
Does this make sense?
If you guys don't make a solid decision during this meeting, I'm going to make the decision for you.
That is a threat I intend to follow through on.
Really awesome work @dignifiedquire in concretely identifying the pain points in dealing with node streams. I can't make the hangout, but here are my thoughts:
For me, pull streams' #1 superpower is their simplicity (in the Rich Hickey sense of the word): anyone can write a full pull stream implementation from scratch in a few minutes, from first principles. This is not true of node streams in the slightest. Simplicity brings transparency with it, meaning debugging and reasoning about the implementation gets easier. I think this is the biggest argument I can make for pull streams. In this sense, I bet js-ipfs would benefit tremendously.
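To make the "from first principles" claim concrete, here is a hedged from-scratch toy (plain JS, not the actual pull-stream package): a source is a function `(end, cb)`, a through wraps a source, and a `pull` helper composes them.

```javascript
// Compose a source, zero or more throughs, and a sink left to right.
function pull (...parts) {
  return parts.reduce((src, next) => next(src))
}

// Source: hands out one array element per call; cb(true) means done.
function values (arr) {
  let i = 0
  return (end, cb) => {
    if (end) return cb(end)
    cb(i < arr.length ? null : true, arr[i++])
  }
}

// Through: wraps a source, transforming each value with fn.
function map (fn) {
  return (source) => (end, cb) =>
    source(end, (end2, data) => cb(end2, end2 ? undefined : fn(data)))
}

// Sink: pulls everything and calls done with the collected values.
function collect (done) {
  return (source) => {
    const out = []
    function next () {
      source(null, (end, data) => {
        if (end) return done(null, out)
        out.push(data)
        next()
      })
    }
    next()
  }
}

pull(
  values([1, 2, 3]),
  map((x) => x * 10),
  collect((err, out) => console.log(out)) // [ 10, 20, 30 ]
)
```

Roughly thirty lines for a working source, through, sink and composer, which is the transparency argument in a nutshell: the whole contract fits in your head.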
That said, pull streams lack mainstream awareness. Despite being simple, grokking pull streams is certainly not easy. It took me a while to wrap my head around the big ideas, and I still don't wield them as intuitively as node streams. Pushing IPFS contributors to grok pull streams could hurt the project's approachability.
Summary: deviating from node streams in 2016 seems like a high-risk move re raising the barrier to contribution, despite potential gains. As @diasdavid said, piece-wise solutions to each problem (like pump) are probably the safest approach.
Thank you everyone for all the input, it was really valuable for the discussion.
@dignifiedquire and I talked for a while and, given our options, we decided to give pull-streams a spin. However, this is not a full buy-in, because we still do not know if it will solve the entire problem for libp2p (especially secio) and ipfs-repo.
We hope to get this done no later than next Wednesday.
Note: we could avoid migrating these to pull-streams and just do the LRU cache + strong verification at the repo, but since we are moving libp2p to pull-streams, it makes sense to use the same type of streams everywhere, plus we get all the other conveniences.
> we decided to give pull-streams a spin, however this is not a full buy in, because we still do not know if it will solve the entire problem for libp2p (specially secio) and ipfs-repo.
Can you give insights to what lead you to this decision? What were the pros and cons for pull-streams vs. RxJS?
@haadcode The most important reasons are described in my first post, here: https://github.com/ipfs/js-ipfs/issues/362#issuecomment-237530938
In a tl;dr:
As an alternative to Rx, with backpressure possibilities, there is also the Reactive Streams specification (http://www.reactive-streams.org/). It is still not something that feels very natural compared to streams in Node.js (compare https://github.com/reactive-streams/reactive-streams-jvm/blob/v1.0.0/README.md).
@ekroon I saw them mentioned in the discussions around backpressure but it seems there is no implementation in JavaScript currently. Implementing something like this would add a lot of work for us without being sure that it's worth it (https://github.com/reactive-streams/reactive-streams-js is empty :()
Reactive streams also looks like it's its own network protocol.
@dignifiedquire @whyrusleeping in its simplest form, reactive streams is only a set of interfaces that make it possible to subscribe to a stream and be notified of errors and the end of the stream (which is also in Rx). In addition, you can effectively switch between pull and push, as there is an extra call on the subscription (which normally only has cancel or unsubscribe), namely request(n), to request a maximum number of new items. If the receiving end limits the flow by doing request(1) after each message, it is effectively pull; when the sending end limits via request(bignumber), it is effectively (unbounded) push.
Looking at Node.js streams and pull-streams specifically, it looks to me like you can do almost the same with them. I am not completely sure if you can request a lot of data and that way be a bit more pushy (I get a bit stuck when reading JavaScript ;)), and reactive streams is (still) more of a JVM party and not very widespread, but the interface itself should be mappable onto a pull-stream implementation. (RxJS is pretty popular for UIs nowadays, I thought, so there is hope for a JavaScript reactive streams implementation.)
As for a network protocol: it is not one, but it is meant to be used within an application and across network boundaries. E.g. if I have a photo resize service split up into a resizer + stream chunker (creating JPEGs or something) and a sender over the network (over some protocol providing flow control, TCP / SPDY ...), then I would be able to rate limit the sender by requesting only 1 new photo from the stream chunker, which in turn could rate limit the TCP connection a little bit if needed.
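The request(n) idea described above can be sketched from scratch in plain JS (a hypothetical toy, not the actual spec interfaces): the subscriber paces the publisher, so request(1) per item behaves like pull, while a huge request would behave like unbounded push.

```javascript
// Toy Reactive Streams style publisher: emits items only while the
// subscriber has outstanding demand, signalled via subscription.request(n).
function publisher (items) {
  return {
    subscribe (subscriber) {
      let i = 0
      let demand = 0
      let done = false
      subscriber.onSubscribe({
        request (n) {
          demand += n
          while (demand > 0 && i < items.length) {
            demand--
            subscriber.onNext(items[i++])
          }
          if (i >= items.length && !done) {
            done = true
            subscriber.onComplete()
          }
        },
        cancel () { i = items.length; done = true }
      })
    }
  }
}

const received = []
publisher(['a', 'b', 'c']).subscribe({
  onSubscribe (sub) { this.sub = sub; sub.request(1) }, // start by asking for one
  onNext (item) {
    received.push(item)
    this.sub.request(1) // pull the next item only after handling this one
  },
  onComplete () { console.log('done:', received) }
})
```

Swapping `sub.request(1)` for `sub.request(Number.MAX_SAFE_INTEGER)` in onSubscribe (and dropping the per-item request) turns the same pipeline into push, which is the switch @ekroon describes.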
For converting between node streams and pull streams: https://github.com/pull-stream/stream-to-pull-stream https://github.com/pull-stream/pull-stream-to-stream
stream-to-pull-stream has had a lot of battle testing and has been very reliable. In fact, it's worked so well that it's probably slowed down efforts to write lower-level pull-streams which go directly to the libuv bindings. But those are popping up.
Closing as we have merged and released our first version of js-ipfs using pull-streams internally :)
Reasons for consideration
In the recent months of developing js-ipfs we have had some issues with various parts of the different versions of node-streams. Some of the major ones were:
- error handling, especially as we .pipe and wrap streams many times from module to module

In light of these issues we are exploring different alternatives to replace the stream-based code inside js-ipfs.
Currently I have identified two possible alternatives
1. Pull Streams
Example code
Resources
2. RxJS
Example code
Resources