nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org

FileHandle.stream #38350

Closed · ronag closed this 2 years ago

ronag commented 3 years ago

It might be possible to have a more efficient way to pipe a file to a stream by adding a .stream method to FileHandle.

e.g.


const EE = require('events')

FileHandle.prototype.stream = async function (dst, { start = 0, end, signal } = {}) {
  // TODO: this[kRef](), this[kUnref](), if (signal?.aborted) throw new AbortError()
  let pos = start
  while (true) {
    if (end !== undefined && pos >= end) {
      break
    }

    const size = end !== undefined ? Math.min(end - pos, 16384) : 16384
    const { buffer, bytesRead } = await this.read({
      buffer: Buffer.allocUnsafe(size),
      position: pos,
    })

    if (bytesRead === 0) {
      break
    }

    if (dst.writableNeedDrain) {
      await EE.once(dst, 'drain', { signal })
    }

    if (dst.destroyed) {
      break // TODO: break or throw?
    }

    dst.write(buffer.subarray(0, bytesRead))

    pos += bytesRead
  }
}
const fileHandle = await fsp.open(filePath)
try {
  await fileHandle.stream(dst)
} finally {
  await fileHandle.close()
}

vs

await pipeline(fs.createReadStream(filePath), dst)

Less ergonomic but potentially better performance. This idea needs benchmarking.

ronag commented 3 years ago

Could also have a static alternative.

await fsp.stream(filePath, dst)
ronag commented 3 years ago

Another (less efficient) idea is to have a method that returns an async generator.

await pipeline(fsp.generator(filePath), dst)
kuzmaMinin commented 3 years ago

please, Vasya

Ayase-252 commented 3 years ago

Could I take this challenge? 😁

benjamingr commented 3 years ago

Why a stream method rather than adding Symbol.asyncIterator? Is the goal perf or ergonomics?

ronag commented 3 years ago

Why a stream method rather than adding Symbol.asyncIterator? Is the goal perf or ergonomics?

Performance.

jasnell commented 3 years ago

Just to expand on this a bit to add to the discussion. I don't have a timeline yet on exactly when I'd be able to continue working on this, but as part of the effort around enabling fetch (and a few other things), I've been looking at support for the web platform API standard Body mixin. My preference would be for whatever we do here to be aligned with that API.

So, for instance, let's assume that a FileHandle implemented the Body mixin:

const file = await fs.promises.open('file', 'r+');

file.body;            // A WHATWG ReadableStream
file.arrayBuffer();   // The content of the file as an ArrayBuffer
file.blob();          // The content of the file as a Blob
file.formData();      // Doesn't really make sense here so probably good to omit
file.json();          // The content of the file as JSON
file.text();          // The content of the file as a string

Following this pattern, I'd suggest a file.readable() that returns a stream.Readable if the file is readable, and a file.writable() that returns a stream.Writable if the file is writable.

benjamingr commented 3 years ago

What about adding a Symbol.readable symbol like Symbol.asyncIterator that enables getting a readable stream version of stuff (like FileHandles)?

(Edit: obviously it'd be an import { ReadableSymbol } from 'stream' rather than monkey patching the global)

ronag commented 3 years ago

What about adding a Symbol.readable symbol like Symbol.asyncIterator that enables getting a readable stream version of stuff (like FileHandles)?

I would kind of like to move away from readables and towards async iterables. But maybe that’s a bigger discussion.

benjamingr commented 3 years ago

@ronag

I would kind of like to move away from readables and towards async iterables. But maybe that’s a bigger discussion.

Eventually deprecating streams in favour of async iterables as the contract would be ideal (one stream type, only pull, all modern promises with ease of debugging and syntax assist); it would also make incorporating whatwg streams nicer, but IIRC the performance really isn't there.

ronag commented 3 years ago

@ronag

I would kind of like to move away from readables and towards async iterables. But maybe that’s a bigger discussion.

Eventually deprecating streams in favour of async iterables as the contract would be ideal (one stream type, only pull, all modern promises with ease of debugging and syntax assist); it would also make incorporating whatwg streams nicer, but IIRC the performance really isn't there.

I think it depends heavily on whether it’s sync or async streams.

addaleax commented 3 years ago

I love this comment in #38440:

Why is pipe()/pipeline() so slow?

Because there is an obvious answer to that: Because we're not using a consistent streams model on the native side.

One of the reasons I tried to push for StreamBase adoption while I was a full-time contributor -- and, relevant to this particular issue, made FileHandle a StreamBase on the native side -- is that that way, we can have a much faster piping mechanism by just skipping the JS data layer entirely, and only moving data on the C++ side of things. (This isn't even particularly complex to implement: https://github.com/addaleax/node/commit/bcbf4597a9d29866c977141a9b50a7ae791f33ae. The only reason I haven't opened a PR with that back then is that it would only work for piping between readable FileHandles, sockets, and HTTP/2 streams, which didn't seem like a large enough set of use cases). And piping between native streams -- e.g. a file stream to zlib to HTTP to network or the other direction -- is a really common thing to do for Node.js users.

With suggestions like the implementation over in #38440, or the ones in https://github.com/nodejs/node/pull/38233, we're moving further away from that, because we're introducing more different ways of creating streams that are ultimately backed by native calls, yet make data available in a format that is only compatible through the JS interfaces they use.

From a software design point of view, that's not good. Unfortunately, the way Node.js is developed, with step-by-step iteration without much coordinated planning, does not lend itself to long-term design processes that require effort from multiple people very well (and typing this out makes me wonder if the RFC process suggested by @trott in https://github.com/nodejs/node/pull/37953 wouldn't be a really good idea here).

So, @ronag, @jasnell, I know you have opinions on things like StreamBase, but please keep this in mind. If you really care about Node.js streams performance, then the way to do that is really a consistent API on the native side, regardless of what API that is. If StreamBase doesn't fit our use cases anymore, great -- but please replace it with an API that is better, and don't just leave parts of Node.js on one model and use a new model on other parts, and then use the fact that we are using one API across the board to make Node.js fast.

jasnell commented 3 years ago

@addaleax, as we've discussed before, my only real concerns around StreamBase are the fact that it's extremely difficult to make any significant improvements to without breaking half of the Node.js ecosystem and the fact that it does not provide a way of doing the things I need it to do.

The QUIC implementation in #38233 does not move away from using StreamBase. It uses StreamBase directly. For instance, when doing something like stream.respondWith({ body: fs.promises.open('foo') }), the native layers will consume the FileHandle via the StreamBase API. The only difference is that there's another layer of internal buffering and incremental pulling of the buffered data that StreamBase simply does not implement any support whatsoever for. On the inbound data side, stream.readable() will be backed by a StreamBase to push the data out to the JavaScript side. The QUIC implementation is going to fully support the existing streams API -- it's just also going to support the ability to avoid using the streams API entirely.

As for #38440, I'm absolutely -1 on the way that implementation is done, regardless of the performance boost.

Ideally we would be able to make StreamBase faster and more flexible. Ideally we'd be able to do so in a way that doesn't break everyone. I'm not convinced yet that we can do both.

addaleax commented 3 years ago

@jasnell Right, and I can totally see that, and I don't mind that you're introducing something for QUIC that fits your needs better. I'm just saying that it makes sense to think about how the APIs around that are going to look in the long run, so that we can also make QUIC even faster in the future.

ronag commented 3 years ago

Because there is an obvious answer to that: Because we're not using a consistent streams model on the native side.

Also because streams are slowish. Both solutions here live in JS land and one is faster than the other.

ronag commented 3 years ago

I'm fine with not doing this. It was just an idea on how to get a little better perf with FileHandle (I'm doing it like this in our perf-sensitive code). In an ergonomic sense I would be fine with having FileHandle.readable() and/or FileHandle[Symbol.asyncIterator]().

jimmywarting commented 3 years ago

little late to the party... but how about something like

const file = await fs.getFile('./readme.md')
// file instanceof window.File
file.stream() // new whatwg:stream ReadableStream()
file.text() // Promise<string>
file.arrayBuffer() // Promise<ArrayBuffer>
await new Response(file).json()
await new Response(file).formData()
github-actions[bot] commented 2 years ago

There has been no activity on this feature request for 5 months and it is unlikely to be implemented. It will be closed 6 months after the last non-automated comment.

For more information on how the project manages feature requests, please consult the feature request management document.

github-actions[bot] commented 2 years ago

There has been no activity on this feature request and it is being closed. If you feel closing this issue is not the right thing to do, please leave a comment.

For more information on how the project manages feature requests, please consult the feature request management document.