microsoft / vscode

Visual Studio Code
https://code.visualstudio.com
MIT License

Add vscode.workspace.fs.createWriteStream(). #84515

Open ericsnowcurrently opened 4 years ago

ericsnowcurrently commented 4 years ago

(see #84175)

For the Python extension, we currently use node's fs.createWriteStream() for a variety of purposes:

  1. downloading multi-MB files (and tracking progress)
    • calling into a third-party library that expects an fs.WriteStream
  2. writing generated data out to a file a chunk at a time
  3. writing log data out to a file, one entry at a time

For example:

@jrieken

ericsnowcurrently commented 4 years ago

From https://github.com/microsoft/vscode/issues/84175#issuecomment-551445383:

we have some proposed API but nothing is final yet. @ericsnowcurrently What is this used for? Really big files or just because?

See above.

jrieken commented 4 years ago

Let's use this issue to track the open, close, read, write API. That's quite low-level, but it's the building block for streams.

https://github.com/microsoft/vscode/blob/ffe3749d5afb046f289d04f915d8d4ade47c028e/src/vs/vscode.proposed.d.ts#L152-L158

bpasero commented 4 years ago

From my recent experience implementing faster read/write IO for remote connections, I believe that the primitives are not a good choice; I would rather swap them for a stream solution with similar capabilities (e.g. start offset + length).

jrieken commented 4 years ago

Adding to the July milestone for discussion

jrieken commented 4 years ago

This won't happen soon. Discussion topics are mostly around streams vs chunks. The latter is the building block of the former, and I believe that providers will have an easier job with chunks while consumers will have an easier time with streams. Streams might become more interesting when we decide to investigate supporting a fetch-compatible API, which is also built around streams...

jrieken commented 4 years ago

Good read on stream APIs targeted for browsers: https://developer.mozilla.org/en-US/docs/Web/API/Streams_API

jrieken commented 3 years ago

JS supports async generator functions and async iterators, and they would offer a neat way to implement this. Much simpler than streams, yet equally powerful. Though the primitives that we have today are usually at the bottom of all stream/async-iterator solutions.

export interface FileSystemProvider {
    readFile(uri: Uri, token: CancellationToken): AsyncIterable<Uint8Array>;
}

connor4312 commented 2 years ago

We'd like to take advantage of the consumer side of this API in the hex editor, and we've already added code that uses the Node.js native filesystem API when possible, since the extension host API doesn't provide this yet. Getting a read(2)/write(2) analog would be quite useful 🙂

For the hex editor, a stream-only interface would not be sufficient, or at least not the one proposed in https://github.com/microsoft/vscode/issues/84515#issuecomment-868503753, which lacks a starting offset. We want to load data incrementally, and if the user scrolls from byte 0 to byte 2GB, we don't want to have to read and discard everything in between.

Likewise for writing -- although we do need to rewrite if the file length changes, if a single byte in a 2GB file is edited there's no need to rewrite the whole thing.

Going with a more primitive approach, I would not have a simple fd: number, and would instead have an API something like how we do bulk operations and edits:

export interface FileSystemProvider {
    open(uri: Uri, options: { create: boolean; writable: boolean }, fn: (handle: FileHandle) => Thenable<void> | void): Thenable<void>;
}

export interface FileHandle {
    read(pos: number /* ... */);
    // ...
}
The native open(2) as well as fs.promises.open() lock the file on a filesystem level while changes are being made to provide consistency. It would make sense for this API to have the same guarantees. Wrapping the file handle in a bulk-style callback like this ensures disposal and discourages long-lived use of the FileHandle in a way that would cause locking problems for users. We could also provide a warning if a lock is held too long. Its nicely encapsulated lifecycle means that appropriate progress indicators can be shown on the file while writes happen if options.writable is set.

bpasero commented 2 years ago

The native open(2) as well as fs.promises.open() lock the file on a filesystem level while changes are being made to provide consistency.

Are you sure? On all platforms?

It is unsafe to use fs.write() multiple times on the same file without waiting for the callback. For this scenario, fs.createWriteStream() is recommended.

https://nodejs.org/docs/v14.16.0/api/fs.html#fs_fs_write_fd_buffer_offset_length_position_callback

I just recently added write locking for the node.js based file system provider via Barrier: https://github.com/microsoft/vscode/commit/9df559fdbbe7b75879fa7070dfc2588b4ada4565#diff-43c296d8b387331e4b0756c4f496d4f6102b1808554f598c748cbc4d8b33096d

connor4312 commented 2 years ago

Are you sure? On all platforms?

At least when I tried it on Windows in O_RDWR | O_CREAT mode. It was for that reason I added code to release the file descriptor when not in use.

It is unsafe to use fs.write() multiple times on the same file without waiting for the callback

I don't think that is contradictory; it's saying that if your application calls open() and gets the FileHandle back, multiple write operations should not happen on that file at the same time. The file is still locked for other processes.

bpasero commented 2 years ago

👍 didn't know that.

I like the idea of providing an API that reduces the chances of an extension forgetting to close the file handle, because that would only ask for trouble. Now that the disk provider locks from the open call, we do not allow any more writes until close is called, and there is no warning printed either.

jrieken commented 2 years ago

For the hex editor, a stream-only interface would not be sufficient, or at least the one proposed in #84515 (comment) which lacks a starting offset.

Yeah, that's the shortened variant of the existing read-function which does take an offset, like write. I do like enforcing the "transaction" character by the callback, but I also like async-iterables...

connor4312 commented 2 years ago

Why not both?

dbaeumer commented 2 years ago

Stupid question: how would I, as an extension, access the chunk API?

The current proposed API allows a file system provider implementor to provide these, but I didn't find any API for how I would call them. I can't get to an individual FileSystemProvider, and the FileSystem interface cannot offer these methods unless we add a default implementation or throw when they are not available on the underlying provider.

jrieken commented 2 years ago

Yeah, the proposal doesn't expose them on the "consumer" side yet

AlbertoPdRF commented 2 years ago

Hey!

I think this is the right place, so I'm going to drop the question: as of now, is it possible to read a large file in chunks using vscode.workspace.fs? I believe not, but I may have missed something in the conversation above. If not, will it be possible, and when?

Thanks!

AlbertoPdRF commented 2 years ago

By the way, I also found the related #41985 in the backlog, which can maybe be closed to avoid duplication as per https://github.com/microsoft/vscode/issues/84515#issuecomment-559051830.

bpasero commented 2 years ago

I think they are different proposals. One for POSIX primitives and one for a method to write via a stream.

AlbertoPdRF commented 2 years ago

Is there any chance that the proposed fsChunks API finalization gets included in September's iteration plan?

AlbertoPdRF commented 1 year ago

Hey! Is there any information on this? Could the proposed fsChunks API finalization get included in one of the coming monthly iteration plans? If any help is needed, I would be happy to pitch in!

AlbertoPdRF commented 1 year ago

Hi @bpasero @jrieken, is there any chance that the finalization of the fsChunks API is included in any of the coming monthly iteration plans? Is there anything I can do to help move this forward? Or any chance to use the API even if it's not finalized?

I'd like to at least know the status on this, as I've had a PR (https://github.com/AlbertoPdRF/root-file-viewer/pull/20) solving two issues blocked by this for almost a year now.

AlbertoPdRF commented 1 year ago

@mjbvz, maybe you can reply?

zm-cttae commented 1 year ago

As it's backlogged, they presumably have other work they're looking at at the moment. Big +1 to this feature, though.

rbuckton commented 7 months ago

@jrieken: I'm running into an issue with DeoptExplorer (https://github.com/microsoft/deoptexplorer-vscode) when attempting to parse very large log files, due to workspace.fs.readFile having a 2GiB memory limit. I'd originally used Node.js's fs to open a readable stream and read a single line at a time, but I lost that capability when I switched to using workspace.fs. Unfortunately, V8 isolate logs can get fairly large, and there's no easy way to trim them down.

A streaming API would be a useful building block, but a workspace.fs.readLines async generator method that can yield a line at a time without needing to load the whole file into memory would also serve my needs.

Lramseyer commented 5 months ago

Since this seems to be the go-to discussion thread for the fsChunks API proposal, I'd also like to +1 this. I'm building a digital waveform viewer extension. Those files can be on the order of gigabytes, so being able to read individual chunks of a file would greatly improve performance and make it useful for a lot more projects.

Here's my extension: https://github.com/Lramseyer/vaporview