Research the feasibility to providing a VFS extension mechanism to `fs`, `require()`, `child_process`, etc

jviotti commented 2 years ago

Refer to https://github.com/nodejs/single-executable/pull/31#discussion_r963126103 for a better explanation of the problem.

Monkey-patching all of these functions is non-trivial and requires a lot of code. The user-facing surface of this is pretty large, and I wonder if providing an extension point at the lower levels is actually easier.

Can we do some research to evaluate the feasibility of this?

@RaisinTen You have dug a bit on this on Postman's internal bootstrapper patch, but we do welcome any help or insights from the community.

RaisinTen commented 2 years ago

Btw, after https://github.com/nodejs/node/pull/44537 landed, we are having some related conversation in https://github.com/nodejs/node/pull/44713.

RaisinTen commented 2 years ago

The user-facing surface of this is pretty large, and I wonder if providing an extension point at the lower levels is actually easier.

Sounds like https://github.com/nodejs/node/pull/39711#issuecomment-895643026 is a potential approach we can try experimenting with:

// node.h
class UvDelegate {
 public:
  explicit UvDelegate(node::Environment* env);

  virtual int uv_fs_open(...);
  virtual int uv_fs_close(...);
};

// node.cc
void InternalModuleReadJSON() {
  ...
  int fd = env->delegate()->uv_fs_open();
  ...
}

// user.cc
class EmbedderUvDelegate : node::UvDelegate {
  int uv_fs_open(...) override;
  int uv_fs_close(...) override;
};

node::CreateEnvironment(..., delegate, ...);

or something simpler:

// node.h
struct UvAPIs {
  UvFsOpenFunc uv_fs_open = uv_fs_open;
  UvFsCloseFunc uv_fs_close = uv_fs_close;
  ...
};

// user.cc
UvAPIs uv_apis;
uv_apis.uv_fs_open = ...;
uv_apis.uv_fs_close = ...;

node::CreateEnvironment(..., uv_apis, ...);

but I'm not sure if this is actually "easier".

jviotti commented 2 years ago

@RaisinTen Yeah, this is exactly what I'm talking about!

but I'm not sure if this is actually "easier".

I do think it's easier from an "amount of code" point of view and I'd be interested in exploring this further. Maybe do a demo patch if you are up for it @RaisinTen ?

The only limitation I can see here is that while it does make it potentially easier to implement a VFS on C++, it would be hard to use the same APIs from JavaScript, right? I wonder if we could also expose the same "delegate" concept or something like that to JS too?

RaisinTen commented 2 years ago

The only limitation I can see here is that while it does make it potentially easier to implement a VFS on C++, it would be hard to use the same APIs from JavaScript, right?

My understanding was that the patching would take place completely at the C++ level and there should be no extra APIs exposed to JS?

arcanis commented 2 years ago

It's still unclear to me why you want/need C++ hooks? Why isn't patching fs enough?

RaisinTen commented 2 years ago

Patching fs is problematic because it's hard to keep those patches in sync with the upstream fs API changes from node. This is a problem Electron's ASAR has faced and we have sent a PR to address at least one of the API changes in https://github.com/electron/electron/pull/34764.

arcanis commented 2 years ago

Yes, I'm familiar with this problem; still:

if the goal of this project is to have one "blessed" VFS implementation specifically made for reading files from archives, then it doesn't have to have a hook API, it can be integrated with Node. By this point, having it in JS is fine, since tests will be part of the Node codebase itself.
if the goal is instead to allow for multiple VFS implementation, then native is out of the question since it's not portable (Node scripts and packages are portable by default, and I think it's something very important to preserve).

And for experimentations, patching fs ourselves isn't perfect for the reasons you said, but we've done that for some time and we can keep doing it until we find a format we all agree on. We probably don't need a C++ API just for this?

RaisinTen commented 2 years ago

if the goal is instead to allow for multiple VFS implementation

I think that was the general agreement.

We probably don't need a C++ API just for this?

I don't think doing C++ hooks is a strict requirement. We're trying to find the most optimal level where allowing the configuration of these APIs would reduce maintenance burden for both node and userland. It probably should be somewhere between monkey-patching of node's fs apis and allowing overloads for libuv c++ functions.

Today @jviotti and I were discussing about doing this in the fd level by just monkey-patching fs.open but it seems that there are a lot of path based APIs that would remain unaffected, so we should think of a different approach.

We also realized that supporting the plain monkey-patching approach would require patching:

fs
child_process
cjs module loader - Module._stat and Module._readPackage (not sure if esm needs patching of some other components)

I wonder if we can come up with a single API that allows users to configure all of these at once.

arcanis commented 2 years ago

if the goal is instead to allow for multiple VFS implementation

I think that was the general agreement.

This wasn't my understanding, and under this prompt I believe the work drastically changes. If, instead of finding a blessed format, and integrating it transparently inside Node itself (similar to phar, eggs, or jar archives), the intent is to simply make it easier to support multiple VFS implementations, then:

It's much less useful imo, performance and transparent behaviors were the two main user-facing advantages I saw to this project, over existing solutions.
Node.js already has the proper layers (patching for CJS, a resolver overriding node:fs for ESM). All that remains then is to build a proper way for userland authors to test that their VFS implementations have the expected behavior compared to the standard Node APIs (essentially, a testsuite-as-a-library). And potentially to add a couple of VFS in CITGM, although it's almost there already, only waiting for https://github.com/nodejs/citgm/pull/905 to land.

GeoffreyBooth commented 2 years ago

cc @nodejs/loaders

jviotti commented 2 years ago

if the goal of this project is to have one "blessed" VFS implementation specifically made for reading files from archives, then it doesn't have to have a hook API, it can be integrated with Node. By this point, having it in JS is fine, since tests will be part of the Node codebase itself.

The observation here is that even if we provide a "blessed" VFS implementation in the context of SEAs, there are other projects that implement VFSs for reasons other than SEAs, and they still monkey-patch a lot of functions. If there is a way we can provide hooking capabilities in Node.js, then we would both solve the SEA problem AND solve the overall monkey-patching situation that everybody seems to agree sucks and should not happen (and it's incompatible with ESM).

I think that if we widen our scope a bit more, we might be able to come up with a solution that fixes the entire problem for once and for all.

That said, I don't think blocks the rest of the SEA work. For now, we can monkey patch as everybody else is doing (or patch Node.js directly for the sake of our experiments) while we try to find a proper solution to the problem, and hopefully use that later on.

This wasn't my understanding, and under this prompt I believe the work drastically changes. If, instead of finding a blessed format, and integrating it transparently inside Node itself (similar to phar, eggs, or jar archives), the intent is to simply make it easier to support multiple VFS implementations, then:

We want to find a "blessed" one, but based on all the previous discussions we had over the past weeks, it seems that identifying a VFS that checks all the boxes will be tricky. The decision that came out of the last meeting was:

Accepting that we don't know of a VFS implementation that checks all the boxes
Embracing that knowledge gap as the ground for further experimentation on the VFS area
Run SEA experiments with multiple VFS and see what we learn from that
Then properly revisit the problem of selecting a blessed one

For point 3, we will need to integrate with many VFS (for the time being) as we experiment with them, so the problem of hooking in into Node.js becomes more problematic for us. This is all internal experimentation and we would hope to ship a blessed VFS at some point, but we don't know which yet.

CMCDragonkai commented 1 year ago

I wrote a VFS a long time ago https://github.com/matrixai/js-virtualfs and indeed Node's FS API has changed quite a bit since then. My company hasn't had resources to keep it up to date since we moved to js-encryptedfs.

If nodejs itself supported a FS interface "blessed", then enabled different underlying implementations, that might be quite useful.

joyeecheung commented 11 months ago

I think to support VFS there are two things that need to be added to Node.js core:

The ability to embed at least some assets into SEA https://github.com/nodejs/single-executable/issues/68
The ability (likely a programmable API) to intercept file system access performed through any Node.js API and potentially return something that's different from what is actually in the file system.

How 2 can be used on top of 1 to build a fully functional VFS can be either built-in or be left to the user land. But at least 1 and 2 have to live in core for a reliable VFS implementation. It seems to me 2 is probably more like a Node.js core issue, as it also matters for use cases beyond SEA e.g. tests mocking the file system, and it needs some proper mechanism to e.g. work with the permissions model (instead of introducing a leak in that model).

Qard commented 11 months ago

IMO it makes the most sense to implement VFS as a provider of the FileSystem standard API along with a provider for direct fs access. The API is a lot more conducive to virtualization. We could also modify workers to take a FileSystem to "jail" its fs access to that virtual file system, routing all fs access (including require/import) through it. When no FileSystem is provided it would just inherit the direct-to-disk FileSystem as the root which everything hangs off. This could be useful for flexible sandboxing; could technically even be nestable. Might also be worth providing helpers to build jailed FileSystem instances scoped to particular directories, or build archive files to pass around.

joyeecheung commented 11 months ago

Modelling the hooks based on FileSystem sounds like a good idea, that can save us a lot of design work. I think for a proper VFS that works with existing code though we may need some sort of synchronous extensions as FileSystem is mostly async, while many of the APIs that users need to intercept are probably synchronous.

joyeecheung commented 11 months ago

Oh actually there is also a synchronous variant of the FileSystem handles https://fs.spec.whatwg.org/#api-filesystemsyncaccesshandle

arcanis commented 11 months ago

Those interfaces are high level and lack many features of the typical fs API. A virtual filesystem would need to map at the "syscall" level, with a couple of higher level utilities for performances (same as readFile is a wrapper over open + read + close, but those functions are still available).

Qard commented 11 months ago

There could be an internal handle with more power which the FileSystem instance wraps which could be extracted for the more power-user bits to use. For example, when used as a "jail" for a worker it could extract that handle as a base for the fs module to do all the normal fs things on. From the user perspective though, they would have just constructed the standard FileSystem on the parent thread.

I think as much as possible we should try to stick to exposing standard interfaces, even if they're just wrappers of more powerful interfaces for the cases where standard APIs are insufficient.

mcollina commented 11 months ago

I think a lot can be obtained by registering new URL schemes in our filesystem API, e.g.fs.readFile('yarn://mymodule.

This has a few benefits:

avoid most overhead for the current fs calls
minimize breakage
compatible with the ESM/loader ecosystem
compatible with SEA

An alternative with similar properties is "mounting" a filesystem on a given path.

I would stay away from providing an "overlay" filesystem, as I think it would have a significant issues with performance.

arcanis commented 11 months ago

I don't think the Node API should enforce any kind of format over the paths. If it goes through fs, it should be available to the FS layer implementers.

Mounting paths would be vastly better than a protocol[^1], but it still feels like an arbitrary restriction to me; whether the filtering is performed by Node or by the userland code, the performances will be the same.

Also keep in mind we have prior work here. Both Electron and Yarn have been operating production-grade virtual filesystems for many years now, and perhaps there's no need to reinvent the wheel.

[^1]: A protocol would break a major chunk of the ecosystem relying on path manipulation. Anything using path.resolve or path.relative would be unable to use virtual filesystems.

mcollina commented 11 months ago

Mounting paths would be vastly better than a protocol1, but it still feels like an arbitrary restriction to me; whether the filtering is performed by Node or by the userland code, the performances will be the same.

I agree with you on mounting a path. We do not agree on the performance of filtering on top on an existing path. The problem is that in that case, you'd need to check both the vfs and the actual file system for each API call, and mix the result. This is going to be very complex, likely bug prone and difficult for certain APIs.

Have you got links to the vfs implementations of electron and yarn handy?

joyeecheung commented 11 months ago

I imagine we would just implement it as a if (UNLIKELY(realm->has_fs_delegate()) kind of check in C++ and if there is a delegate, we can call a user-defined delegate (we can call into JS too if that’s configured). And for certain commonly used fs operations (fs.open etc) we can just coalesce the calls in JS. That looks sufficiently performant to me if we are only checking a boolean.

GabenGar commented 11 months ago

Electron: bundles an entire v8 runtime and a chrome, so even a hello world is at least 200MB in size. Also comes with loads of custom interfaces to cooperate the two. Not exactly a plug'n'play nodejs solution. Yarn: calling its vfs implementation "production-grade" is quite a stretch, since I have yet to see a repo which uses yarn above version 1. My experience of migrating from version 1 to 3, which ended up migrating to npm due to various issues, such as requiring a custom extension for IDE (that still means typescript won't work with them) and a pretty cryptic docker config step. So not very production-ready even if the idea is pretty interesting.

The core problem of tying vfs to existing fs API is ambiguity of path resolutions. Let's say we have a potential SEA with this structure (details of internal implementation don't matter):

index.js
/a/index.js
/b/index.js

Then we put sea.exe file into a folder which ends up in this structure:

/sea
  /index.js
  /a/index.js
  /b/index.js
/sea.exe

Where does the path /sea/a/index.js point to? Having /sea.exe/a/index.js is not an option, since it implies /sea.exe is a folder. Which is not true, because it's clearly a file, and dirent.isFile() will be true, which is quite a footgun for anyone accustomed to file system APIs but not the vfs specifics. And we didn't even get to multiplatform issues, of which there will be plenty. So the "smart" way, aka fs modules automagically understanding vfs, will have a bunch of side effects and unintuitive (and potentially inconsistent between platforms) assumptions about the folder where vfs file is called. The other way is to create a module to interact with vfs files in NodeJS. This has the benefit of separating (potentially inlined) vfs calls from normal fs calls inside a given vfs without relying on arcane tricks which require monkey-patching fs in the first place.

mcollina commented 11 months ago

Here is the kind of API that I would like to see:

import { mount } from 'node:fs'

await mount('/path/to/nonexistent/dir', {
  async open (path) {
    return new FakeFDHandle()
  },

  async createReadStream (fakeFdHandle) {
    // ...
  }

  // not sure I would support anything else as most things can be derived by those two
})

arcanis commented 11 months ago

Have you got links to the vfs implementations of electron and yarn handy?

Electron: ASAR (implementation)
Yarn: @yarnpkg/fslib (FakeFS interface)

Generally the ASAR implementation is a much more limited version of Yarn's one; since Yarn is used for a very large variety of tools (Eslint, Parcel, Next, ...) we follow very closely the Node.js fs API. By contrast Electron can get away with skipping some rarely used primitives like fstat etc.

Here is the kind of API that I would like to see:

We do something similar, it's just that the check is left up to the implementer (because what constitutes a virtual file is rarely decided based on the folder; it's rather the file being accessed):

async function readFile(p, opts, nextReadFile) {
  if (p.includes('.zip/')) {
    return readFileFromZip(p, opts);
  } else {
    return nextReadFile(p, opts);
  }
}

This mechanism is also closer from how other loader hooks work, as they delegate to the next resolver:

async function resolve(specifier, context, nextResolve) {
  if (specifier.match(/whatever/)) {
    return myCustomResolve(specifier, context);
  } else {
    return nextResolve(specifier, context);
  }
}

mcollina commented 11 months ago

I think the API node.js should implement is the minimal one needed to allow 3rd party modules to hook it to specific file formats. The reason why I mention mounting something to a specific folder is that that can be checked extremely quickly via a prefix trie.

GabenGar commented 11 months ago

Here is the kind of API that I would like to see:

Okay, what is the expected behaviour of the resulting VFS (and by extension SEA) file? mount() is a quite specific name and can mean somewhat different things in different platforms. Considering a lot of fs functions are named after linux utils, it can be quite confusing to have a method with the same name but mechanics unrelated to at least linux mount(). Hence why I think it should be a separate module, because it's only tangentially related to fs, as in you need to open the file of the VFS using the host file system interfaces and at that point it's almost custom logic, which might or might not involve host file system interfaces.

mcollina commented 11 months ago

In order for all things in the ecosystem to work correctly, support needs to be built in inside the node:fs module, because dependencies uses that to load files all the time.

The name of the function can be whatever we want: exposeVirtualFileSystem might be long and descriptive enough.

bmeck commented 11 months ago

I'd be a bit wary of jumping quickly to perf things like a prefix trie; for small numbers of mounts (I suspect this will be the normal case) there are likely other more efficient checks that could be done.

mcollina commented 11 months ago

Agreed. What I want to avoid is to have to iterate through a full list of regexp for each mount for every file system operation.

GeoffreyBooth commented 11 months ago

When we designed import { register } from 'node:module' we envisioned the pattern would be repeated for other systems. So import { register } from 'node:fs', and it can either directly register an object whose properties are functions or it can register a module that's run in a separate thread like the module hooks. The advantage of the latter is that users can write async functions that can run as hooks within sync calls, for example an async open hook that can run within readFileSync.

jviotti commented 5 months ago

Closing this a I think we did a lot of research already. Any action items can be extracted a new issues

nodejs / single-executable

Research the feasibility to providing a VFS extension mechanism to `fs`, `require()`, `child_process`, etc #37