Open mejedi opened 3 years ago
POC #36672
A few months ago, @mhdawson and I published a paper on WebAssembly in Node.js with a focus on caching compiled code in cloud/server environments, so I'll share a few thoughts.
`root` without changing the `HOME` environment variable, and that could lead to privilege escalation in this case. Also, technically, some operating systems allow running processes from user accounts that don't have home directories.
Personally, I believe the following two options are better than the proposed option that requires a potentially complex API in Node.js and changes in third-party packages to utilize it.
All of this can be implemented by third-party packages if Node.js provides a way of serializing and deserializing instances of `WebAssembly.Module`. That used to be possible through `v8.serialize` and `v8.deserialize`, but that feature was removed by V8 and Node.js (https://github.com/nodejs/node/issues/18265). I was still able to implement this by hooking into V8 internals, but ultimately, the only correct approach here is for V8 to expose non-streaming serialization and deserialization APIs again.
Once that happens, packages have a great amount of flexibility. For example, during `npm install`, a package can compile its WebAssembly modules and store them in its installation directory without any need to access the user's home directory etc. No cache collisions, no security issues, and excellent performance.
V8 currently only supports caching for streaming compilation. However, streaming compilation requires WhatWG streams, so that's not really an option at this point, IIRC.
This is also possible, without any code changes in userland, and it doesn't require streaming compilation. Just compute a hash of the WebAssembly module bytes ("wire bytes" within the V8 WebAssembly pipeline) and use the computed hash to identify the module within the cache. I benchmarked a few hash functions, and there is an obvious tradeoff between the imposed delay and the probability of hash collisions. For example, SHA1 and MD5 are vulnerable to some hash collisions, but can be computed with less than 10 CPU cycles per byte.
This approach allows Node.js to cache WebAssembly modules without any code changes. Users wouldn't need to think about caching, it would just magically happen.
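The hashing step described above can be sketched as follows. This is only an illustration of deriving a cache key from the wire bytes with Node's built-in `crypto` module. The helper name `wasmCacheKey` is hypothetical, and SHA-256 is chosen here as an assumption (the comment above discusses the speed versus collision-resistance tradeoff between hash functions).

```javascript
const crypto = require('crypto');

// Hypothetical helper: derive a cache key from a module's wire bytes.
// SHA-256 is an assumption; a faster but weaker hash would trade
// collision resistance for lower latency on a cache hit.
function wasmCacheKey(wireBytes) {
  return crypto.createHash('sha256').update(wireBytes).digest('hex');
}

// The 8-byte header (magic number + version) is the smallest valid module.
const emptyModule = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);
console.log(wasmCacheKey(emptyModule)); // 64 hex characters
```

Identical wire bytes always map to the same key, so no cooperation from userland code is needed. The cost is one full pass over the module bytes on every load.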
The main difficulty is: where do we store the compiled code? One option would be to add a command-line flag `--wasm-cache-dir=<value>` that must be specified to enable the cache. This could also be passed through the `NODE_OPTIONS` environment variable.
@tniessen Thank you for tuning in!
Implicit caching
Implicit caching is a no-go. Run `openssl speed sha256` to get an idea of how high the latency for a cache hit would be. This can be parallelised easily, but we would still be burning lots of CPU cycles for nothing. I can imagine V8 loading compiled wasm modules with `mmap` some day, matching the performance of loading a native executable.
Explicit caching in third-party packages
Once that happens, packages have a great amount of flexibility. For example, during npm install, a package can compile its WebAssembly modules and store them in its installation directory
Compiled module might become useless once Node is updated (depends on V8 version). Furthermore, there are command line options to enable experimental features (e.g. reference types) which result in incompatible binaries. We rely on third-parties doing things right. If they don't factor V8 version + options hash in the cache file name, and multiple Node versions are in use, the cache is essentially disabled. Furthermore, they should check file modification time for cache invalidation. That's why I was proposing an easy to use API for third-parties.
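As a rough sketch of what "doing things right" could involve, the snippet below builds a cache file name from the V8 version plus a hash of the command-line options, and checks modification times for invalidation. The helper names `cacheFileName` and `isCacheFresh` and the file name layout are hypothetical. Hashing every entry of `process.execArgv` and `NODE_OPTIONS` is an assumption that errs on the side of extra cache misses, since not every option affects compiled Wasm code.

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Hypothetical helper: the file name changes whenever the V8 version or
// the options change, so incompatible binaries never share a cache entry.
function cacheFileName(
  moduleName,
  v8Version = process.versions.v8,
  options = [...process.execArgv, process.env.NODE_OPTIONS || ''],
) {
  const optionsHash = crypto
    .createHash('sha256')
    .update(JSON.stringify(options))
    .digest('hex')
    .slice(0, 16);
  return `${moduleName}.v8-${v8Version}.${optionsHash}.bin`;
}

// Hypothetical helper: the cached file is usable only if it exists and is
// at least as new as the .wasm file it was compiled from.
function isCacheFresh(wasmPath, cachePath) {
  try {
    return fs.statSync(cachePath).mtimeMs >= fs.statSync(wasmPath).mtimeMs;
  } catch {
    return false; // either file is missing, so there is nothing to reuse
  }
}

console.log(cacheFileName('example'));
```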
Compiled Wasm binaries could be large. Mine are over 100MiB. Having them stored in a dedicated directory makes it possible to limit the max cache size.
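One way a dedicated directory makes a size limit enforceable is a simple eviction pass. The sketch below is an assumption about how that might look, not part of the POC: `selectEvictions` is a hypothetical helper that keeps the most recently modified entries that fit within the budget and reports the rest for deletion.

```javascript
// Hypothetical helper: given cache entries ({ name, size, mtimeMs }) and a
// byte budget, return the names of the entries that should be deleted.
// Newest entries are kept first; anything that no longer fits is evicted.
function selectEvictions(entries, maxBytes) {
  const newestFirst = [...entries].sort((a, b) => b.mtimeMs - a.mtimeMs);
  const evict = [];
  let keptBytes = 0;
  for (const entry of newestFirst) {
    if (keptBytes + entry.size <= maxBytes) {
      keptBytes += entry.size;
    } else {
      evict.push(entry.name);
    }
  }
  return evict;
}

const MiB = 1024 * 1024;
console.log(selectEvictions(
  [
    { name: 'a.bin', size: 160 * MiB, mtimeMs: 3 },
    { name: 'b.bin', size: 120 * MiB, mtimeMs: 2 },
  ],
  256 * MiB,
));
```

The entry list could come from `fs.readdirSync` plus `fs.statSync` on the cache directory.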
I'd love to be able to precompile wasm modules for my Docker images. This should be opt-in rather than happening during npm install. Would be great to converge on a convention and implement it in npm (like an option for `install`).
Last, I'd like for WebAssembly compilation to detect if the same module is currently being compiled in a different process, and wait until the compiled artefact becomes available. (Use-case: a command line tool launched in parallel by make.)
Be careful where code is cached. Storing cached WebAssembly code in the user's home directory is not necessarily a good idea. The cached code contains CPU instruction sequences with virtually no protection against modifications, and security issues can arise if you load executable code from an unprotected directory.
Is it different from plain JS files?
Be careful when code is cached. Depending on the V8 version, platform and compilation method, V8 will use either the Liftoff or TurboFan compiler, or it will use Liftoff ahead-of-time and TurboFan just-in-time. You don't want to store the result of Liftoff, only the result of TurboFan.
Solved in the POC.
Looking forward to reading your article. Don't have ACM subscription, could you please share the text?
Implicit caching is a no-go. Run `openssl speed sha256` to get an idea of how high the latency for a cache hit would be. This can be parallelised easily, but we would still be burning lots of CPU cycles for nothing. I can imagine V8 loading compiled wasm modules with `mmap` some day, matching the performance of loading a native executable.
Well, you can use faster fingerprinting functions, but they allow attackers to achieve cache collisions. (In our research, we used constant-time fingerprinting functions in non-security critical environments.)
Still, while significant, the overhead of `sha256` is much smaller than the overhead of compilation.
Compiled module might become useless once Node is updated (depends on V8 version). Furthermore, there are command line options to enable experimental features (e.g. reference types) which result in incompatible binaries. We rely on third-parties doing things right. If they don't factor V8 version + options hash in the cache file name, and multiple Node versions are in use, the cache is essentially disabled. Furthermore, they should check file modification time for cache invalidation. That's why I was proposing an easy to use API for third-parties.
Trust me, I know. Worst case is, V8 will go back to compiling the module if the cached version is outdated or was compiled with incompatible options.
I 100% agree that an "easy to use API for third-parties" would be great to solve these potential issues, but it doesn't really have to be in Node.js core. The important aspect is exposing simple and reliable serialization and deserialization APIs for WebAssembly from V8/Node.js, and then npm packages can provide the caching implementation.
Last, I'd like for WebAssembly compilation to detect if the same module is currently being compiled in a different process, and wait until the compiled artefact becomes available. (Use-case: a command line tool launched in parallel by make.)
We did implement something like this within the cache architecture, but our design doesn't really apply to the general case. IPC between arbitrary Node.js processes isn't exactly simple. Sure, you could simply write lock files to the directory, but that doesn't necessarily make things simpler, especially on NFS.
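For reference, the lock-file variant mentioned above could look roughly like the sketch below. It relies on exclusive file creation (`'wx'` flag) so that only one process wins. The helper names `tryAcquireLock` and `releaseLock` are hypothetical, and the sketch deliberately ignores the hard parts raised in the comment: stale locks left behind by crashed processes, and file systems such as NFS where exclusive creation may not behave reliably.

```javascript
const fs = require('fs');

// Hypothetical helper: returns true if this process created the lock file
// and should compile the module, false if another process already holds it.
function tryAcquireLock(lockPath) {
  try {
    const fd = fs.openSync(lockPath, 'wx'); // 'wx' fails if the file exists
    fs.writeSync(fd, String(process.pid));  // record the owner for debugging
    fs.closeSync(fd);
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false;
    throw err;
  }
}

// Hypothetical helper: remove the lock once the compiled artefact is stored.
function releaseLock(lockPath) {
  try {
    fs.unlinkSync(lockPath);
  } catch (err) {
    if (err.code !== 'ENOENT') throw err;
  }
}
```

A process that fails to acquire the lock would then poll for the finished cache file, or fall back to compiling the module itself after a timeout.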
Be careful where code is cached. Storing cached WebAssembly code in the user's home directory is not necessarily a good idea. The cached code contains CPU instruction sequences with virtually no protection against modifications, and security issues can arise if you load executable code from an unprotected directory.
Is it different from plain JS files?
Yes, we don't store compiled code anywhere in the file system, partially due to security concerns. If you are talking about the plain JS code, by default, we only load JS files from the application installation directory, which can have much stricter permissions than the rest of the file system. (Especially if you install packages globally.) Also, there are approaches to sandbox Node.js applications (without the overhead of containers), and loading executable instruction sequences from an unprotected directory can bypass those sandboxes.
Looking forward to reading your article. Don't have ACM subscription, could you please share the text?
The preprint is available for free here. It's just the 10 page paper, there's also a 150 page analysis with far more explanations, experiments, and results, and I'll hopefully be able to share that soon.
@tniessen Thank you for sharing the paper. Is it fine to circulate the link?
there are approaches to sandbox Node.js applications (without the overhead of containers)... by default, we only load JS files from the application installation directory...
Is there a sandbox for Node.js in widespread use, or is it more of a research topic? I wonder what exactly the threat model is here. Can you limit resource consumption (CPU, RAM, filesystem quota) w/o containers? Even though code is only loaded from safe locations by default, how do you ensure that a program doesn't alter loader settings due to neglect or malice?
In my experience, containers impose little overhead (less than 10ms to set up).
I've been working on a generic `wasm-run` utility, and inability to cache compiled wasm makes it really slow in some scenarios.
far more explanations, experiments, and results, and I'll hopefully be able to share that soon
Released a few days ago via UNB Scholar.
Congrats on getting that published @tniessen :-)
I have been thinking more and more lately about whether it would make sense for `require()` and `import` to include first class support for loading wasm modules.
So instead of...
const obj = await WebAssembly.instantiateStreaming(fetch('test.wasm'))
obj.instance.exports.func();
We could essentially have...
const obj = await import('test.wasm');
obj.func();
and...
const obj = require('test.wasm');
obj.func();
Obviously the `require()` option wouldn't really be that portable outside of Node.js, but the `import` approach, as essentially just a wrapper for the WebAssembly boilerplate with caching, would make this easier to consume.
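To make the "wrapper for the WebAssembly boilerplate" idea concrete, here is a minimal userland sketch of what a synchronous loader could do today with the standard `WebAssembly.Module` and `WebAssembly.Instance` constructors. `loadWasmSync` is a hypothetical helper, not a Node.js API, and it does no caching: the module is recompiled on every call, which is exactly the cost this issue is about.

```javascript
const fs = require('fs');

// Hypothetical helper, not a Node.js API: read a .wasm file, compile and
// instantiate it synchronously, and return its exports directly.
function loadWasmSync(filename, imports = {}) {
  const wasmModule = new WebAssembly.Module(fs.readFileSync(filename));
  return new WebAssembly.Instance(wasmModule, imports).exports;
}

// Usage, assuming a file test.wasm that exports a function named func:
//   const obj = loadWasmSync('test.wasm');
//   obj.func();
```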
We were planning to add this to esm, but we're waiting on https://github.com/WebAssembly/esm-integration/
I put together a small cache impl on top of my fetch impl here: https://github.com/devsnek/node/commit/b07e08342fbc06a73bed66ca3c8486cc000f0736
you can set it using code like this:
require('v8').setWasmStreamingHandler({
get(url) { ... },
set(url, buffer) { ... },
delete(url) { ... },
});
Can't believe that there is still no way to cache compiled wasm module... any chance it will be possible soon?
I would also love to see this feature in NodeJS. My ~45MB WASM project starts reasonably fast in the browser, but takes minutes to start up in NodeJS. Given that the esm-integration proposal is only in stage 2/5, I would expect it to take quite a while (at least months, maybe more than a year?) until this feature is finished and standardized.
@devsnek, can the cache implementation that you mentioned be used today? Naively asking: Could I apply this commit on top of the NodeJS master branch and expect it to work? I would assume that I could write the `buffer` argument of the `set` function to a file? And respond with the contents of that file, when the `get` function is called?
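The storage half of that idea can be sketched independently of the handler API. The snippet below is a hypothetical file-backed store with the same `get`/`set`/`delete` shape as the example above. `createFileStore` is an invented name, and whether the handler in the linked commit accepts synchronous return values of this kind is an assumption that would need to be checked against that commit. `setWasmStreamingHandler` itself comes from the linked commit, not from a released Node.js version.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Hypothetical file-backed store: one file per URL, named by a hash of the
// URL so that arbitrary URLs map to safe file names.
function createFileStore(dir) {
  fs.mkdirSync(dir, { recursive: true });
  const fileFor = (url) =>
    path.join(dir, crypto.createHash('sha256').update(String(url)).digest('hex'));

  return {
    get(url) {
      try {
        return fs.readFileSync(fileFor(url));
      } catch (err) {
        if (err.code === 'ENOENT') return undefined; // cache miss
        throw err;
      }
    },
    set(url, buffer) {
      fs.writeFileSync(fileFor(url), buffer);
    },
    delete(url) {
      try {
        fs.unlinkSync(fileFor(url));
      } catch (err) {
        if (err.code !== 'ENOENT') throw err;
      }
    },
  };
}
```

Note that keying by URL alone does not cover the invalidation concerns raised earlier in the thread (V8 version, command-line options, changed module contents).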
There has been no activity on this feature request for 5 months and it is unlikely to be implemented. It will be closed 6 months after the last non-automated comment.
For more information on how the project manages feature requests, please consult the feature request management document.
now that `instantiateStreaming` has landed, it might be interesting to take another look at this. @tniessen do you have any plans to do so? if not i might noodle with it.
@tniessen do you have any plans to do so?
No, not in the near future. When I previously worked on caching compiled WebAssembly code, our goal was to scale to large clusters of application processes and to make security guarantees by only loading compiled code from memory areas that were guarded by a trusted supervisor process, which also ended up being much faster than loading compiled code from disk. But with Node.js, the security aspect is difficult for us to manage (i.e., we probably don't want to store compiled code in an unprotected location).
There has been no activity on this feature request for 5 months and it is unlikely to be implemented. It will be closed 6 months after the last non-automated comment.
For more information on how the project manages feature requests, please consult the feature request management document.
Just making a comment to keep the issue open. I imagine there's a demand to see this implemented when/if it becomes feasible.
There has been no activity on this feature request for 5 months and it is unlikely to be implemented. It will be closed 6 months after the last non-automated comment.
For more information on how the project manages feature requests, please consult the feature request management document.
Noooo
There has been no activity on this feature request for 5 months and it is unlikely to be implemented. It will be closed 6 months after the last non-automated comment.
For more information on how the project manages feature requests, please consult the feature request management document.
Keep alive please, Mr bot.
There has been no activity on this feature request for 5 months. To help maintain relevant open issues, please add the https://github.com/nodejs/node/labels/never-stale label or close this issue if it should be closed. If not, the issue will be automatically closed 6 months after the last non-automated comment. For more information on how the project manages feature requests, please consult the feature request management document.
I believe it should be marked as `never-stale`, but I have no rights to add that label.
What is the progress on this issue?
Is your feature request related to a problem? Please describe.
Large WASM modules take a while to compile, impacting startup times. Browsers solve this problem by caching compiled module data.
WebAssembly is a promising medium for software distribution. It makes sense to compile for WebAssembly once, instead of shipping multiple binaries for each supported CPU architecture / OS pair, doesn't it? There's a growing number of tools targeting this particular segment, e.g. https://wasmer.io.
Node.js has unique advantages as a WASM runner:
WASI as a standardised interface for interfacing with the OS (e.g. accessing files, etc.) is insufficient for most practical needs. With Node, one can leverage a plethora of high quality battle-tested cross platform libraries.
V8 is top notch! It beats competing WASM engines in terms of compile times, resource consumption and the footprint of compiled WASM files (a single data point: V8 produces 160MiB for a 50MiB WASM file, a competitor generates 1+GiB).
NPM is super robust. Competitors have their own package managers but not particularly reliable ones.
The only component missing in Node.js is a compiled module cache.
It takes 47s to compile the previously mentioned 50MiB WASM file. With a cache POC, the startup time is reduced to under 1s.
Describe the solution you'd like
as a cache-enabled moral equivalent of
Cache files to be stored in OS-mandated cache directory, e.g. `~/.cache/node/wasm-cache` on Linux.
Describe alternatives you've considered
It used to be possible to serialise a WebAssembly module to a file explicitly. #18265
It stopped working since Node.js 13 due to changes in V8 and there's no way to make it work again.
Instead of introducing a new API, it is possible to enhance `WebAssembly.compile`/`WebAssembly.compileStreaming`. Unfortunately, it's hard to come up with a good cache key. We could `sha256` the data, but that's inefficient.