nodejs / node

Node.js JavaScript runtime ✨🐢🚀✨
https://nodejs.org

APIs for libraries/frameworks/tools to control on-disk compilation cache (NODE_COMPILE_CACHE) #53639

Open joyeecheung opened 1 week ago

joyeecheung commented 1 week ago

Spinning from https://github.com/nodejs/node/pull/52535#issuecomment-2059390083

Currently, the built-in on-disk compilation cache can only be enabled via NODE_COMPILE_CACHE. End users can control where the cache is stored, which also means they can find the cache and clean it up when necessary. That's the simplest enabling mechanism for sure, but judging from the use cases of v8-compile-cache (a package that monkey-patches the CJS loader, a capability that we want to sunset, see https://github.com/nodejs/node/issues/47472), it's also common for library/framework authors to want to enable this in a more flexible manner. So this issue is opened to discuss what an API for this should look like and what the directory structure of the cache should look like.

With the global NODE_COMPILE_CACHE the current cache directory structure looks like this:

- Compile cache directory (from NODE_COMPILE_CACHE)
  - $version_hash_1: CRC32 hash of CachedDataVersionTag + NODE_VERSION (maybe we need to add UID too)
  - $version_hash_2:
    - $module_hash_1: CRC32 hash of filename + module type      <--- cache files
    - $module_hash_2: ...
  - ...
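As a rough illustration of how a `$module_hash` file name in the layout above could be derived (this is only a sketch, not Node's actual internals; `moduleCacheKey` is a hypothetical helper):

```javascript
// Minimal CRC32 (zlib polynomial) in plain JS, purely to illustrate the
// "CRC32 hash of filename + module type" naming scheme described above.
const CRC_TABLE = new Uint32Array(256).map((_, n) => {
  let c = n;
  for (let k = 0; k < 8; k++) c = (c & 1) ? (0xedb88320 ^ (c >>> 1)) : (c >>> 1);
  return c;
});

function crc32(str) {
  let crc = 0xffffffff;
  for (const byte of Buffer.from(str, 'utf8')) {
    crc = CRC_TABLE[(crc ^ byte) & 0xff] ^ (crc >>> 8);
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Hypothetical: combine the resolved filename and module type into a
// cache file name under the $version_hash directory.
function moduleCacheKey(filename, moduleType) {
  return crc32(`${filename}:${moduleType}`).toString(16);
}
```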

For reference, v8-compile-cache's cache directory looks like this:

- $tmpdir/v8-compile-cache-$uid-$arch-version
  - $main_name.BLOB: the filename of the module that calls `require('v8-compile-cache')`, or process.cwd() if it's not required from a file
  - $main_name.MAP:
  - $main_name.LOCK

Inside the .BLOB files it maintains a (module_filename + SHA-1 checksum) -> cache_data mapping. Its documentation explains:

The cache is entry module specific because it is faster to load the entire code cache into memory at once, than it is to read it from disk on a file-by-file basis.

In my investigation when implementing NODE_COMPILE_CACHE, though, there's actually not much performance difference in reading on a file-by-file basis, at least when it's implemented using native FS calls and the cache file is only loaded when the corresponding module is about to be compiled (so, unlike v8-compile-cache, not all of the cache is loaded into the process at once even when some of the cached modules are never needed by the application).

For third-party tooling (e.g. transpilers, package managers) I think a layout that doesn't distinguish between entry points would still be beneficial - as long as the final resolved file path remains the same, its content matches the checksum, it's still being loaded by the same Node.js version, etc., the cache is going to hit. Then if multiple dependencies in the same project try to enable it, we wouldn't be saving multiple caches on disk that effectively cache the code of the same files (e.g. the end user's code needs package foo that resolves to /path/to/foo.js, whose cache would otherwise be stored once in the cache enabled by a transpiler and again in the cache enabled by a package manager that executes a run command).

I wonder if we should just provide the following APIs:

const module = require('node:module');  // Or import it

/**
 * Enable the on-disk compile cache for all user modules compiled in the current Node.js instance
 * after this method is called.
 * If cacheDir is undefined, it defaults to the NODE_COMPILE_CACHE environment variable.
 * If NODE_COMPILE_CACHE isn't set either, it defaults to `$TMPDIR/node_compile_cache`.
 * @param {string|undefined} cacheDir
 * @returns {string} The path to the resolved cache directory.
 */
module.enableCompileCache(cacheDir);

/**
 * @returns {string|undefined} The resolved cache directory, if the on-disk compile cache
 *   is configured; otherwise undefined.
 */
module.getCompileCacheDir();

module.getCompileCacheDir() would still allow end users to find and clean up stale caches to release disk space. We could probably also add a file with an easy-to-find name to the designated directory (e.g. $CACHE_DIR/node_compile_cache_mark) to facilitate this.

In most use cases, tooling and libraries should simply call module.enableCompileCache() without an argument, so that the cache is stored in tmpdir and shared with other dependencies by default, while end users can still override the default cache directory location with NODE_COMPILE_CACHE. More advanced tooling/frameworks that want further customization can use their own cache directory by specifying it.

Some more powerful APIs are probably needed to allow advanced configuration of the cache storage, but at least the APIs mentioned above would address the use cases of existing v8-compile-cache users. For the more powerful APIs, it would be difficult to design something that works well without some collaboration with adopters, so ideas are welcome regarding what they should look like :)

joyeecheung commented 1 week ago

cc @merceyz @jakebailey @H4ad from https://github.com/nodejs/node/issues/47472

benjamingr commented 1 week ago

It's also common for library/framework authors to want to enable this in a more flexible manner.

Why? What sort of flexibility would the library/framework need that the environment variable doesn't provide?

jakebailey commented 1 week ago

This sounds great; supporting a reasonable default location is super helpful.

Is node:module the right place for this? Or is node:v8 actually where it might be?

after this method is called.

All-in-all, I'm not sure how I feel about being unable to use this without using CJS or TLA; if an executable wants to enable caching of itself, it needs to have an extra entrypoint which only serves to enable the caching and then load the other code. Or, fork, which is slow. Not sure that one can do better, though. The call has to happen somewhere...

I guess this is exactly how v8-compile-cache works? (Not familiar with its implementation but I guess it must have the same restriction...)

Why? What sort of flexibility would the library/framework need that the environment variable doesn't provide?

If you want to enable caching today, you have to set the environment variable. This means that applications which want to enable it for themselves have to fork a new process to enable it, defeating the speedup.

joyeecheung commented 1 week ago

Is node:module the right place for this? Or is node:v8 actually where it might be?

I used node:module off the top of my head but node:v8 sounds reasonable too. I am slightly leaning towards node:module because this only applies to the user modules loaded by the usual module loading process (so if the user compiles some module differently via vm APIs, this won't apply, at least not automatically).

joyeecheung commented 1 week ago

All-in-all, I'm not sure how I feel about being unable to use this without using CJS or TLA

You could also --import an ESM that does this call synchronously.

if an executable wants to enable caching of itself, it needs to have an extra entrypoint which only serves to enable the caching and then load the other code.

Yeah, I think there is a general lack of a way for libraries to "define something to be run before everything else" without the use of command-line flags or environment variables. It was also raised in the module loader hooks discussion (https://github.com/nodejs/node/issues/52219#issuecomment-2061740553). IMO we need to figure out a way to allow developers/users to specify code that needs to be preloaded for every/some process/worker. But some configuration needs to happen somewhere - perhaps some magic field in package.json is a good place for it, but that would probably be a separate topic.