Make npmInstall and npmSetup tasks cacheable

bercik commented 4 years ago

Currently npmInstall and npmSetup tasks are not cacheable by gradle build cache, thus when running CI pipelines on different machines those tasks needs to fully run every time which consumes a lot of time.

Please consider making those tasks cacheable. Instructions can be found here: https://docs.gradle.org/current/userguide/build_cache.html#enable_caching_of_non_cacheable_tasks

deepy commented 4 years ago

Long story short, those tasks don't cache very well, however, if you can get a local mirror of node and npm you should be able to get the same experience for nodeSetup/npmSetup.

And if you're using webpack or similar you can set up a task for that with inputs/outputs and cache it, which should help speed things up

bsautel commented 4 years ago

Let's have a look to these tasks (I add nodeSetup to the list):

nodeSetup download Node.js. It uses for that a Gradle repository that automatically caches downloaded artifacts (exactly the same way as it does for Maven artifacts). So it is downloaded only once, but when the task is not up-to-date, it unpacks the archive (which is not very different in terms of performance as using a cache).
npmSetup uses npmInstall to install a specific version of npm. See below.
npmInstall: as of version 5 (as far as I remember), it caches the downloaded packages to avoid downloading them each time they are installed. This task is also already cached, not by Gradle directly, but by npm.

All those tasks are not directly cacheable by Gradle, but they rely on tools that use cache.

I know this have some limitations. For instance, if the build runs inside a container whose file system is not persistent, the cache will not be reused for next builds by default. This will be also the case for Gradle, but this can be fixed by changing its cache configuration. It would be necessary to do the same thing for also for npm and it is much simpler if all the cache is managed by Gradle. Once the Gradle cache is configured, everything can be cached.

We can assume (it would be interesting to test it though), that npmInstall for instance would be even faster if it was fully cached by Gradle. Indeed, npm install does not only extract the package contents. It also executes some post install scripts (for instance a node-gyp compile). Caching the node_modules directory would enable us to skip this whole process. But I think it could cause multiple issues:

Do the post install scripts make change only to the node_modules directory? I don't really know whether there are some conventions regarding this behavior. If it's not true, this means that each npmInstall user would be in charge of configuring perfectly the task output to ensure the cached result is perfectly right.
As far as I know, the node_modules directory does not contain exactly the same thing in the different platforms. For instance, node-gyp packages contain some platform-specific binaries. We should probably add the platform as a cache key to ensure the node_modules directory will not be reused across different platforms.
There are probably other things I didn't think about.

I think we should make some experiments to ensure that make them cacheable really makes sense and does not create some unwanted side effects.

deepy commented 4 years ago

nodeSetup and npmSetup is going to be interesting to cache as they both access the same directories and create symbolic links (which are going to be replaced with the file they point to if we use the Gradle cache).

Luckily even if you run on an entire clean machine nodeSetup and npmSetup aren't that heavy, especially if you got distBaseUrl and .npmrc pointing to local mirrors or ones close to you.

bercik commented 4 years ago

Thanks for replying.

You are right, nodeSetup and npmSetup don't seem to heavy, but the npmInstall does. We did the following in our project:

We've added custom npmCi task that caches node_modules folder:

task npmCi(type: NpmTask, dependsOn: [npmSetup]) {
if (System.env['CI'] != null && System.env['CI'] == "TRUE") {
    args = ['ci']
} else {
    args = ['install']
}

inputs.file(file("${project.projectDir}/frontend/package.json")).withPathSensitivity(PathSensitivity.RELATIVE)
inputs.file(file("${project.projectDir}/frontend/package-lock.json")).withPathSensitivity(PathSensitivity.RELATIVE)

outputs.dir "${project.projectDir}/frontend/node_modules"
outputs.cacheIf { true }
}

In .gitlab-ci we are running the task as follows: ./gradlew npmCi --build-cache
We are caching .gradle folder in gitlab-ci:
```
cache:
paths:
  - .gradle
```

What do you think of that solution? It seems to be working, but I'm a little afraid about the possible problems you've mentioned above.

nukesz commented 4 years ago

Hi,

I've come up with the following code and it works for me:

npmInstall {
    inputs.files(project.files('package.json'))
    outputs.dir("${project.buildDir}/node_modules/")
}

task buildUI(type: NpmTask, dependsOn: npmInstall) {
    inputs.files(project.files('src'))
    outputs.dir("${project.buildDir}/dist/")
    args = ['run', 'build']
}

build.dependsOn buildUI

I'm running the gradle build and it's running npm install only when the package.json is changed and it runs npm build when the src folder is changed.

bsautel commented 4 years ago

@nukesz normally you don't need to configure the npmInstall as you did. The default configuration already contains those inputs and outputs.

I also would advice you to add the node_modules directory as an input of your buildUi task since a change in your project library will probably change the build output. I would also add the package.json file since you rely on its scripts when running run build. You can also use the NpxTask to avoid needing to declare the build script in the package.json file.

@nukesz, you are not talking about the same thing as @bercik. This issue is about caching, not just task execution avoidance. Task execution avoidance is the fact of skipping the task execution if the inputs did not change since last build and the outputs are still present and did not change. Caching is one step further. As Gradle knows which are the input and outputs of the task, it can save the output in a cache (which can be shared between multiple computers) and reuse the same output stored in the cache instead of running the task when the inputs are identical.

bsautel commented 4 years ago

@bercik is your experiment successful? Does it work as expected? Do you use only local cache or do you share the cache between multiple hosts?

remmeier commented 4 years ago

I would recommend merging npmInstall and build tasks. For example by using in package.json:

"build": "npm ci && ng build

This gives a single task in Gradle. So if the result (the compiled webapp) is already available, the build server can jump over setting up the node_modules directory. Which can save a lot of time.

And npm install should not be used, as it is not reproducible. Rather use npm ci.

deepy commented 4 years ago

@remmeier if the compiled web app is already available then both npmInstall (which can be configured to use npm ci) and your build task would be skipped as your build task is already up-to-date.

Though in your case there's a caveat with npm ci, which is that it'll delete your node_modules before running.

remmeier commented 4 years ago

does npm ci not do up-to-date checking? Actually in our project we make use of yarn, but apply the same techniques. This works well. Was hoping for NPM to get to the same level in that regard. With our single task yarn setup we manage to save around 30-60 second with every build per frontend project (which adds up, in particular for multi-project builds).

(there is a second new npm-based project on my side, i probably have some more insights soon in that regard)

bsautel commented 4 years ago

I did some experiments in another issue to know which one of Gradle up-to-date checking or npm install up-to-date checking and it appeared that Gradle was much faster than npm install (see here for Linux and here for Windows). Of course, it probably also depends on the project size (mainly the number of dependencies). But this was not really the result we were expecting.

I did not do this experiment with npm ci nor with yarn, that would be interesting to do it.

bsautel commented 4 years ago

And it sounds like npm ci always starts by removing the node_modules directory. So it does not seem to do any up-to-date checking attempt.

Here is the first line that is printed by npm ci

npm WARN prepare removing existing node_modules/ before installation

remmeier commented 4 years ago

I will have a look later and file a bug report if it really does not to up-to-date checking. I really like yarn & Gradle...

npmInstall and gradle cache restore will heavily depend on where the data is coming from. yarn is able to cache it locally, so it can be fast. The same for gradle. A close-by Gradle remote cache is also not so bad. But the public NPM registry should be avoided at all cost I think for anybody in need of r robust/fast builds. Your Gradle cache results look good for sure.

remmeier commented 4 years ago

the one things really good at yarn is that yarn install is super fast if all is up-to-date (around a second).

deepy commented 4 years ago

the one things really good at yarn is that yarn install is super fast if all is up-to-date (around a second).

I don't think realistically it can ever beat gradle doing up-to-date checks on your compile webapp task unless you either have a massive app or no dependencies.

bsautel commented 4 years ago

I did this experiment again using an Angular project with a multiple dependencies (902 packages at node_modules root) on an Ubuntu 20.04 laptop.

Note that I used the version 3 of the plugin (which will be released soon) because I discovered an issue in the version 2.2.3 regarding the yarn task outputs declaration (the task does not run when I remove the node_modules directory, whereas it should).

Command	Command execution time	Gradle up-to-date check
`yarn`	1 second	1 second
`npm install`	8 seconds	1 second
`npm ci`	40 seconds	1 second

We can see that yarn's up-to-date checking is fast, but not faster that Gradle. For npm install and npm ci, Gradle is clearly faster.

bsautel commented 4 years ago

I could investigate the issue I encountered, and it appears that the up-to-date checking of the yarn task works as expected, I probably did an error in my experiment.

But I discovered that the node_modules directory is declared twice as output, once as a directory and once as a file tree (unless you use the node_modules output filter). It is useless and slows down the Gradle's up-to-date checking. It takes 3 seconds instead of 1 (with the version 3 of the plugin).

I created the issue #96 and fixed it. It now also takes 1 second with the version 2.2.3 and the fix.

sergeykad commented 3 years ago

What about Yarn tasks ( YarnSetupTask, YarnInstallTask, YarnTask)? Can build cache be enabled for them?

deepy commented 3 years ago

Your own ad-hoc YarnTasks can absolutely be cached, where I work all our npm run dist takes nearly a minute each (and we got 4 in one project) They cache extremely well, especially when paired with a remote build-cache

doctorpangloss commented 2 years ago

The problem is that yarnSetup, npmSetup and nodeSetup reference absolute paths, busting the cache. Just don't do that. gradle shouldn't be installing those things anyway.

To fix caching, add this to your root build.gradle:

plugins {
  id "com.github.node-gradle.node" // specify version!
}

import com.github.gradle.node.task.NodeSetupTask
import com.github.gradle.node.yarn.task.YarnSetupTask
import com.github.gradle.node.npm.task.NpmSetupTask

afterEvaluate {
  // disable all setup tasks because they bust the cache
  allprojects {
    [YarnSetupTask.class, NpmSetupTask.class, NodeSetupTask.class].forEach({ setupTask ->
      tasks.withType(setupTask).configureEach({ task ->
        task.configure {
          onlyIf { false }
        }
      })
    })
  }
}

deepy commented 2 years ago

There's some other workarounds in https://github.com/gradle/gradle/issues/3525 as well, but with symbolic links being resolved to their file instead of being stored I'm not sure we ever could support the build cache in those tasks.

So if you're using download = false you'll get the same effect as the snippet above (but lose out on perfectly reproducible builds)

Cerber-Ursi commented 11 months ago

Bumping the thread, since the question I've run into seems to be related.

On Gradle 7.4, npmSetup task shows the following deprecation warning:

Cannot access output property 'npmDir' of task ':npmSetup' (see --info log for details). Accessing unreadable inputs or outputs has been deprecated. This will fail with an error in Gradle 8.0. Declare the task as untracked 
by using Task.doNotTrackState(). Consult the upgrading guide for further information: https://docs.gradle.org/7.4/userguide/upgrading_version_7.html#declare_unreadable_input_output

This is shown only on Windows - on Linux, even in WSL, there's no problem. I guess the source of the error is the symbolic links in bin, since they can't be opened by the ordinary Windows programs too. Does this mean that npmSetup on Windows will always be effectively uncacheable?

deepy commented 9 months ago

@Cerber-Ursi I'm unable to reproduce that issue using the example project, with or without download enabled, with or without a npmVersion set https://github.com/node-gradle/gradle-node-plugin/blob/main/examples/simple-node/npm/build.gradle

Can you open a separate issue and provide a standalone project that reproduces the issue?

doctorpangloss commented 9 months ago

in my experience since my last update, it's better to never cache npmInstall and npmSetup. they are too large for gradle to cache efficiently anyway. however, the yarn-state or yarn-integrity files or similar should be specified as outputs, and if you want to be extra disciplined, node_modules must be set to read only. with yarn, use yarn 2+ and plug'n'play if you want caching.

Cerber-Ursi commented 9 months ago

Interesting. I've tried to do this again and wasn't able to reproduce the original problem either, including in the real project where it was initially noticed. Will try again later, not sure what exactly has changed in between.

adam-enko commented 8 months ago

EDIT: I made a mistake, the information below is incorrect.

~I tried to enable caching by specifying outputs.cacheIf { true }~

tasks.npmInstall {
    outputs.cacheIf { true }
}

~I think it's relevant that I also have a custom NpmTask~

val npmRunBuild by tasks.registering(NpmTask::class) {
  npmCommand.set(listOf("run", "build"))
  inputs.files("package.json", "webpack.config.js")
    .withPathSensitivity(RELATIVE)
  inputs.dir("src/main")
    .withPathSensitivity(RELATIVE)
  inputs.dir(layout.projectDirectory.dir("node_modules"))
    .withPathSensitivity(RELATIVE)
    outputs.dir("dist")
}

~If I enable additional build cache logging (add org.gradle.caching.debug=true to gradle.properties) and run ./gradlew clean && ./gradlew assemble I can see this warning on the npmInstall task~

> Task :npmInstall
[...]
Non-cacheable because Gradle does not know how file 'node_modules/.package-lock.json' was created (output property 'nodeModulesPackageLock'). Task output caching requires exclusive access to output paths to guarantee correctness (i.e. multiple tasks are not allowed to produce output in the same location). [OVERLAPPING_OUTPUTS]

~Specifying the package-lock file as an output seems to help:~

tasks.npmInstall {
    outputs.file(layout.projectDirectory.file("node_modules/.package-lock.json"))
        .withPropertyName("nodeModulesPackageLock")
    outputs.cacheIf { true }
}

~I can then run~

~1. ./gradlew assemble > assemble1.log~ ~2. ./gradlew clean -q~ ~3. ./gradlew assemble > assemble2.log~

~In assemble2.log I can see that npmInstall is cached.~

> Task :npmInstall FROM-CACHE

doctorpangloss commented 8 months ago

tasks.npmInstall {
    outputs.file(layout.projectDirectory.file("node_modules/.package-lock.json"))
        .withPropertyName("nodeModulesPackageLock")
    outputs.cacheIf { true }
}

I don't think this does what you think it does. Unless npmInstall contains node_modules/ as an output directory...

delete node_modules/
run npmInstall
observe gradle will restore just node_modules/.package-lock.json from the cache

adam-enko commented 8 months ago

I don't think this does what you think it does.

You're right! It wasn't caching as I expected, I misunderstood what was going on.

adam-enko commented 8 months ago

Hi 👋

I've add an example test of the behaviour I'd like to see in https://github.com/node-gradle/gradle-node-plugin/pull/301

Currently I see warning messages from Gradle that explain why npmInstall can't be cached:

Non-cacheable because Gradle does not know how file 'package-lock.json' was created (output property 'packageLockFileAsOutput'). Task output caching requires exclusive access to output paths to guarantee correctness (i.e. multiple tasks are not allowed to produce output in the same location). [OVERLAPPING_OUTPUTS]

Non-cacheable because Gradle does not know how file 'node_modules/.package-lock.json' was created (output property 'nodeModulesPackageLock'). Task output caching requires exclusive access to output paths to guarantee correctness (i.e. multiple tasks are not allowed to produce output in the same location). [OVERLAPPING_OUTPUTS]

From a quick look removing these properties as outputs should work

packageLockFileAsOutput conflicts with the input property, packageLockFileAsInput
nodeModulesPackageLock should be tracked by the getNodeModulesDirectory output

Do these changes seem reasonable? I'm not familiar with NPM development so I would appreciate any insights.

deepy commented 8 months ago

Conceptually you've got the right idea @adam-enko, but there's a few things that makes npm install cache extremely badly or outright broken

Let's start with the easy one, Gradle currently breaks symlink and this makes node_modules unhappy, this is currently being worked on and might get resolved by the linked PR in https://github.com/gradle/gradle/issues/3525

Next one is harder because I don't have the data any longer and never thought to create benchmarks, but when your node_modules becomes "moderately large" you may find that Gradle spends minutes doing the up-to-date check This is especially funny if it turns out to be out-of-date and you then have to wait for npm install to do the same up-to-date check

These might be fine locally, but if you have a remote build cache then publishing a 1-2GB node_modules is likely to cause issues And even at 100-200MB, the cost of pulling it from the cache will likely be bad, especially if you have a npm mirror close to you

And with a remote build cache, what happens if you pull a node_modules using native components built for the wrong platform? Adding the platform to the cache key by default would help this, but unless your node_modules has no native components then I suspect caching it won't make much sense for the above reasons

So to list the problems I remember:

npm install already does this check and is pretty fast at this point
Gradle's up-to-date checking can get really slow with large outputs and node_modules can easily become large
Some tools in the JS ecosystem happily read and write to a folder of their own in node_modules, this used to upset me but I can't find any modern tools doing this so this might not be relevant any longer
Every project I've personally come in contact with has included native components, making node_modules platform dependent, this used to upset me because a lot of these didn't build on FreeBSD
Caching the build of your app tends to make more sense as it's generally slow and npm install on a moderately up-to-date node_modules can be very quick

With these in mind, most people who provided feedback (and most people polled) preferred the solution where Gradle mostly leaves everything to npm and just keeps track of whether it has run successfully with the current package.json (and lockfiles) and only expecting node_modules/.package-lock.json as output See the fastNpmInstall for more information https://github.com/node-gradle/gradle-node-plugin/blob/16ecc0fa492b65dfa46a559f432ade45ce500b8e/src/main/kotlin/com/github/gradle/node/NodeExtension.kt#L127-L134

So to wrap this up, do you really want to cache node_modules or are you ok with recreating it initially and having it be reported up-to-date like other tasks? i.e. do you want it cached or are you interested in fast builds that work well with the remote cache but with the downside that your CI needs to do an npm install?

deepy commented 8 months ago

This doesn't mean there's no use for this, it's just that all evidence I know of suggests that people think they want this but they really might want something else

There might be a legitimate use-case for this, or you might want to look into your build and consider the setup If you're an open-source project under a permissive license (like the MIT, BSD, Apache) I'll happily take a look at setup, if you're GPL I'll grudgingly take a look at the setup, if you're AGPL or commercial feel free to send me an email and we can discuss options

adam-enko commented 8 months ago

Thanks for the detailed response @deepy! It's a lot more clear to me now.

I'm working on Dokka - which uses node-gradle in one subproject https://github.com/Kotlin/dokka/blob/e502b2c1b1aa168bb3cd10d4d5f3ab883daffabd/dokka-subprojects/plugin-base-frontend/build.gradle.kts

The main issue is that npmInstall will run even though there haven't been any changes, which can take 1+ minute on CI even if there are no changes https://ge.jetbrains.com/s/7z6wysb7gfgf6/timeline#xhxfjcjtknpfe

I have made some changes in https://github.com/Kotlin/dokka/pull/3479 that I expect will help.

Regarding the caching of NpmInstallTask, I can think of two options:

NpmInstallTask should be annotated with @DisableCachingByDefault, and update the KDoc to add the reasons you explained so clearly.

Perhaps go even further: the risk of having platform-dependent files in node_modules is too likely, and the resulting issues too subtle, and has a high risk of poor performance, and NpmInstallTask should be marked with doNotTrackState()?
Take my proposal PR https://github.com/node-gradle/gradle-node-plugin/pull/301 further, so that if someone wants to make NpmInstallTask cacheable they can.

There's a risk that someone, like myself(!), might enable caching without fully understanding the risks and costs.

adam-enko commented 8 months ago

Has @LocalState been considered for node_modules/? It looks like a good fit.

https://docs.gradle.org/8.5/userguide/custom_tasks.html#sec:storing_incremental_task_state

However, if the state files are non-relocatable, then they can’t be shared via the build cache. Indeed, when the task is loaded from cache, any such state files must be cleaned up to prevent stale state from confusing the tool during the next execution. Gradle can ensure such stale files are removed if they are declared via task.localState.register() or if a property is marked with the @LocalState annotation.

sergeykad commented 8 months ago

@adam-enko That's an interesting proposal, but is not the directory too big for that? IIRC the default artifact size limit for caching is 100MB.

deepy commented 8 months ago

Haven't had the time to look into this properly yet, but on Windows I get a 218MB node_modules from the plugin-base-frontend which I guess isn't too bad But running npm install locally with no caches whatsoever takes 15 seconds on my machine, which is desktop equipped with an ok Ryzen CPU, a NVMe SSD, a gigabit connection, and way too much memory so not exactly a very useful comparison Though I'm curious on where the bottleneck is, because you might see even worse performance with the remote cache

Good news is we can easily test that by creating a snapshot and comparing the builds Do you have a maven repository you could use for the snapshot? If not I can probably create a public one on my private nexus instance

deepy commented 8 months ago

But before that, can you add the --production flag to npmInstall or set the environment variable NODE_ENV to production and do a comparison?

deepy commented 8 months ago

That's apparently --omit=dev now

node-gradle / gradle-node-plugin

Make npmInstall and npmSetup tasks cacheable #81