gradle / gradle

Adaptable, fast automation for all
https://gradle.org
Apache License 2.0

Let multiple containers share downloaded dependencies #851

Closed bsideup closed 4 years ago

bsideup commented 7 years ago

Hi!

Looks like Gradle is locking the global cache when running the tests. We run Gradle in Docker containers, and from what I saw in the logs, it fails to acquire the lock with:

14:09:05.981 [DEBUG] [org.gradle.cache.internal.DefaultFileLockManager] The file lock is held by a different Gradle process (pid: 1, operation: ). Will attempt to ping owner at port 39422

Expected Behavior

Gradle should release the lock when it executes the tests. Other Gradle instances are failing with:

Timeout waiting to lock Plugin Resolution Cache (/root/.gradle/caches/3.2/plugin-resolution). It is currently in use by another Gradle instance.

Current Behavior

Gradle holds the lock.

Context

Our CI servers are affected. Parallel builds are impossible.

Steps to Reproduce

I managed to reproduce it with a simple Docker-based environment: https://github.com/bsideup/gradle-lock-bug

Docker and Docker Compose should be installed.

$ git clone git@github.com:bsideup/gradle-lock-bug.git
$ COMPOSE_HTTP_TIMEOUT=7200 docker-compose up --no-recreate

Your Environment

Gradle 3.2 (tried with 3.1, 3.0 and 2.12 as well), running in Docker.

oehme commented 7 years ago

I don't quite understand the use case yet. Are you running several builds at the same time on the same working directory? That will give you many other odd problems besides just the .gradle directory being locked. Just think about what happens when one of those builds runs a clean while another is trying to compile.

If you want to do different builds on the same project at the same time, I'd recommend using separate checkouts for that.

Just a minor piece of terminology: The .gradle directory inside your project is the local cache. The global caches are in the user home by default.

bsideup commented 7 years ago

@oehme We don't; I just reused the same project source to demonstrate the issue. We run different projects inside the containers at the same time, with the .gradle folder shared across them (think of a CI environment).

oehme commented 7 years ago

Got it, thanks. The Gradle user home cannot be shared between different machines. Why do you want to share it? It'll just create contention between your builds, even if this specific issue was solved.

bsideup commented 7 years ago

@oehme well, it's a bit hard to define "machine" here. We use Docker containers, on the same host.

Having to create a separate .gradle for each project sounds a bit expensive and breaks the concept of the shared global cache.

oehme commented 7 years ago

A Docker container is a machine for that matter. Its processes are isolated from the host system.

Having to create a separate .gradle for each project sounds a bit expensive and breaks the concept of the shared global cache.

I don't really understand the use case I guess. What is the reason to run the builds in docker containers, but share the user home? If you don't trust the code, then it absolutely should not have write access to the host's user home. If you trust it, then what do the docker containers buy you?

bsideup commented 7 years ago

@oehme it's not about the security.

We use Docker containers as a unified way to run different kinds of builds in our CI process, different projects might want to use different Java versions, for instance

This is a pretty common setup nowadays, I should say. Jenkins is promoting Docker builds a lot, and others are integrating Docker containers as well.

I understand the problem with "the multiple machines issue". However, this issue is more about "Why does the test executor hold a lock for so long?", because AFAIK locks in Gradle are short-lived.

oehme commented 7 years ago

Gradle processes will hold locks if they are uncontended (to gain performance). Contention is announced through inter-process communication, which does not work when the processes are isolated in Docker containers.

bsideup commented 7 years ago

Hm, back in the day it was different: Gradle was trying to release the lock as soon as possible, and I really liked that strategy. What happened? :)

oehme commented 7 years ago

The cross-process caches use file-based locking, so every lock/unlock operation is an I/O operation. Since these are expensive, we try to avoid them as much as possible.
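The cost model described above can be illustrated with plain file locks. This is a sketch using util-linux `flock(1)`; Gradle itself uses the JVM's `FileLock` API rather than `flock`, but the cross-process contention behaviour is analogous:

```shell
# One "build" takes an exclusive lock on a cache lock file and holds it,
# as an uncontended Gradle process would.
LOCKFILE=$(mktemp)
(
  flock -n 9 && echo "first: lock acquired"
  sleep 2
) 9>"$LOCKFILE" &
sleep 0.5
# A second process cannot acquire the lock while the first one holds it;
# each attempt is a real filesystem operation, which is why Gradle avoids
# releasing and re-acquiring locks unless contention is announced.
if flock -n "$LOCKFILE" -c true; then
  SECOND="acquired"
else
  SECOND="blocked"
  echo "second: lock is held by another process"
fi
wait
rm -f "$LOCKFILE"
```

In the Docker scenario from this issue, the "announce contention" step happens over a localhost TCP port, so the second process never gets to ask the first one to release.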

bsideup commented 7 years ago

Any chance to get them configurable? I would really like to disable this optimization on our CI environments. Otherwise, we just delete the lock file manually to work around the issue when long-running tests are being executed :D

oehme commented 7 years ago

We could potentially add a system property that tells Gradle to "assume contention". There might be other issues that we haven't yet discovered though, since sharing a user home between machines is not a use case we have designed for.

I'd like to assess the alternatives first: What would be the drawback if you don't share the user home?

bsideup commented 7 years ago

Huge disk space and network usage. We would have to download the same dependencies for every Gradle job type. Right now the Gradle cache takes a few GBs, but if we don't share it, we have to multiply that by the number of Gradle-based jobs we have, so the result will be tens or maybe even hundreds of GBs, which is not really acceptable for us.

oehme commented 7 years ago

I think the best next step would be for you to implement a fix for that specific problem and try it out in your environment.

My gut feeling is that there may be other issues waiting when you try to reuse the user home. If there aren't, then we could discuss introducing a flag into Gradle to opt-in to a "docker mode" :)

bsideup commented 7 years ago

@oehme ok, thanks for the link! I'll try to play around with it and will report back.

Also, there is one more option: on *nix-based systems, Gradle could use sockets to communicate. That way it should work, and Docker would allow us to mount the socket inside a container.

WDYT?

oehme commented 7 years ago

That could work as well. Let's first make sure though that the locking problem is in fact the only problem here.

saiimons commented 7 years ago

@bsideup Did you fix this? I am currently facing this issue with the same kind of setup as yours... At least, it would be nice to have an option to set the timeout.

martinda commented 7 years ago

Another use case is when a user runs multiple different builds of different projects on multiple different hosts, all under the same account. This is typical of environments with network-mounted home directories.

Gradle has to proactively release the lock as soon as it is done with the cache. I am willing to pay the price of an I/O operation to save the build from a timeout. Please see the excellent explanation in this GRADLE-3106 comment.

martinda commented 7 years ago

Just want to explain how to reproduce this problem by posting a simple build.gradle file:

task sleep() {
    doLast {
        // Hold the build (and its cache locks) for 100 seconds
        Thread.sleep(100000)
    }
}

Get two terminals on different hosts that mount the same home directory with the same ~/.gradle in it, then type gradle sleep --debug --stacktrace in both terminals. One of them will fail to acquire the lock and die waiting. The failing one will show:

The file lock is held by a different Gradle process (pid: 64549, operation: ). Will attempt to ping owner at port 40291

Of course the other process cannot be notified, it is on another host, resulting in:

Caused by: org.gradle.cache.internal.LockTimeoutException: Timeout waiting to lock file hash cache (/home/martinda/.gradle/caches/3.5/fileHashes). It is currently in use by another Gradle instance.
    Owner PID: 64549
    Our PID: 25504
    Owner Operation: 
    Our operation: 
    Lock file: /home/martinda/.gradle/caches/3.5/fileHashes/fileHashes.lock

Could it be as simple as adding the IP address of the process holding the lock to the lock file and add it to the pingOwner method?

AdrianAbraham commented 7 years ago

My team is also encountering this issue when dealing with containerized CI builds, forcing us to keep many copies of the Gradle cache. We'd love to see an option to aggressively release the lock.

bsideup commented 7 years ago

FYI: For us, the workaround was to run the container where Gradle is with --net=host, this way Gradle will be able to communicate with other instances.
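A sketch of that workaround (the image name and mount paths here are my own illustration, not from the thread):

```shell
# Host networking lets the containerized Gradle process reach the lock-owner
# TCP ports of Gradle processes in other containers on the same host, so
# lock contention can actually be announced.
DOCKER_ARGS="--rm --net=host -v $HOME/.gradle:/root/.gradle -v $PWD:/work -w /work"
if command -v docker >/dev/null; then
  docker run $DOCKER_ARGS gradle:4.0 gradle build \
    || echo "build failed (expected if the image is unavailable)"
else
  echo "docker not available; command shown for illustration"
fi
```

Note that this trades away network isolation between containers, which is why it does not suit every setup (see the port-conflict concern later in the thread).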

saiimons commented 7 years ago

My workaround is setting up a maven repository acting as a proxy (and cache) and loading some init.gradle in the build container. This allows us to keep one cache instead of multiple ones, and there is no conflict.

AdrianAbraham commented 7 years ago

@bsideup That sounds good, though I don't know if I'll be able to convince my CI runner to do that ... I'll try it out.

@saiimons Could you go into more detail about your workaround?

saiimons commented 7 years ago

@AdrianAbraham I run a Nexus repository with a proxy configuration for the major Maven repositories (JCenter, Maven Central, etc.; check your log for the URLs).

[screenshot: proxy repositories configured in Nexus]

All these guys go behind a group, in order to use a single URL:

[screenshot: the proxy repositories grouped behind a single URL]

Then my build container will load an init.gradle file (in /opt/gradle/init.d/, as I am using this image for CI):

allprojects {
  buildscript {
    repositories {
      mavenLocal()
      maven {
        url "http://nexus:8081/repository/global_proxy/"
      }
    }
  }
  repositories {
    mavenLocal()
    maven {
      url "http://nexus:8081/repository/global_proxy/"
    }
  }
}

AdrianAbraham commented 7 years ago

@saiimons We're running a local Nexus, and our configuration is similar (no mavenCentral(), though); but Gradle still has to download the packages from Nexus into its own cache to run a build. Does your setup just avoid Internet downloads? Or does it avoid the Gradle cache itself?

saiimons commented 7 years ago

Yes, Gradle still downloads the packages, but as the storage is local, the latency is small and no external bandwidth is consumed. My builds sped up and the need for external resources was reduced (we were able to build when DNS servers were DDoSed in October and most of the repositories were unreachable).

hexsel commented 7 years ago

We're doing something similar, but using automatically-generated docker containers for each branch (this is handled by Jenkins 2.0 pipelines and it's somewhat flexible). We still have to download 2GB or so worth of jars for every build of every branch, and keep storage for all of these until Jenkins auto-cleans them a few days later.

It's not a deal breaker, but it's a serious inconvenience. I could live with an occasional contention on the folders if it made the downloads unnecessary.

Is there any way to configure the timeout for the cache?

andrask commented 7 years ago

We used to solve this issue by generating a base image that cached most of the dependencies. This base image was rebuilt every day, thus, the actual difference between the current and cached dependencies was kept small.
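That scheme can be sketched as a small daily job (the file name, base image tag, and registry name below are illustrative, not from the thread):

```shell
# Bake the current dependency set into a base image, rebuilt daily so the
# cached dependencies stay close to what builds actually need.
cat > Dockerfile.deps <<'EOF'
FROM gradle:jdk8
# Copy only the build scripts so this layer is reused until they change
COPY build.gradle settings.gradle /src/
WORKDIR /src
# Resolving the dependency graph populates /home/gradle/.gradle in the image
RUN gradle --no-daemon dependencies
EOF
if command -v docker >/dev/null; then
  docker build -f Dockerfile.deps -t registry.example.com/gradle-deps:daily . \
    || echo "docker build failed (expected without network access)"
else
  echo "docker not available; Dockerfile shown for illustration"
fi
```

Each CI build then starts FROM this base image and only downloads the delta since the last daily rebuild.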

In our new project, however, we are facing the same issue as @bsideup, as we want to share the cache between containers (the scheme above would work for this). At the same time, we are limited by the fact that concurrent builds cannot use the same cache. Actually, the real issue is not even the cache size or the network usage, but the serialized download of artifacts. We have a 10G connection between the servers (I even put the repos on a RAM disk), yet network utilization is low because the dependencies are downloaded sequentially. Probably even over separate TCP connections, resulting in a lot of time wasted on TCP ramp-up? This may be a consequence of using Maven repos. The real solution would probably be a server-side resolver framework that could return all the artifact URLs at once.

oehme commented 7 years ago

Actually, the real issue is not even the cache size or the network usage, but the serialized download of artifacts.

Gradle 4.0-milestone-2 downloads metadata and artifacts in parallel, you might wanna give that a try.

andrask commented 7 years ago

Gradle 4.0-milestone-2 downloads metadata and artifacts in parallel, you might wanna give that a try.

Definitely very promising. Once 4.0 comes out, I'll push for the upgrade.

zageyiff commented 7 years ago

Hello, we have a similar setup: Docker containers as Jenkins slaves. We're trying to have the Gradle user home as a mounted volume on the Docker container so all containers reuse the Gradle cache and avoid downloading the common 3pp dependencies.

Using Gradle 4 RC 3 still does not fix the issue: FAILURE: Build failed with an exception.

As of now, we need to set a unique Gradle user home per container, so each build downloads all dependencies. We're not able to use --net=host, as we have services in the container that are used as part of integration tests (Postgres) and we would need a different port per container.

mkobit commented 7 years ago

We see this happen periodically in our Jenkins build fleet as well.

The general use pattern is:

We then sometimes see the same error mentioned by @zageyiff for users' builds that execute on that node.

Could not create service of type FileHasher using GradleUserHomeScopeServices.createCachingFileHasher()

Some builds execute successfully on a node while others may not. They may fail or be manually killed. Then, a Gradle execution will finish with the error above.

I don't have an easy way to reproduce it yet, but if there is anything useful I can provide here let me know.

gayakwad commented 7 years ago

Is it possible to point Gradle to an additional read-only local cache? If this can be achieved, a read-only volume can be mounted on a container, which avoids several user- and permission-related problems as well.

RMsiemens commented 7 years ago

We observe this on our docker based Gitlab runners as well, as we also have one shared cache for all runners. We share the cache by mounting the cache folder into every docker container.

We now have a workaround in place: we no longer share the cache but have individual caches in each Docker container. This results in slower build times and uses a lot of resources (storage, bandwidth, ...).

It would be nice if the proposed new Gradle system property to "assume contention" could be implemented.

dforrest88 commented 7 years ago

This issue is causing a lot of pain for me as well. We're trying not to rely on build node dependencies, and thus run all of our Gradle tasks inside Docker containers. Given that we also have several Gradle utilities that we use to execute fine-grained build and environment tasks with a high degree of parallelism, each task takes significantly longer than it should because it has to fetch all of the dependencies. It would be nice to be able to mount the Gradle home from the build node each time to leverage the shared cache, similarly to how it can be done in Maven.

Does anyone have an example for how they implemented pre-cached containers? Do you link a data container on gradle home that is pre-populated with the expected cache?

aaroncline commented 6 years ago

Similar to @RMsiemens, I'm looking at using a multi-stage Docker build on Gitlab Runners to build Java apps and a subsequent application container. I also plan to use Gitlab Runners to run CI tests on commits and merged in Docker containers. I want to do this to more easily offer multiple Java versions to our dev teams.

I would appreciate a gradle cache sharing solution allowing me to have a common cache on the Gitlab Runners that I can share to multiple Docker containers running builds or tests that may be happening at the same time. We use a remote artifact store and will have to pay for massive amounts of bandwidth to perform the download of dependencies each time a new build or merge happens and all of the dependencies are downloaded.

mdekstrand commented 6 years ago

Another use case for this is using Gradle for automation in an HPC environment, particularly with different tasks that the user may want to run in parallel in a computing cluster.

These clusters typically have high-performance networked file systems.

Right now, wrangling Gradle to work on HPC systems is not easy.

costincaraivan commented 6 years ago

I hope this can be reprioritized. Building in Docker containers is becoming more and more common, especially for CI, and since containers usually start from a "clean" state, this means that each and every build triggers an artifact download. The alternative is to bake the Gradle cache in the container, which is really hacky and also makes the containers really big.

The cleanest solution, by far, is Gradle being able to share the same cache without any conflicts between multiple instances. I appreciate Gradle, but c'mon folks, (conceptually) it shouldn't be that hard, Maven 2 was doing it without breaking a sweat back in 2008 😢

Rushit commented 6 years ago

We are facing the exact same problem. Please let us know when we should expect a fix for this.

martino-letgo commented 6 years ago

Hello guys, I am facing the same issues while trying to run multiple Docker containers that share the same cache. Any workaround?

sschuberth commented 6 years ago

A possible work-around that was communicated to us by Gradle support is:

[...] by running a remote cache container per docker host and arranging for the containers doing builds with the build tool to use that endpoint for remote caching. Gradle Enterprise also has a high performance shared remote cache backend that you could try out for free.

And:

We make a docker container available of a remote cache: https://hub.docker.com/r/gradle/build-cache-node/

Tapchicoma commented 6 years ago

But that sounds more like caching task outputs, not caching the dependencies.

Also, a possible workaround would be spinning up an Artifactory proxy image on the same network as CI.

anderslauri commented 6 years ago

To work around this issue we let our containers (Jenkins nodes) rsync the entire Gradle cache to a common persistent volume when they are destroyed after each unique build process (rsync is very efficient at not copying existing files). During startup of a container, the cache is reloaded from the persistent volume, again using rsync, into memory on the Jenkins node. It is a simple solution using mere shell scripting, but it is effective for us, and rsync deals with most of the complexity. In combination, we also use the Gradle build cache and a local Nexus within our CI/CD cluster, but those changes only made minor performance improvements compared to using a local in-memory Gradle cache.

However, it would be nice if files were not locked when only reading from the cache, so that a common persistent volume could be shared read-only across all our containers.

FrailWords commented 6 years ago

Another workaround, and not a good one, is to set a different local cache folder for each Gradle build: http://mrhaki.blogspot.in/2017/04/gradle-goodness-change-local-build.html. This is not optimal, but it is simple.

Moritz90 commented 6 years ago

FWIW, I'm getting errors related to the GRADLE_USER_HOME even though I'm using different GRADLE_USER_HOMEs for each job. Builds will sporadically fail with the message:

Could not create service of type FileHasher using GradleUserHomeScopeServices.createCachingFileHasher()

All GRADLE_USER_HOMEs are on NFS.

It seems like Gradle's caching issues are not limited to concurrent use of the same GRADLE_USER_HOME.

marmax commented 6 years ago

@anderslauri would you mind sharing the scripts (Jenkinsfile, bash, whatever), if possible?

anderslauri commented 6 years ago

@marmax

It's simple, stupid simple. Below is the snippet from our OpenShift DeploymentConfiguration for our Jenkins node (OpenShift pod), which executes the following script once it gets killed after each unique Jenkins build (the script has a default timeout of 30 seconds, which can be increased).

  - image: nexus.kafkaint.fhm.de/stargate/jenkins-node:latest
    lifecycle:
      preStop:
        exec:
          command: ["/home/jenkins/bin/stop"]

The stop script contains the following, where the variable GRADLE_CACHE is the in-memory Gradle cache directory on the pod, and the respective NFS variables are PVC-mounted directories on the pod.

if [ -d "${GRADLE_CACHE}" -a -d "${GRADLE_CACHE_NFS}" ]; then
  rsync --whole-file --ignore-existing --recursive "${GRADLE_CACHE}/" "${GRADLE_CACHE_NFS}"
fi

if [ -d "${SONAR_CACHE}" -a -d "${SONAR_CACHE_NFS}" ]; then
  rsync --whole-file --ignore-existing --recursive "${SONAR_CACHE}/" "${SONAR_CACHE_NFS}"
fi

When a new pod is started in OpenShift, the entrypoint script contains the snippet below. All of this works well for us; we use a cronjob in OpenShift to clean the cache on the NFS once a month to keep the cache relevant. There have not been any performance issues that we have noticed; rsync is very efficient, and our Gradle cache is around 3-4 GB.

  # Synchronize the Gradle cache to the container.
  if [ -d "${GRADLE_CACHE_NFS}" -a -d "${GRADLE_CACHE}" ]; then
    nohup rsync --whole-file --ignore-existing --recursive "${GRADLE_CACHE_NFS}/" "${GRADLE_CACHE}" > /dev/null 2>&1 &
  fi

  # Synchronize the SonarQube cache to the container.
  if [ -d "${SONAR_CACHE_NFS}" -a -d "${SONAR_CACHE}" ]; then
    nohup rsync --whole-file --ignore-existing --recursive "${SONAR_CACHE_NFS}/" "${SONAR_CACHE}" > /dev/null 2>&1 &
  fi

ntkoopman commented 6 years ago

Is there any stance from the Gradle team about this? Is this something that should be supported or are we supposed to go all in on the build cache and a local repository? All I see here are workarounds.

oehme commented 6 years ago

The answer depends on the problem you are trying to solve. Each of them might have different solutions.

  1. Dependencies being re-downloaded? Use Nexus/Artifactory close to your build agents (e.g. as a container on the same machine). We might also put downloaded dependencies into the Gradle build cache to serve a similar purpose.
  2. Too much disk space being used because each agent stores its own copy of the dependencies? If you use ephemeral agents that are cleaned up after each build this should not be an issue. If you use long-lived agents we could work against the cache growth by implementing something for #1085
  3. You want to share configuration like gradle.properties or init scripts? Only copy those into the agents. We might separate a configuration dir and a cache dir in the future.

Any other use cases I'm missing?
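Option 1 from the list above can be sketched like this (the container, network, and repository names are illustrative, chosen to match the init.gradle example earlier in the thread):

```shell
# A repository proxy on the same Docker network as the build agents keeps
# dependency downloads on the local network instead of the internet.
PROXY_URL="http://nexus:8081/repository/global_proxy/"
if command -v docker >/dev/null; then
  docker network create ci-net 2>/dev/null || true
  docker run -d --name nexus --network ci-net sonatype/nexus3 \
    || echo "could not start nexus (it may already be running)"
  # Build containers join the same network and resolve via $PROXY_URL
  # through an init script such as the one shown earlier in the thread.
else
  echo "docker not available; commands shown for illustration"
fi
```

This addresses re-downloading and bandwidth, but each container still keeps its own copy of the resolved artifacts, which is the disk-space concern raised in point 2.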

bsideup commented 6 years ago

@oehme yes. The reported one :D

  1. Concurrent access to ~/.gradle folder from different Gradle daemons / instances.

Currently, they use localhost to communicate with each other. If you changed that to a file socket, for instance, it shouldn't be a problem anymore.

Current workaround for Docker users: run your containers with --net=host so that Gradle instances can communicate with each other.