gradle / actions

A collection of GitHub Actions to accelerate your Gradle Builds on GitHub
https://github.com/marketplace/actions/build-with-gradle
MIT License
147 stars 37 forks source link

setup-gradle: GUH cache key collision when used in a reusable workflow #316

Open cloudshiftchris opened 1 month ago

cloudshiftchris commented 1 month ago

Repro: https://github.com/cloudshiftinc/gradle-setup-action-cache-repro/actions/runs/10067337462

That repository consists of a small "Hello World" Gradle / Kotlin project, with two build files: 1) Build with Kotlin 1.8.22; 2) Build with Kotlin 1.9.25;

There's a GHA workflow that runs two steps, sequentially, using each of the above build files. Each step checks out the code, calls setup-gradle (v4 beta 1), and executes the respective Gradle build file (build.gradle.kts or build-second.gradle.kts).

The second build "cleans up" artifacts it didn't use from the first build:

Build cache (/home/runner/.gradle/caches/build-cache-1) removing files not accessed on or after Tue Jul 23 22:45:31 UTC 2024.
Build cache (/home/runner/.gradle/caches/build-cache-1) cleanup deleted 0 files/directories.
Build cache (/home/runner/.gradle/caches/build-cache-1) cleaned up in 0.001 secs.
groovy-dsl (/home/runner/.gradle/caches/8.9/groovy-dsl) removing files not accessed on or after Tue Jul 23 22:45:31 UTC 2024.
groovy-dsl (/home/runner/.gradle/caches/8.9/groovy-dsl) cleanup deleted 2 files/directories.
groovy-dsl (/home/runner/.gradle/caches/8.9/groovy-dsl) cleaned up in 0.01 secs.
Artifact transforms cache (/home/runner/.gradle/caches/8.9/transforms) removing files not accessed on or after Tue Jul 23 22:45:31 UTC 2024.
Artifact transforms cache (/home/runner/.gradle/caches/8.9/transforms) cleanup deleted 0 files/directories.
Artifact transforms cache (/home/runner/.gradle/caches/8.9/transforms) cleaned up in 0.027 secs.
jars (/home/runner/.gradle/caches/jars-9) cleanup deleted 0 files/directories.
jars (/home/runner/.gradle/caches/jars-9) removing files not accessed on or after Tue Jul 23 22:45:31 UTC 2024.
jars (/home/runner/.gradle/caches/jars-9) cleanup deleted 1 files/directories.
jars (/home/runner/.gradle/caches/jars-9) cleaned up in 0.003 secs.
artifact cache (/home/runner/.gradle/caches/modules-2) cleanup deleted 0 files/directories.
artifact cache (/home/runner/.gradle/caches/modules-2) [subdir: /home/runner/.gradle/caches/modules-2/resources-2.1] cleanup deleted 0 files/directories.
artifact cache (/home/runner/.gradle/caches/modules-2) [subdir: /home/runner/.gradle/caches/modules-2/resources-2.1] removing files not accessed on or after Tue Jul 23 22:45:31 UTC 2024.
artifact cache (/home/runner/.gradle/caches/modules-2) [subdir: /home/runner/.gradle/caches/modules-2/resources-2.1] cleanup deleted 0 files/directories.
artifact cache (/home/runner/.gradle/caches/modules-2) [subdir: /home/runner/.gradle/caches/modules-2/files-2.1] cleanup deleted 0 files/directories.
artifact cache (/home/runner/.gradle/caches/modules-2) [subdir: /home/runner/.gradle/caches/modules-2/files-2.1] removing files not accessed on or after Tue Jul 23 22:45:31 UTC 2024.
artifact cache (/home/runner/.gradle/caches/modules-2) [subdir: /home/runner/.gradle/caches/modules-2/files-2.1] cleanup deleted 42 files/directories.
artifact cache (/home/runner/.gradle/caches/modules-2) [subdir: /home/runner/.gradle/caches/modules-2/metadata-2.106] cleanup deleted 0 files/directories.
artifact cache (/home/runner/.gradle/caches/modules-2) cleanup deleted 0 files/directories.
artifact cache (/home/runner/.gradle/caches/modules-2) cleaned up in 0.041 secs.

This isn't desirable in a multi-step setup - we'd want to retain all the cached entries and only cleanup at the end.

For reasons I don't currently understand the cache isn't updated after the cleanup in the second build, even though the cleanup removed entries:

Entry: /home/runner/.gradle/caches/modules-*/files-*/*/*/*/*
    Requested Key : gradle-dependencies-v1-40c681dd67f1cd4e62478d3e197f1e6d
    Restored  Key : gradle-dependencies-v1-40c681dd67f1cd4e62478d3e197f1e6d
              Size: 88 MB (92509727 B)
              (Entry restored: exact match found)
    Saved     Key : 
              Size: 
              (Entry not saved: referencing 'Gradle User Home' cache entry not saved)

The issue starts here where setup-action stores the current timestamp, and then in the post-build cleanup removes everything older than that. This assumes a closed-loop where it's a self-contained build, which doesn't work for shared caches (dependencies, wrappers, etc) that other builds will use/update/cleanup.

Not sure what the best solution may be - perhaps loosen the cleanup windows to "older than a day"; not thrilled about making it configurable, as one wouldn't know what to reasonably put (is 10m too long? what if the execution time varies? and so on).

If there's a way to get the "start of the workflow" timestamp that may be more deterministic, though that still leaves holes where there are different workflows running that share those caches - one may wish to save the shared caches for longer periods (e.g. workflows that use those caches that run weekly, for example).

bigdaz commented 1 month ago

As far as I can see, this is working as expected.

Note that when you re-run a Job, it doesn't get a new git SHA and thus you don't get a new cache key for Gradle User Home. This is why you see the "Entry not saved: referencing 'Gradle User Home' cache entry not saved". To test this properly, you'll need to push a dummy commit for each run.

Please check out https://github.com/gradle/actions/blob/main/docs/setup-gradle.md#how-gradle-user-home-caching-works for more details on how this works.

** There are definitely ways this could be made more efficient, if we either stored one big entry with all of the dependencies for all of the jobs, or by extracting the common dependencies into one entry and only storing the unique dependencies per-job.

cloudshiftchris commented 1 month ago

Thanks for the clarification; have re-run tests (with dummy commits) - still seeing cleanups happening but presumably these relate to unused items within that build (e.g. removing the unused Groovy DSLs as its a Kotlin DSL script). Subsequent runs don't redownload artifacts, so good there.

Familiar with the documentation - what led me to believe that say dependencies were a shared cache across jobs:

1) The cache keys, as shown in the GHA cache summary, don't include job/step names, e.g. gradle-dependencies-v1-dadd467fc5b383227281c888d7a80dbc vs gradle-home-v1|Linux-X64|build-second[27df40b80d25f4e0c5be504ed712742e]-dd00fdf2b4588b9de17b10ade0eb6d7891e723b3, indicating (incorrectly) that these are separate from the job. (it isn't obvious that the opaque hash key represents the set of deps)

2) The documentation has a whole section on cache deduplication that lists certain caches as being "independent" which could be interpreted as "shared across jobs" (the bottom sentence there reinforces that, and indeed some caches are shared across jobs).

If these aren't indeed "shared" caches the documentation should be clearer on that - perhaps a table indicating what items are shared across jobs (where all other attributes match) etc.

bigdaz commented 1 month ago

The dependencies caches are shared between Jobs, if those Jobs have the exact same set of dependencies resolved.

I might want to clarify the docs a little:

If the setup-gradle action were smarter, it could use a "layered" approach where the set of dependencies that is commmon between Jobs would be extracted into a shared entry, and a second entry would have the dependencies that are unique to the Job.

bigdaz commented 1 month ago

I did try to have a separate cache entry for each dependency, for maximal cache storage efficiency. This approach quickly led to 429 errors in the cache: GHA cache doesn't like too many small entries.

cloudshiftchris commented 1 month ago

lol, yea, can see that being a challenge. Lots of complexities here. Thanks for continuing to evolve this.

aSemy commented 1 month ago

Thank you both for investigating and explaining.

I understand that the caches are saved 'per job', and I wonder how that works with re-usable workflows?

Kotest uses a re-useable workflow to call Gradle. The Gradle workflow is called twice in the same job: once to generate a JVM API dump (using BCV), and then again to run all Kotlin Multiplatform tests.

BCV only needs JVM dependencies, while the KMP tests require all KMP dependencies.

How would the dependencies cache for the job be populated? Would a single one be re-used for both invocations of the re-usable workflow, or a separate one each? Could it cause the dependency cache to be unnecessarily discarded?

cloudshiftchris commented 1 month ago

Here's a Kotest run that shows the problem.

Specifically, this workflow snippet:

jobs:

   validate-api:
      name: Validate API
      if: github.repository == 'kotest/kotest'
      uses: ./.github/workflows/run-gradle.yml
      secrets: inherit
      with:
         runs-on: ubuntu-latest
         ref: ${{ inputs.ref }}
         task: apiCheck

   validate-primary:
      name: Validate on primary runner
      if: github.repository == 'kotest/kotest'
      needs: validate-api
      uses: ./.github/workflows/run-gradle.yml
      secrets: inherit
      with:
         runs-on: ubuntu-latest
         ref: ${{ inputs.ref }}
         task: check

...doesn't store cache information on the second job, many entries such as:

Entry: /home/runner/.gradle/caches/modules-*/files-*/*/*/*/*
    Requested Key : gradle-dependencies-v1-a702ec8f84890c954ac77cc8b9b7e1be
    Restored  Key : gradle-dependencies-v1-a702ec8f84890c954ac77cc8b9b7e1be
              Size: 445 MB (466172742 B)
              (Entry restored: exact match found)
    Saved     Key : 
              Size: 
              (Entry not saved: referencing 'Gradle User Home' cache entry not saved)

...perhaps due to Gradle User Home cache not changing (same SHA, same OS, ...):

Entry: Gradle User Home
    Requested Key : gradle-home-v1|Linux|run-tests[2c202e428a822dfa2e29a4f7ab04d507]-0054608c69772bd253aec319bedc2231983e13f1
    Restored  Key : gradle-home-v1|Linux|run-tests[2c202e428a822dfa2e29a4f7ab04d507]-0054608c69772bd253aec319bedc2231983e13f1
              Size: 1 MB (791418 B)
              (Entry restored: exact match found)
    Saved     Key : 
              Size: 
              (Entry not saved: cache key not changed)
cloudshiftchris commented 1 month ago

Ahh....

This cache key for GUH: gradle-home-v1|Linux|run-tests[2c202e428a822dfa2e29a4f7ab04d507]-0054608c69772bd253aec319bedc2231983e13f1 contains run-tests as the job name - would expect it to change, to be validate-api or validate-primary.

run-tests is the job name in the reusable Github workflow that setups & executes Gradle via uses: ./.github/workflows/run-gradle.yml.

So it would appear that setup-gradle has the immediate job name, but not the overall context and we're colliding as a result.

cloudshiftchris commented 1 month ago

(this is the issue I thought was a race condition in the initial part of this ticket)

bigdaz commented 1 month ago

@cloudshiftchris Thanks for investigating further. This indeed looks like a bug, but I'm not sure how to fix it.

The full identifier of the "Job" is run-tests[2c202e428a822dfa2e29a4f7ab04d507]: the part in brackets should uniquely identify different jobs with the same name, generated here. It takes into account the following:

Unfortunately this doesn't differentiate between 2 Jobs with the same name in the same workflow, which is what you have.

Although validate-api ("Validate API") and validate-primary ("Validate on primary runner") are declared in the workflow file and are visible in the UI, there's no obvious way to access these values during workflow execution. If you can find a way, then it should be possible to include these in the cache key.

cloudshiftchris commented 1 month ago

Oh wow, this (getting the full job name) is way more complicated than it should be.

There are a couple of open issues on this:

Setup repo to reproduce this (one workflow, two jobs that call a reusable workflow) with lots of debugging - the only place the full job name is evident is in an early job entry: Complete job name: build-first / run-tests - none of the contexts or env vars have the build-first portion, only the current job name of run-tests.

This snippet returns something that may be useful (it has build-first / run-tests) - haven't yet grokked how to cleanly correlate the returned json to the current job, and don't know if API calls like this are appropriate for setup-gradle

      - name: grab job id
        run: |
          set -x
          job_url="https://api.github.com/repos/$GITHUB_REPOSITORY/actions/runs/$GITHUB_RUN_ID/jobs"
          job_json=$(curl -v -s -H "Authorization: token ${{ github.token }}" $job_url | jq)
          job_id=$(curl -v -s -H "Authorization: token ${{ github.token }}" $job_url | jq -r '.jobs[] | .id')
          echo $job_json
          echo $job_id

Log output from run here.

aSemy commented 1 month ago

Would using the concurrency group be possible, and useful?

It's possible to encode all inputs of the workflow as a concurrency group using ${{ join(inputs.*) }}.

Example: https://github.com/adamko-dev/dokkatoo/blob/2d1e8902ce550c1df8bad9c2be540eb7e9258097/.github/workflows/run_gradle_task.yml#L37-L40

cloudshiftchris commented 1 month ago

Good thoughts.

Looked at using input - it doesn't fully address the issue. Collisions would still occur if the same reusable workflow was invoked more than once with the same inputs.

Tried with concurrency-group - not seeing that visible anywhere in the available information; less that ideal for users to have to configure that (and know to do so) to avoid cache collisions.

What we really need is the full job name, as shown in the UI - validate-api / run-tests, validate-primary / run-tests - to indicate that "cache is for this specific job".

aSemy commented 1 month ago

Collisions would still occur if the same reusable workflow was invoked more than once with the same inputs.

There would be a collision, but would it be a problem? Am I right in thinking that the resulting cache entry will unnecessarily be updated, but at least it should contain the same dependencies?

Also, is it likely that a re-useable workflow with the same inputs will be called twice?

cloudshiftchris commented 1 month ago

Am I right in thinking that the resulting cache entry will unnecessarily be updated, but at least it should contain the same dependencies?

What happens on a GUH collision is that the cache is not updated (for the second job) - it loads cache entries as they existed for the first run, but doesn't store anything on the premise that, because the job key is the same, everything else must be the same (which would work if the job was uniquely identified).

Could imagine scenarios where reusable workflows are called multiple times with the same inputs (or no inputs, if all one desires is the side-effects). There's drawbacks to using inputs - if they change it would invalidate the Gradle cache, which isn't the intention as its the same job (but we don't have a full job id/name to determine that).

bigdaz commented 1 month ago

The cache entry for a key is immutable once written. So there's no way to "update" the cache entry without creating a new cache key. This is why we try hard to create a unique cache key for each Job execution (using workflow-name + job-name + matrix-values + Git SHA).

To avoid a massive explosion of cache usage, we first extract any content that we think might be shared between cache entries. These extracted entries are keyed only on their content, so you'll only get sharing if the content is exactly the same between to Job runs. This is commonly the case for subsequent runs of the same Gradle build (same Job), and less commonly occurs for different Jobs with similar content.

@cloudshiftchris I'd be interested to see how things work if the calling Jobs (validate-api, validate-primary somehow had different Matrix parameters defined (even if these aren't used). The cache key should include a hash of the Matrix values for a Job execution, but I'm not sure if these are also lost in a reusable workflow.

bigdaz commented 1 month ago

It's worth reading How Gradle User Home caching works, as it describes the properties of the GitHub Actions cache that led to the current model of caching in setup-gradle.

bigdaz commented 1 month ago

Oh wow, this (getting the full job name) is way more complicated than it should be.

:-)

There are a couple of open issues on this:

Setup repo to reproduce this (one workflow, two jobs that call a reusable workflow) with lots of debugging - the only place the full job name is evident is in an early job entry: Complete job name: build-first / run-tests - none of the contexts or env vars have the build-first portion, only the current job name of run-tests.

Thanks so much for taking the time to research. This really helps me out.

This snippet returns something that may be useful (it has build-first / run-tests) - haven't yet grokked how to cleanly correlate the returned json to the current job, and don't know if API calls like this are appropriate for setup-gradle

      - name: grab job id
        run: |
          set -x
          job_url="https://api.github.com/repos/$GITHUB_REPOSITORY/actions/runs/$GITHUB_RUN_ID/jobs"
          job_json=$(curl -v -s -H "Authorization: token ${{ github.token }}" $job_url | jq)
          job_id=$(curl -v -s -H "Authorization: token ${{ github.token }}" $job_url | jq -r '.jobs[] | .id')
          echo $job_json
          echo $job_id

Log output from run here.

I think we might be able to adapt something like this to run inside the action. While I'd prefer not to have to make an API call, if it can provide all of the context required to create a unique cache key, it could be worth it.

Can you try adding some matrix parameters to your test and see if they are also included in the JSON? I'm currently accessing them like this, which works but ideally I wouldn't need this layer of indirection.

bigdaz commented 1 month ago

Ah, now I see that I can only obtain the set of all Jobs running, and there's no way that I can tell to determine which one matches the currently executing Job!

https://stackoverflow.com/questions/71240338/obtain-job-id-from-a-workflow-run-using-contexts

cloudshiftchris commented 1 month ago

Yea, that's where I ran into the brick wall as well, its unclear / imprecise how to match things up :(

cloudshiftchris commented 1 month ago

@bigdaz

Can you try adding some matrix parameters to your test and see if they are also included in the JSON? I'm currently accessing them like this, which works but ideally I wouldn't need this layer of indirection.

Tried two variations on matrix jobs here; findings:

1) Reusable workflows (see sample below) show no usable indication of the matrix values; as with the job id above the matrix information is only available in the "Complete job name" debug output and the "jobs" api calls that we're unable to correlate with. This include using the same method as setup-gradle uses - its likely (subtly) broken for reusable workflows.

  build-matrix:
    strategy:
      matrix:
        a-matrix-dimension: [ 10, 12 ]
    uses: ./.github/workflows/run-gradle.yml
      - name: Dump matrix context
        env:
          MATRIX_CONTEXT: ${{ toJson(matrix) }}
        run: echo "$MATRIX_CONTEXT"
##[group]Run echo "$MATRIX_CONTEXT"
echo "$MATRIX_CONTEXT"
shell: /usr/bin/bash -e {0}
env:
  MATRIX_CONTEXT: null
##[endgroup]

2) Jobs that have steps (sample below) can directly see the matrix context:

  build-matrix-2:
    strategy:
      matrix:
        a-matrix-dimension: [ 10, 12 ]
    runs-on: ubuntu-latest
    steps:
      - name: Dump matrix context
        env:
          MATRIX_CONTEXT: ${{ toJson(matrix) }}
        run: echo "$MATRIX_CONTEXT"
      - name: Dump matrix context
        env:
          MATRIX_CONTEXT: ${{ toJson(matrix) }}
        run: echo "$MATRIX_CONTEXT"
2024-08-01T14:07:21.1119211Z ##[group]Run echo "$MATRIX_CONTEXT"
2024-08-01T14:07:21.1120721Z echo "$MATRIX_CONTEXT"
2024-08-01T14:07:21.1469240Z shell: /usr/bin/bash -e {0}
2024-08-01T14:07:21.1469770Z env:
2024-08-01T14:07:21.1470353Z   MATRIX_CONTEXT: {
  "a-matrix-dimension": 10
}
2024-08-01T14:07:21.1470948Z ##[endgroup]
2024-08-01T14:07:21.1873715Z {
2024-08-01T14:07:21.1875356Z   "a-matrix-dimension": 10
2024-08-01T14:07:21.1875945Z }
bigdaz commented 1 month ago

After a bit more investigation, I'm convinced that we should be including the inputs for a reusable workflow in the Gradle User Home cache key, similar to how we currently include the matrix values. This would allow us to differentiate between different "instances" of the same Job in the workflow.

As far as I can tell, the only way to differentiate between different calls to a reusable workflow are via inputs: environment variables aren't automatically passed in, so I think it's fair to differentiate based on inputs.

I forked the reproducer to include more dimensions in the workflow:

Here's a run with the default configuration. Each of the * / run-tests (--scan) jobs shares a cache key, as do each of the * / run-tests (--stacktrace) jobs. This makes sense, since only the [--scan, --stacktrace] matrix is being supplied in the workflow-job-context input parameter.

Here's a workflow run where the workflow inputs are explicitly added to the workflow-job-context parameter. Each of the Job executions has a unique Gradle User Home cache key, as desired.

Unfortunately, I can't (yet) find a way to access the inputs directly from the action implementation, and this needs to be configured by the user. I can access the matrix values like this, but adding the inputs here seems to have no impact. Grrrr.

So we have a solution for reusable workflows, but it requires a user to know this is required and to supply the inputs specifically to the action. (Note that passing in the "complete job name" value would also work: just need something that uniquely identifies the job execution).