praetorian-inc / noseyparker

Nosey Parker is a command-line program that finds secrets and sensitive information in textual data and Git history.
Apache License 2.0
1.56k stars 77 forks source link

Caching in GitHub Actions doesn't seem to be working right #163

Open bradlarsen opened 3 months ago

bradlarsen commented 3 months ago

Describe the bug The GitHub Actions workflows for Nosey Parker don't seem to make effective use of the Actions cache. This causes workflows to run much slower than they might otherwise.

To Reproduce For example, take a successful CI run, and then rerun it. Examine the details of the actions log. Additionally, check the build timing artifacts that are produced by the CI jobs; these give more details as to how many dependencies are being rebuilt by cargo and how long they are taking.

Expected behavior You would expect that the jobs would complete quickly, since nothing has changed and the cache has been populated.

Actual behavior The jobs build nearly everything from scratch, even when cache entries are found.

bradlarsen commented 3 months ago

The caching also doesn't seem to be working as one would hope within the Docker image builder workflows. There, however, you have an additional caching mechanism to worry about beyond just GitHub Actions cache: the Docker cache.

I believe things are plumbed properly so that Docker will attempt to use GitHub Actions' cache, but the underlying caching problem reported above is surely impacting that as well.

bradlarsen commented 3 months ago

In #164, I've improved the caching situation in regular GitHub Actions some, giving a 30-50% reduction in build times when a cache hit occurs.

To fix the caching problems in Docker, more work is involved. See here for some ideas: https://gist.github.com/noelbundick/6922d26667616e2ba5c3aff59f0824cd

bradlarsen commented 3 months ago

Note for Docker caching problems in CI: It looks like building the Debian-based Docker images does apt install every time instead of ever caching that layer, which invalidates everything following.

seqre commented 3 months ago

Screenshot 2024-04-04 at 9 14 54 AM The GitHub Cache limit might be part of the problem. I sorted all cache entries by least recently used, and all entries were used within the last day/two (even entries month old). Since we hit the limit, some caching entries are getting removed, reducing their usefulness (I think, but I'm not entirely sure).

bradlarsen commented 3 months ago

I recently split out Vectorscan into its own published crate. Now in the CI jobs, that long-building package can be cached, and cached build job times are all under 2 minutes. For example:

https://github.com/praetorian-inc/noseyparker/actions/runs/8594814073

image
bradlarsen commented 3 months ago

Relevant discussion I saw over the weekend about caching and Docker builds in GitHub Actions: https://news.ycombinator.com/item?id=39956327. It sounds like the caching scheme we are currently using (GitHub Actions cache for Docker) is doomed, as the cache is too small for the images we are building.

One possible alternative is to use the registry cache mode for the builder job. This will use an external location (like the GHCR registry) for caching, instead of GitHub Actions. Needs investigation.

bradlarsen commented 2 months ago

I experimented a bunch in #169, but failed to find a GitHub Actions incantation that resulted in Docker image builds that actually effectively cached the build stages.

Caching works better now for regular CI jobs, but for jobs that build Docker images, the caching mostly doesn't work at all.

seqre commented 2 months ago

Going back to the matter I brought up earlier, I still think the cache-from directive should be added there. It doesn't create any additional cache but only could use the cache generated by this part. Seems like a small possibility of improvement with no drawbacks.


The other problem I see with changing Docker cache mode to min is that it won't be very effective. Per official docs:

In min cache mode (the default), only layers that are exported into the resulting image are cached

The current Dockerfiles setup will cache only apt-get actions, as the final image is the runner image with the copied binary from the builder phase. To actually leverage caching, I see two options:

  1. Keep min and split builder and runner Dockerfile stages (at least for Debian, Alpine needs to be separated into more stages first) into separate files so each stage is the "final stage." This is a small improvement, but as long as dependencies won't change, it should shave off quite a lot of CI time.
  2. Go back to max and figure out the caching (probably using the registry).
bradlarsen commented 2 months ago

Going back to the https://github.com/praetorian-inc/noseyparker/pull/169#discussion_r1557906813, I still think the cache-from directive should be added there. It doesn't create any additional cache but only could use the cache generated by this part. Seems like a small possibility of improvement with no drawbacks.

Oh, I understand what you're saying about that now. Read from the cache if possible, but don't write to it in release builds. I suspect that it's not going to do anything today in terms of cache benefit. But worth a try.