mozilla / sccache

Sccache is a ccache-like tool. It is used as a compiler wrapper and avoids compilation when possible. Sccache has the capability to utilize caching in remote storage environments, including various cloud storage options, or alternatively, in local storage.
Apache License 2.0
5.8k stars 546 forks source link

GitHub Actions cache is surprisingly slow #1485

Open indygreg opened 1 year ago

indygreg commented 1 year ago

I was excited to see the introduction of the GitHub Actions cache backend. However, when I tried it on a test branch, the results were surprisingly slow.

See the initial run at https://github.com/indygreg/apple-platform-rs/actions/runs/3723171428/jobs/6314513850. Here, a build job took ~1 hour. (It normally takes ~10m.) The Stop sccache step displays sccache server stats. This particular job shows 44 cache errors. Unsure if that is possibly the cause of problems?

I did another CI run where I cranked up logging to debug and printed the log file after the build. See e.g. https://github.com/indygreg/apple-platform-rs/actions/runs/3723354206/jobs/6314814370 for a slow build. Interestingly, some of the jobs in this run were fast! See https://github.com/indygreg/apple-platform-rs/actions/runs/3723354206/jobs/6314814955 for a fast one. Spot checking the logs, I think there is a positive correlation between the cache errors count and job time.

Poking around the logs, I see occurrences of DEBUG sccache::compiler::compiler] [aws_http]: Cache write error: Cache service responded with 429 Too Many Requests: {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}. So I guess my project is too big to use GitHub Actions cache???

More worrisome though is that some of the cache writes take 10+s. e.g.

2022-12-18T05:13:25.3938134Z [2022-12-18T05:08:46Z DEBUG hyper::client::pool] reuse idle connection for ("https", artifactcache.actions.githubusercontent.com)
2022-12-18T05:13:25.3938390Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::io] flushed 2318 bytes
2022-12-18T05:13:25.3938682Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::io] parsed 18 headers
2022-12-18T05:13:25.3939035Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::conn] incoming body is chunked encoding
2022-12-18T05:13:25.3939330Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::decode] incoming chunked header: 0x3 (3 bytes)
2022-12-18T05:13:25.3939686Z [2022-12-18T05:08:46Z WARN  reqwest_retry::middleware] Retry attempt #1. Sleeping 53.097849ms before the next attempt
2022-12-18T05:13:25.3940103Z [2022-12-18T05:08:46Z DEBUG hyper::client::pool] reuse idle connection for ("https", artifactcache.actions.githubusercontent.com)
2022-12-18T05:13:25.3940353Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::io] flushed 2318 bytes
2022-12-18T05:13:25.3940598Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::io] parsed 18 headers
2022-12-18T05:13:25.3940878Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::conn] incoming body is chunked encoding
2022-12-18T05:13:25.3941175Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::decode] incoming chunked header: 0x3 (3 bytes)
2022-12-18T05:13:25.3941508Z [2022-12-18T05:08:46Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; otel.status_code="ERROR"
2022-12-18T05:13:25.3941845Z [2022-12-18T05:08:46Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.status_code=429
2022-12-18T05:13:25.3942174Z [2022-12-18T05:08:46Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.user_agent=""
2022-12-18T05:13:25.3942478Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::decode] incoming chunked header: 0x138 (312 bytes)
2022-12-18T05:13:25.3942793Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::conn] incoming body completed
2022-12-18T05:13:25.3943218Z [2022-12-18T05:08:46Z DEBUG hyper::client::pool] pooling idle connection for ("https", artifactcache.actions.githubusercontent.com)
2022-12-18T05:13:25.3944403Z [2022-12-18T05:08:46Z DEBUG sccache::compiler::compiler] [aws_sdk_s3]: Cache write error: Cache service responded with 429 Too Many Requests: {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
2022-12-18T05:13:25.3944708Z [2022-12-18T05:08:46Z DEBUG sccache::server] [aws_sdk_s3]: Cache write finished in 21.388 s
2022-12-18T05:13:25.3945012Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::decode] incoming chunked header: 0x138 (312 bytes)
2022-12-18T05:13:25.3945338Z [2022-12-18T05:08:46Z DEBUG hyper::proto::h1::decode] incoming chunked header: 0x138 (312 bytes)

And

2022-12-18T05:13:25.3908847Z [2022-12-18T05:08:38Z DEBUG hyper::client::pool] pooling idle connection for ("https", artifactcache.actions.githubusercontent.com)
2022-12-18T05:13:25.3909181Z [2022-12-18T05:08:38Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.status_code=204
2022-12-18T05:13:25.3909491Z [2022-12-18T05:08:38Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.user_agent=""
2022-12-18T05:13:25.3909910Z [2022-12-18T05:08:38Z DEBUG sccache::compiler::compiler] [aws_config]: Stored in cache successfully!
2022-12-18T05:13:25.3910204Z [2022-12-18T05:08:38Z DEBUG sccache::server] [aws_config]: Cache write finished in 13.018 s
2022-12-18T05:13:25.3910535Z [2022-12-18T05:08:45Z DEBUG sccache::compiler::compiler] [aws_sdk_s3]: Compiled in 21.388 s, storing in cache
2022-12-18T05:13:25.3910931Z [2022-12-18T05:08:46Z INFO  gha_toolkit::cache] put; version="5c35548282b30b97a2314eadc6fc42b9b7585a12c4f9efa527ac2e48739babe0"
2022-12-18T05:13:25.3911660Z [2022-12-18T05:08:46Z INFO  gha_toolkit::cache] reserve; key="sccache-ubuntu-22.04-x86_64-unknown-linux-gnu-1.62.1-0300a5abb7bd36eb7e4aa0b971d5e94866c6e140" version="7c4ed3dc414c5c2d030ec00d8394e733c5dfeff8cdbae21becc2144de0bb9570" cache_size=21844888
2022-12-18T05:13:25.3912483Z [2022-12-18T05:08:46Z INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.method=POST http.scheme=https http.host=artifactcache.actions.githubusercontent.com net.host.port=0 otel.kind="client" otel.name=POST /LNWNXsg1z0Qt0Mn41fuRdbA1YzcaL5Z2Wf2AYmEQiLGLuQCklb/_apis/artifactcache/caches

I'm not sure how to interpret these logs. But multi-second values for Cache write finished in seem wrong: I would expect a cache write to effectively be an HTTP request and for inserts to take dozens to hundreds of milliseconds. But maybe the GitHub Actions cache latency is just really high?

I'm unsure what actions are needed for this issue. I'm unsure if there's any code-level bugs. Maybe GitHub Actions cache is just prohibitively slow? Maybe this slowness/behavior should be documented?

sylvestre commented 1 year ago

Salut @indygreg nice to see you here :)

@jakelee8 does it ring a bell?

sylvestre commented 1 year ago

More worrisome though is that some of the cache writes take 10+s. e.g.

I might be wrong but I think it also includes the compilation time. Side note: maybe we should have compilation time + cache time displayed in the logs

jakelee8 commented 1 year ago

The 429 Too Many Requests error is a legitimate issue. Do y'all think it would help to implement a cache write queue to limit parallelism?

From what I've seen, it looks like compile time is included in the write time. Cache reads are fast at 0.000 s. Here's an example of one of our internal builds:

sccache 0.3.1
Compile requests                    722
Compile requests executed           608
Cache hits                          539
Cache hits (C/C++)                  100
Cache hits (Rust)                   439
Cache misses                         65
Cache misses (C/C++)                 15
Cache misses (Rust)                  50
Cache timeouts                        0
Cache read errors                     0
Forced recaches                       0
Cache write errors                    0
Compilation failures                  4
Cache errors                          0
Non-cacheable compilations            0
Non-cacheable calls                 114
Non-compilation calls                 0
Unsupported compiler calls            0
Average cache write               1.379 s
Average cache read miss           1.379 s
Average cache read hit            0.000 s
Failed distributed compilations       0

Non-cacheable reasons:
crate-type                           83
unknown source language              24
-                                     4
-E                                    2
argument parse                        1
jakelee8 commented 1 year ago

Also, did you set the environmental variable CARGO_INCREMENTAL=false?

From the README:

Incrementally compiled crates cannot be cached. By default, in the debug profile Cargo will use incremental compilation for workspace members and path dependencies. You can disable incremental compilation.

https://github.com/mozilla/sccache#rust

glandium commented 1 year ago

More worrisome though is that some of the cache writes take 10+s. e.g.

FWIW, these logs are wrong. PR #1495 will hopefully fix that.

indygreg commented 1 year ago

I do not set CARGO_INCREMENTAL=false in these CI jobs. Maybe that explains some regressions in caching in recent months?!

(Also, I'm not receiving emails for updates in this repository for some reason. I suspect they may be getting routed to my defunct mozilla.com email. But I don't see any references to that email in my GitHub account. So unsure what's going on. Apologies if I'm tardy to reply - I probably didn't see the activity!)

rajivshah3 commented 1 year ago

I'm seeing similar issues with HTTP error 429, and can confirm that we're setting CARGO_INCREMENTAL=0:

2022-12-22:13: DEBUG reqwest::connect] starting new connection: https://artifactcache.actions.githubusercontent.com/
2022-12-22:13: DEBUG hyper::client::connect::dns] resolving host="artifactcache.actions.githubusercontent.com"
2022-12-22:13: DEBUG hyper::client::connect::http] connecting to 13.107.42.16:443
2022-12-22:13: DEBUG hyper::client::connect::http] connected to 13.107.42.16:443
2022-12-22:13: DEBUG hyper::proto::h1::io] flushed 2214 bytes
2022-12-22:13: DEBUG hyper::proto::h1::io] parsed 18 headers
2022-12-22:13: DEBUG hyper::proto::h1::conn] incoming body is chunked encoding
2022-12-22:13: DEBUG hyper::proto::h1::decode] incoming chunked header: 3 (3 bytes)
2022-12-22:13: WARN  reqwest_retry::middleware] Retry attempt #1. Sleeping 265.s before the next attempt
2022-12-22:13: DEBUG reqwest::connect] starting new connection: https://artifactcache.actions.githubusercontent.com/
2022-12-22:13: DEBUG hyper::client::connect::dns] resolving host="artifactcache.actions.githubusercontent.com"
2022-12-22:13: DEBUG hyper::client::connect::http] connecting to 13.107.42.16:443
2022-12-22:13: DEBUG hyper::client::connect::http] connected to 13.107.42.16:443
2022-12-22:13: DEBUG hyper::proto::h1::io] flushed 2214 bytes
2022-12-22:13: DEBUG hyper::proto::h1::io] parsed 18 headers
2022-12-22:13: DEBUG hyper::proto::h1::conn] incoming body is chunked encoding
2022-12-22:13: DEBUG hyper::proto::h1::decode] incoming chunked header: 3 (3 bytes)
2022-12-22:13: INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; otel.status_code="ERROR"
2022-12-22:13: INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.status_code=429
2022-12-22:13: INFO  reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.user_agent=""
2022-12-22:13: DEBUG hyper::proto::h1::decode] incoming chunked header: 138 (312 bytes)
2022-12-22:13: DEBUG hyper::proto::h1::decode] incoming chunked header: 138 (312 bytes)
2022-12-22:13: DEBUG hyper::proto::h1::decode] incoming chunked header: 138 (312 bytes)
2022-12-22:13: DEBUG hyper::proto::h1::conn] incoming body completed
2022-12-22:13: DEBUG hyper::client::pool] pooling idle connection for ("https", artifactcache.actions.githubusercontent.com)
2022-12-22:13: DEBUG sccache::compiler::compiler] [unicode_ident]: Cache write error: Cache service responded with 429 Too Many Requests: {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
2022-12-22:13: DEBUG sccache::server] [unicode_ident]: Cache write finished in 2.982 s

I'm surprised that it still reports that the cache write is finished though 🤔

sylvestre commented 1 year ago

@rajivshah3 in the future, please strip the color string, it makes the log harder to read ;)

For cache write, we don't need to be super fast. Maybe a queue would do. @Xuanwo do you think you could do that once you added the opendal support for GHA? (or as part of the change?)

Xuanwo commented 1 year ago

@Xuanwo do you think you could do that once you added the opendal support for GHA? (or as part of the change?)

Sure! I will add rate limit handling after implement https://github.com/datafuselabs/opendal/pull/1111

Xuanwo commented 1 year ago

https://github.com/datafuselabs/opendal/pull/1111 has passed all opendal's tests now! It will come into sccache soon.

sylvestre commented 1 year ago

@Xuanwo you are auto approving your patches in opendal ?

Xuanwo commented 1 year ago

@Xuanwo you are auto approving your patches in opendal ?

Yep, most PRs are merged by me after all check passed.

sylvestre commented 1 year ago

ok :( this is a bit surprising ...

Xuanwo commented 1 year ago

ok :( this is a bit surprising ...

I expect to improve this part while we have more maintainers~

sylvestre commented 1 year ago

yes, please do, this should not be done on important project

Xuanwo commented 1 year ago

Hi, @indygreg, I re-produced you problem with the v0.4.0-pre.5 release at https://github.com/Xuanwo/apple-platform-rs/actions/runs/3871176193/jobs/6598713731

The real problem here is ghac will return a very long Retry-After which could up to 60s. So if we follow this limit strictly, we will observe the surprisingly slow.

The full error message here:

sccache: error: Server startup failed: cache storage failed to write: Unexpected (permanent) at write => {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}

Context:
    response: Parts { status: 429, version: HTTP/1.1, headers: {"transfer-encoding": "chunked", "content-type": "application/json", "retry-after": "60", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "daad0d22-be5d-4106-9f45-205eb7eedac3", "activityid": "ec2715a2-3f3a-47e2-ac7a-efad9555e782", "x-tfs-session": "ec2715a2-3f3a-47e2-ac7a-efad9555e782", "x-vss-e2eid": "ec2715a2-3f3a-47e2-ac7a-efad9555e782", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Short", "x-ratelimit-limit": "758", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1673244953", "x-tfs-serviceerror": "Request+was+blocked+due+to+exceeding+usage+of+resource+%27Count%27+in+namespace+%27%27.", "x-cache": "CONFIG_NOCACHE", "x-msedge-ref": "Ref A: 489C21A91BE741E29D754583C16E0C7C Ref B: PAOEDGE0516 Ref C: 2023-01-09T06:11:06Z", "date": "Mon, 09 Jan 2023 06:11:06 GMT"} }
    called: Backend::ghac_reserve
    service: ghac
    path: .sccache_check

Returns a cache miss or give up this cache write maybe a better choice.

Xuanwo commented 1 year ago

Hi, @indygreg. Here are our updates on addressing this issue.

I have a test with my two patches: https://github.com/mozilla/sccache/pull/1557 & https://github.com/mozilla/sccache/pull/1550 on apple-platform-rs.

The biggest lesson I have learned from this is that sccache is a cache service, and we don't need to ensure every write/read succeeds. So instead of waiting to retry limited requests, we will skip this step and keep going.

Here are the differences:

Before

image

After

No failed build anymore

image

The action is here https://github.com/Xuanwo/apple-platform-rs/pull/1

indygreg commented 1 year ago

(I'm still not receiving email notifications from Mozilla orgs due to a GitHub bug involving an orphaned email forwarding rule leftover from when I worked at Mozilla.)

The updates here look great and I hope to test out the GHA backend again soon. The new times from the previous comment appear no worse than what I'm getting with 0.3.3 on S3 today. Considering GHA enables me to enable sccache on PRs without leaking my cloud provider credentials, this looks like a very compelling feature now! Thanks for all your work!

Xuanwo commented 1 year ago

@indygreg Please take v0.4.0-pre.6 a try~

rajivshah3 commented 1 year ago

Hi @Xuanwo , is there anything that can be done to improve this? On v0.4.0-pre.6 we're still seeing a lot of 429 errors in the logs (on both reads and writes) and there isn't much of a difference in compilation time (33 min uncached vs 29 min cached). The logs from sccache are uploaded as build artifacts if that would help. I wonder if it would be worth having an opt-in feature to try the failed writes again in the background once the compilation is done?

Xuanwo commented 1 year ago

On v0.4.0-pre.6 we're still seeing a lot of 429 errors in the logs (on both reads and writes)

This is the github's limit and I don't how to overcome them so far.

here isn't much of a difference in compilation time (33 min uncached vs 29 min cached).

Sadly, sccache can't cache the heaviest dep rocksdb in your project. Based on my own experience, a full release build of rocksdb will always take from 20 min to 30 min. Maybe you can try run tests in debug build instead of release build.

I wonder if it would be worth having an opt-in feature to try the failed writes again in the background once the compilation is done?

The time could be very long(

onbjerg commented 1 year ago

Hi! Did the caching behavior change a lot from 0.3.3 to 0.4.0-pre.7? Locally, running

cargo llvm-cov nextest --lcov --output-path lcov.info --locked --workspace --all-features --partition hash:1/3

Yields these stats:

Compile requests                   1359
Compile requests executed          1033
Cache hits                          791
Cache hits (C/C++)                   46
Cache hits (Rust)                   745
Cache misses                        238
Cache misses (C/C++)                  2
Cache misses (Rust)                 236
Cache timeouts                        0
Cache read errors                     0
Forced recaches                       0
Cache write errors                    0
Compilation failures                  4
Cache errors                          0
Non-cacheable compilations            0
Non-cacheable calls                 314
Non-compilation calls                12
Unsupported compiler calls            0
Average cache write               0.857 s
Average cache read miss           0.857 s
Average cache read hit            0.000 s
Failed distributed compilations       0

Non-cacheable reasons:
crate-type                          280
unknown source language              22
-                                     8
argument parse                        3
missing input                         1

Cache location                  Local disk: "/home/oliver/.cache/sccache"
Cache size                           10 GiB
Max cache size                       20 GiB

But on CI (see https://github.com/paradigmxyz/reth/pull/1355) these numbers are vastly different:

Compile requests                    733
Compile requests executed           537
Cache hits                          300
Cache hits (C/C++)                    4
Cache hits (Rust)                   296
Cache misses                        235
Cache misses (C/C++)                 20
Cache misses (Rust)                 215
Cache timeouts                        0
Cache read errors                     0
Forced recaches                       0
Cache write errors                    0
Compilation failures                  2
Cache errors                          0
Non-cacheable compilations            0
Non-cacheable calls                 189
Non-compilation calls                 7
Unsupported compiler calls            0
Average cache write               0.078 s
Average compiler                  0.947 s
Average cache read hit            0.000 s
Failed distributed compilations       0

Non-cacheable reasons:
crate-type                          168
unknown source language              11
-                                     6
argument parse                        3
missing input                         1

Cache location                  ghac, name: sccache-v0.4.0-pre.7-sccache, prefix: /sccache/
Version (client)                0.4.0-pre.7

The cache seems to be populated, writes and reads are reported as pretty fast, but the build is a lot slower than using rust-cache - some of our jobs went from 5 mins build time to 12-13 mins (see main CI).

I based the workflow changes off of https://github.com/Xuanwo/apple-platform-rs/pull/1

Xuanwo commented 1 year ago

Hi! Did the caching behavior change a lot from 0.3.3 to 0.4.0-pre.7? Locally, running

I have tried this command over sccache v0.3.3 but get:

Stopping sccache server...
Compile requests                    733
Compile requests executed           537
Cache hits                            0
Cache misses                        535
Cache misses (C/C++)                 24
Cache misses (Rust)                 511
Cache timeouts                        0
Cache read errors                     0
Forced recaches                       0
Cache write errors                    0
Compilation failures                  2
Cache errors                          0
Non-cacheable compilations            0
Non-cacheable calls                 189
Non-compilation calls                 7
Unsupported compiler calls            0
Average cache write               0.776 s
Average cache read miss           0.776 s
Average cache read hit            0.000 s
Failed distributed compilations       0

Non-cacheable reasons:
crate-type                          168
unknown source language              11
-                                     6
argument parse                        3
missing input                         1

Cache location                  Local disk: "/home/xuanwo/.cache/sccache"
Cache size                          680 MiB
Max cache size                       10 GiB

The compile requests is the same. Are you running the same command? By the way, maybe it's better to build tests with sccache, upload to artifacts, and than running in different hash part.

onbjerg commented 1 year ago

@Xuanwo Unfortunately using artifacts is a lot slower since the artifact upload speed is abysmal :( The artifact upload takes as long (or longer) as the build does.

I am running the same command locally, I'll try a fresh clone, maybe something is messed with my local copy of the repository I am building

Do you know/have any pointers on what could make sccache slower than rust-cache for the linked PR?

Xuanwo commented 1 year ago

Do you know/have any pointers on what could make sccache slower than rust-cache for the linked PR?

No ideas so far. Maybe we are hitting the github action cache's rate limit.

onbjerg commented 1 year ago

I thought so too, but sccace --show-stats does not show any read or write errors. Maybe they are not reported for GHAC?