indygreg opened this issue 1 year ago
Salut @indygreg nice to see you here :)
@jakelee8 does it ring a bell?
> More worrisome though is that some of the cache writes take 10+s. e.g.
I might be wrong but I think it also includes the compilation time. Side note: maybe we should have compilation time + cache time displayed in the logs
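To illustrate that side note, here is a minimal sketch of how the two durations could be measured and reported separately. This is not sccache's actual code; `run_compiler`, `write_to_cache`, and the log line are invented placeholders.

```rust
// Illustrative sketch only (the real sccache code paths differ): split the
// measurement so logs can report compile time and cache-write time separately.
use std::time::Instant;

async fn compile_and_cache() {
    let compile_start = Instant::now();
    run_compiler().await;
    let compile_time = compile_start.elapsed();

    let write_start = Instant::now();
    write_to_cache().await;
    let write_time = write_start.elapsed();

    log::debug!(
        "compiled in {:.3} s, cache write finished in {:.3} s",
        compile_time.as_secs_f64(),
        write_time.as_secs_f64()
    );
}

async fn run_compiler() { /* placeholder for the rustc/clang invocation */ }
async fn write_to_cache() { /* placeholder for the cache PUT */ }
```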
The 429 Too Many Requests error is a legitimate issue. Do y'all think it would help to implement a cache write queue to limit parallelism?
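As a rough illustration of what such a queue could look like (purely a sketch with invented names such as `BoundedCacheWriter` and `upload_to_cache`, not sccache's real code), a semaphore-bounded writer would cap how many cache uploads run at once:

```rust
// Hypothetical sketch: cap the number of in-flight cache writes so bursts of
// parallel rustc invocations don't trip GHAC's request-count rate limit.
use std::sync::Arc;
use tokio::sync::Semaphore;

#[derive(Clone)]
struct BoundedCacheWriter {
    permits: Arc<Semaphore>,
}

impl BoundedCacheWriter {
    fn new(max_in_flight: usize) -> Self {
        Self { permits: Arc::new(Semaphore::new(max_in_flight)) }
    }

    async fn write(&self, key: String, payload: Vec<u8>) -> anyhow::Result<()> {
        // Waits here (asynchronously) once `max_in_flight` writes are running.
        let _permit = self.permits.acquire().await?;
        upload_to_cache(&key, &payload).await
    }
}

// Placeholder for the real HTTP PUT to the cache backend.
async fn upload_to_cache(_key: &str, _payload: &[u8]) -> anyhow::Result<()> {
    Ok(())
}
```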
From what I've seen, it looks like compile time is included in the write time. Cache reads are fast at 0.000 s. Here's an example of one of our internal builds:
sccache 0.3.1
Compile requests 722
Compile requests executed 608
Cache hits 539
Cache hits (C/C++) 100
Cache hits (Rust) 439
Cache misses 65
Cache misses (C/C++) 15
Cache misses (Rust) 50
Cache timeouts 0
Cache read errors 0
Forced recaches 0
Cache write errors 0
Compilation failures 4
Cache errors 0
Non-cacheable compilations 0
Non-cacheable calls 114
Non-compilation calls 0
Unsupported compiler calls 0
Average cache write 1.379 s
Average cache read miss 1.379 s
Average cache read hit 0.000 s
Failed distributed compilations 0
Non-cacheable reasons:
crate-type 83
unknown source language 24
- 4
-E 2
argument parse 1
Also, did you set the environment variable CARGO_INCREMENTAL=false?
From the README:
> Incrementally compiled crates cannot be cached. By default, in the debug profile Cargo will use incremental compilation for workspace members and path dependencies. You can disable incremental compilation.
> More worrisome though is that some of the cache writes take 10+s. e.g.
FWIW, these logs are wrong. PR #1495 will hopefully fix that.
I do not set CARGO_INCREMENTAL=false in these CI jobs. Maybe that explains some regressions in caching in recent months?!
(Also, I'm not receiving emails for updates in this repository for some reason. I suspect they may be getting routed to my defunct mozilla.com email. But I don't see any references to that email in my GitHub account. So unsure what's going on. Apologies if I'm tardy to reply - I probably didn't see the activity!)
I'm seeing similar issues with HTTP error 429, and can confirm that we're setting CARGO_INCREMENTAL=0:
2022-12-22:13: DEBUG reqwest::connect] starting new connection: https://artifactcache.actions.githubusercontent.com/
2022-12-22:13: DEBUG hyper::client::connect::dns] resolving host="artifactcache.actions.githubusercontent.com"
2022-12-22:13: DEBUG hyper::client::connect::http] connecting to 13.107.42.16:443
2022-12-22:13: DEBUG hyper::client::connect::http] connected to 13.107.42.16:443
2022-12-22:13: DEBUG hyper::proto::h1::io] flushed 2214 bytes
2022-12-22:13: DEBUG hyper::proto::h1::io] parsed 18 headers
2022-12-22:13: DEBUG hyper::proto::h1::conn] incoming body is chunked encoding
2022-12-22:13: DEBUG hyper::proto::h1::decode] incoming chunked header: 3 (3 bytes)
2022-12-22:13: WARN reqwest_retry::middleware] Retry attempt #1. Sleeping 265.s before the next attempt
2022-12-22:13: DEBUG reqwest::connect] starting new connection: https://artifactcache.actions.githubusercontent.com/
2022-12-22:13: DEBUG hyper::client::connect::dns] resolving host="artifactcache.actions.githubusercontent.com"
2022-12-22:13: DEBUG hyper::client::connect::http] connecting to 13.107.42.16:443
2022-12-22:13: DEBUG hyper::client::connect::http] connected to 13.107.42.16:443
2022-12-22:13: DEBUG hyper::proto::h1::io] flushed 2214 bytes
2022-12-22:13: DEBUG hyper::proto::h1::io] parsed 18 headers
2022-12-22:13: DEBUG hyper::proto::h1::conn] incoming body is chunked encoding
2022-12-22:13: DEBUG hyper::proto::h1::decode] incoming chunked header: 3 (3 bytes)
2022-12-22:13: INFO reqwest_tracing::reqwest_otel_span_builder] HTTP request; otel.status_code="ERROR"
2022-12-22:13: INFO reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.status_code=429
2022-12-22:13: INFO reqwest_tracing::reqwest_otel_span_builder] HTTP request; http.user_agent=""
2022-12-22:13: DEBUG hyper::proto::h1::decode] incoming chunked header: 138 (312 bytes)
2022-12-22:13: DEBUG hyper::proto::h1::decode] incoming chunked header: 138 (312 bytes)
2022-12-22:13: DEBUG hyper::proto::h1::decode] incoming chunked header: 138 (312 bytes)
2022-12-22:13: DEBUG hyper::proto::h1::conn] incoming body completed
2022-12-22:13: DEBUG hyper::client::pool] pooling idle connection for ("https", artifactcache.actions.githubusercontent.com)
2022-12-22:13: DEBUG sccache::compiler::compiler] [unicode_ident]: Cache write error: Cache service responded with 429 Too Many Requests: {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
2022-12-22:13: DEBUG sccache::server] [unicode_ident]: Cache write finished in 2.982 s
I'm surprised that it still reports that the cache write is finished though 🤔
@rajivshah3 in the future, please strip the color codes; they make the log harder to read ;)
For cache writes, we don't need to be super fast. Maybe a queue would do. @Xuanwo do you think you could do that once you have added the opendal support for GHA? (or as part of that change?)
> @Xuanwo do you think you could do that once you have added the opendal support for GHA? (or as part of that change?)
Sure! I will add rate limit handling after implementing https://github.com/datafuselabs/opendal/pull/1111
https://github.com/datafuselabs/opendal/pull/1111 has passed all of opendal's tests now! It will land in sccache soon.
@Xuanwo are you auto-approving your patches in opendal?
> @Xuanwo are you auto-approving your patches in opendal?
Yep, most PRs are merged by me after all checks pass.
ok :( this is a bit surprising ...
> ok :( this is a bit surprising ...
I expect to improve this part once we have more maintainers~
yes, please do, this should not be done on an important project
Hi, @indygreg, I reproduced your problem with the v0.4.0-pre.5 release at https://github.com/Xuanwo/apple-platform-rs/actions/runs/3871176193/jobs/6598713731
The real problem here is that GHAC returns a very long Retry-After, which can be up to 60 s. So if we follow this limit strictly, we observe surprisingly slow builds.
The full error message is:
sccache: error: Server startup failed: cache storage failed to write: Unexpected (permanent) at write => {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
Context:
response: Parts { status: 429, version: HTTP/1.1, headers: {"transfer-encoding": "chunked", "content-type": "application/json", "retry-after": "60", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "daad0d22-be5d-4106-9f45-205eb7eedac3", "activityid": "ec2715a2-3f3a-47e2-ac7a-efad9555e782", "x-tfs-session": "ec2715a2-3f3a-47e2-ac7a-efad9555e782", "x-vss-e2eid": "ec2715a2-3f3a-47e2-ac7a-efad9555e782", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Short", "x-ratelimit-limit": "758", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1673244953", "x-tfs-serviceerror": "Request+was+blocked+due+to+exceeding+usage+of+resource+%27Count%27+in+namespace+%27%27.", "x-cache": "CONFIG_NOCACHE", "x-msedge-ref": "Ref A: 489C21A91BE741E29D754583C16E0C7C Ref B: PAOEDGE0516 Ref C: 2023-01-09T06:11:06Z", "date": "Mon, 09 Jan 2023 06:11:06 GMT"} }
called: Backend::ghac_reserve
service: ghac
path: .sccache_check
Returning a cache miss or giving up on this cache write may be a better choice.
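A minimal sketch of that idea (illustrative only; `read_cache_entry` and the surrounding plumbing are invented, not sccache's or opendal's actual API): on a 429, don't sleep for the Retry-After, just report a miss and let the compiler run.

```rust
// Illustrative sketch: treat a rate-limited GHAC response as a plain cache
// miss instead of honoring a Retry-After that can be as long as 60 s.
use reqwest::StatusCode;

async fn read_cache_entry(client: &reqwest::Client, url: &str) -> Option<Vec<u8>> {
    let resp = client.get(url).send().await.ok()?;
    match resp.status() {
        StatusCode::OK => resp.bytes().await.ok().map(|b| b.to_vec()),
        // 429: give up immediately; correctness never depends on the cache,
        // only speed does.
        StatusCode::TOO_MANY_REQUESTS => None,
        _ => None,
    }
}
```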
Hi, @indygreg. Here are our updates on addressing this issue.
I ran a test with my two patches, https://github.com/mozilla/sccache/pull/1557 & https://github.com/mozilla/sccache/pull/1550, on apple-platform-rs.
The biggest lesson I have learned from this is that sccache is a cache service, and we don't need to ensure every write/read succeeds. So instead of waiting to retry rate-limited requests, we skip this step and keep going.
Here are the differences:
No failed builds anymore
The action is here https://github.com/Xuanwo/apple-platform-rs/pull/1
(I'm still not receiving email notifications from Mozilla orgs due to a GitHub bug involving an orphaned email forwarding rule leftover from when I worked at Mozilla.)
The updates here look great and I hope to test out the GHA backend again soon. The new times from the previous comment appear no worse than what I'm getting with 0.3.3 on S3 today. Considering GHA lets me enable sccache on PRs without leaking my cloud provider credentials, this looks like a very compelling feature now! Thanks for all your work!
@indygreg Please give v0.4.0-pre.6 a try~
Hi @Xuanwo, is there anything that can be done to improve this? On v0.4.0-pre.6 we're still seeing a lot of 429 errors in the logs (on both reads and writes) and there isn't much of a difference in compilation time (33 min uncached vs 29 min cached). The logs from sccache are uploaded as build artifacts if that would help. I wonder if it would be worth having an opt-in feature to try the failed writes again in the background once the compilation is done?
> On v0.4.0-pre.6 we're still seeing a lot of 429 errors in the logs (on both reads and writes)
This is GitHub's limit, and I don't know how to overcome it so far.
> there isn't much of a difference in compilation time (33 min uncached vs 29 min cached).
Sadly, sccache can't cache the heaviest dependency in your project, rocksdb. Based on my own experience, a full release build of rocksdb always takes 20 to 30 min. Maybe you can try running tests in a debug build instead of a release build.
> I wonder if it would be worth having an opt-in feature to try the failed writes again in the background once the compilation is done?
The time could be very long :(
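For what it's worth, a very rough sketch of that opt-in idea (names like `PendingWrite`, `flush_pending`, and `upload_to_cache` are made up here, not an sccache API) could queue rejected writes and retry each of them once, best-effort, after the build finishes:

```rust
// Hypothetical sketch: collect writes that got a 429 during the build and
// retry each of them once after compilation is done; anything that still
// fails is simply dropped, since the cache is best-effort anyway.
struct PendingWrite {
    key: String,
    payload: Vec<u8>,
}

async fn flush_pending(pending: Vec<PendingWrite>) {
    for write in pending {
        // One best-effort retry per entry; ignore the result.
        let _ = upload_to_cache(&write.key, &write.payload).await;
    }
}

// Placeholder for the real cache PUT.
async fn upload_to_cache(_key: &str, _payload: &[u8]) -> Result<(), ()> {
    Ok(())
}
```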
Hi! Did the caching behavior change a lot from 0.3.3 to 0.4.0-pre.7? Locally, running
cargo llvm-cov nextest --lcov --output-path lcov.info --locked --workspace --all-features --partition hash:1/3
yields these stats:
Compile requests 1359
Compile requests executed 1033
Cache hits 791
Cache hits (C/C++) 46
Cache hits (Rust) 745
Cache misses 238
Cache misses (C/C++) 2
Cache misses (Rust) 236
Cache timeouts 0
Cache read errors 0
Forced recaches 0
Cache write errors 0
Compilation failures 4
Cache errors 0
Non-cacheable compilations 0
Non-cacheable calls 314
Non-compilation calls 12
Unsupported compiler calls 0
Average cache write 0.857 s
Average cache read miss 0.857 s
Average cache read hit 0.000 s
Failed distributed compilations 0
Non-cacheable reasons:
crate-type 280
unknown source language 22
- 8
argument parse 3
missing input 1
Cache location Local disk: "/home/oliver/.cache/sccache"
Cache size 10 GiB
Max cache size 20 GiB
But on CI (see https://github.com/paradigmxyz/reth/pull/1355) these numbers are vastly different:
Compile requests 733
Compile requests executed 537
Cache hits 300
Cache hits (C/C++) 4
Cache hits (Rust) 296
Cache misses 235
Cache misses (C/C++) 20
Cache misses (Rust) 215
Cache timeouts 0
Cache read errors 0
Forced recaches 0
Cache write errors 0
Compilation failures 2
Cache errors 0
Non-cacheable compilations 0
Non-cacheable calls 189
Non-compilation calls 7
Unsupported compiler calls 0
Average cache write 0.078 s
Average compiler 0.947 s
Average cache read hit 0.000 s
Failed distributed compilations 0
Non-cacheable reasons:
crate-type 168
unknown source language 11
- 6
argument parse 3
missing input 1
Cache location ghac, name: sccache-v0.4.0-pre.7-sccache, prefix: /sccache/
Version (client) 0.4.0-pre.7
The cache seems to be populated, and writes and reads are reported as pretty fast, but the build is a lot slower than with rust-cache: some of our jobs went from a 5 min build time to 12-13 min (see main CI).
I based the workflow changes off of https://github.com/Xuanwo/apple-platform-rs/pull/1
> Hi! Did the caching behavior change a lot from 0.3.3 to 0.4.0-pre.7? Locally, running
I have tried this command with sccache v0.3.3 but get:
Stopping sccache server...
Compile requests 733
Compile requests executed 537
Cache hits 0
Cache misses 535
Cache misses (C/C++) 24
Cache misses (Rust) 511
Cache timeouts 0
Cache read errors 0
Forced recaches 0
Cache write errors 0
Compilation failures 2
Cache errors 0
Non-cacheable compilations 0
Non-cacheable calls 189
Non-compilation calls 7
Unsupported compiler calls 0
Average cache write 0.776 s
Average cache read miss 0.776 s
Average cache read hit 0.000 s
Failed distributed compilations 0
Non-cacheable reasons:
crate-type 168
unknown source language 11
- 6
argument parse 3
missing input 1
Cache location Local disk: "/home/xuanwo/.cache/sccache"
Cache size 680 MiB
Max cache size 10 GiB
The compile request count is the same. Are you running the same command? By the way, maybe it's better to build the tests with sccache, upload them as artifacts, and then run each hash partition separately.
@Xuanwo Unfortunately using artifacts is a lot slower since the artifact upload speed is abysmal :( The artifact upload takes as long as (or longer than) the build does.
I am running the same command locally. I'll try a fresh clone; maybe something is messed up with my local copy of the repository I am building.
Do you know/have any pointers on what could make sccache slower than rust-cache for the linked PR?
> Do you know/have any pointers on what could make sccache slower than rust-cache for the linked PR?
No ideas so far. Maybe we are hitting the GitHub Actions cache's rate limit.
I thought so too, but sccache --show-stats does not show any read or write errors. Maybe they are not reported for GHAC?
I was excited to see the introduction of the GitHub Actions cache backend. However, when I tried it on a test branch, the results were surprisingly slow.
See the initial run at https://github.com/indygreg/apple-platform-rs/actions/runs/3723171428/jobs/6314513850. Here, a build job took ~1 hour. (It normally takes ~10m.) The Stop sccache step displays sccache server stats. This particular job shows 44 cache errors. Unsure if that is possibly the cause of problems?

I did another CI run where I cranked up logging to debug and printed the log file after the build. See e.g. https://github.com/indygreg/apple-platform-rs/actions/runs/3723354206/jobs/6314814370 for a slow build. Interestingly, some of the jobs in this run were fast! See https://github.com/indygreg/apple-platform-rs/actions/runs/3723354206/jobs/6314814955 for a fast one. Spot checking the logs, I think there is a positive correlation between the cache errors count and job time.

Poking around the logs, I see occurrences of
DEBUG sccache::compiler::compiler] [aws_http]: Cache write error: Cache service responded with 429 Too Many Requests: {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
So I guess my project is too big to use GitHub Actions cache???

More worrisome though is that some of the cache writes take 10+s. e.g.

And

I'm not sure how to interpret these logs. But multi-second values for Cache write finished in seem wrong: I would expect a cache write to effectively be an HTTP request and for inserts to take dozens to hundreds of milliseconds. But maybe the GitHub Actions cache latency is just really high?

I'm unsure what actions are needed for this issue. I'm unsure if there are any code-level bugs. Maybe GitHub Actions cache is just prohibitively slow? Maybe this slowness/behavior should be documented?