blunderbussatya opened 8 months ago
Yes, we can include an option to bypass the storage check if users verify their setup is accurate.
I'm willing to provide mentorship for this issue.
Implement an option to bypass the storage check if users verify their setup is accurate.
@xuanwo is willing to offer mentorship to help address this issue effectively. @xuanwo promises to respond to your questions regarding the implementation of this feature within 24 hours, excluding weekends.
After finishing this issue, you will
I would be interested!
My contact info is at cceckman.com.
I would be interested!
Thank you very much for your interest! Mails coming in.
Some notes from investigation & discussion with @Xuanwo so far:
The check happens in `Storage::check`; the issue comes in the `impl Storage for opendal::Operator`, where an object is written to test access to the bucket. The task is to bypass that check: avoid the `Storage::check` call entirely, if appropriately configured.
"Appropriately configured" is not just e.g. GCS_RW_MODE
(which is GCS-only anyway), but a new option.
The option will need to explicitly state "RO" or "RW", since the `check` output is explicitly used. Effectively, the option takes the place of the `check` call.
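A minimal sketch of what such an option might look like, assuming the hypothetical `SCCACHE_ASSUME_CACHE_RW_MODE` variable with values `RO`/`RW` floated later in this thread (the name and the `CacheMode` type here are placeholders, not sccache's actual API):

```rust
use std::env;

// Hypothetical variable name; the final spelling was still being bikeshedded.
const ASSUME_MODE_VAR: &str = "SCCACHE_ASSUME_CACHE_RW_MODE";

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CacheMode {
    ReadOnly,
    ReadWrite,
}

// Parse an explicit "RO"/"RW" declaration; anything else means
// "not declared", i.e. fall back to the live storage check.
fn parse_assumed_mode(raw: &str) -> Option<CacheMode> {
    match raw {
        "RO" => Some(CacheMode::ReadOnly),
        "RW" => Some(CacheMode::ReadWrite),
        _ => None,
    }
}

// Read the declaration from the environment, if present.
fn assumed_mode() -> Option<CacheMode> {
    parse_assumed_mode(&env::var(ASSUME_MODE_VAR).ok()?)
}
```

Returning `None` for anything unrecognized keeps the existing `check` behavior as the default, so the option is strictly opt-in.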
The next steps I'm looking at:

- Poke into the configuration code to see how to add a new option. (Seems fairly straightforward, though bikeshedding is always fun: `SCCACHE_ASSUME_CACHE_RW_MODE=(RO,RW)`?)
- Build a reproduction harness that has at least a high probability of hitting the limit. I have GCS access I can use, though I'm not sure what the current test setup is for e.g. CI tests; I see there are some examples in `test/` that I'll look at, to try to make something that can be integrated.
- Get a positive repro, and (several) negative repros with the new option.
`config.rs` defines the `CacheModeConfig` type, which is `Serialize`, `Deserialize`, and `Into<CacheMode>`.

So I think we want an `Option<CacheModeConfig>` somewhere in `Config` that the server pulls on startup, falling back to the result of `.check()` if the option is `None`.
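That fallback is a one-liner; a sketch of the intended resolution logic (the names here are assumptions, not sccache's actual types):

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CacheMode {
    ReadOnly,
    ReadWrite,
}

// If the config declares a mode, trust it and skip the storage probe
// entirely; otherwise run the (possibly rate-limited) check against
// the bucket, exactly as before.
fn resolve_mode(declared: Option<CacheMode>, check: impl FnOnce() -> CacheMode) -> CacheMode {
    declared.unwrap_or_else(check)
}
```

Taking `check` as a closure means the probe's write to `.sccache_check` never happens when a mode is declared, which is the whole point of the option.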
What I'm unclear on is where to land this. The `Config` type has `cache: Option<CacheType>` alongside e.g. `dist: DistConfig`; `CacheType` is an enum over the various possible backends. There's not really a place to hang an "apply to all backends" flag other than the top level...

But I guess if I think of it as something like `server_startup_timeout`, i.e. a setting on server startup rather than on the cache, then it makes sense to put it "at the same level".
> There's not really a place to hang an "apply to all backends" flag other than the top level...
Yes, I think placing it along with `server_startup_timeout` looks better.
OK - I have been able to ~manually reproduce this issue, and confirm that #2133 is a fix. I don't think the test is quite production-quality though.
This test script configures sccache to (a) use my personal GCS bucket for testing, and (b) `ASSUME_RW_MODE=READ_WRITE`. It invokes this test case, which launches 100 sccache servers concurrently and runs `show-stats` on all of them.
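To make the failure mode concrete: GCS enforces roughly one mutation per second on any single object, so 100 servers all writing `.sccache_check` at startup will mostly receive 429s. A toy model of that quota rule (a deterministic simulation, not real GCS behavior):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Toy model of GCS's per-object mutation quota: at most one
// create/update/delete of the same object per wall-clock second.
struct ObjectQuota {
    last_write_sec: AtomicU64,
}

impl ObjectQuota {
    fn new() -> Self {
        // Sentinel meaning "never written".
        Self { last_write_sec: AtomicU64::new(u64::MAX) }
    }

    // Record a write attempt at `now_sec`; reject it if another write
    // already landed in the same second.
    fn try_write(&self, now_sec: u64) -> Result<(), &'static str> {
        let last = self.last_write_sec.swap(now_sec, Ordering::SeqCst);
        if last == now_sec {
            Err("429 rateLimitExceeded")
        } else {
            Ok(())
        }
    }
}
```

Under this model, 100 near-simultaneous startup checks against the same object let exactly one through per second; with the declared-mode option, none of the startups touch the object at all.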
When I ran the tests without my changes patched in, I saw this error:
```
storage write check failed: Unexpected (permanent) at Writer::close => GcsErrorResponse { error: GcsError { code: 429, message: "The object sccache-dev.cceckman.com/.sccache_check exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate. See https://cloud.google.com/storage/docs/gcs429.", errors: [GcsErrorDetail { domain: "usageLimits", location: "", location_type: "", message: "The object sccache-dev.cceckman.com/.sccache_check exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate. See https://cloud.google.com/storage/docs/gcs429.", reason: "rateLimitExceeded" }] } }
Context:
uri: https://storage.googleapis.com/upload/storage/v1/b/sccache-dev.cceckman.com/o?uploadType=media&name=.sccache_check
response: Parts { status: 429, version: HTTP/1.1, headers: {"x-guploader-uploadid": "ABPtcPqbL3v50VAMSqL6_ZjFxj2uDeoynIihcV-rphUg4aUrgdfKA-loIRJpUvbsJk84mL5ysnY", "content-type": "application/json; charset=UTF-8", "date": "Mon, 18 Mar 2024 20:41:27 GMT", "vary": "Origin", "vary": "X-Origin", "cache-control": "no-cache, no-store, max-age=0, must-revalidate", "expires": "Mon, 01 Jan 1990 00:00:00 GMT", "pragma": "no-cache", "content-length": "625", "server": "UploadServer", "alt-svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000"} }
service: gcs
path: .sccache_check
```
This appears to be from the server's stderr, which I have configured to stream to the same output. This message disappeared with my patch and with the `ASSUME_RW_MODE` flag in the environment.
The test itself didn't fail, though, which means this isn't a regression test. I want to:

- Figure out how to detect the failure: does it affect the `show-stats` call? Will we get a nonzero server exit status? ...can we do something other than just reading `stderr`?
- Figure out which bucket for `storage_check_gcs.sh` to use... hopefully there's an existing GCS bucket, service account / workload identity, etc. for CI workflows?

cc @satyajeet104, would you like to give this PR https://github.com/mozilla/sccache/pull/2133 a try and give us some feedback?
Thanks a lot guys, looks good to me!
As far as I can tell from `cd .github; rg SCCACHE_GCS`, there's no CI testing of the GCS integration. My tests so far are against a personal bucket; if there's a Mozilla org, bucket, etc. with appropriate custodianship, then I'd integrate with that. @Xuanwo @sylvestre Do you know if any such thing exists?
Otherwise, the test strategy will have to be based on "some other storage" which can force the error. "Force the error" would be... error if `.sccache_check` is accessed but allow other ops through normally? (Getting really fancy with FUSE....)
Or, I guess, check that it assumes "write" mode on a read-only filesystem storage, and check that it fails when it "needs" to do a write for something. My next test: if I point `--zero-stats` at a storage where the user doesn't have write permissions, what happens? Would that let us distinguish the `ASSUME_RW` case ~cheaply (without a complicated harness)?
I'm admittedly getting a little lost in working out how to reliably test this >.<
> Otherwise, the test strategy will have to be based on "some other storage" which can force the error. "Force the error" would be... error if `.sccache_check` is accessed but allow other ops through normally? (Getting really fancy with FUSE....)
We can test against S3 storage services. They should work the same.
> My next test: if I point `--zero-stats` at a storage where the user doesn't have write permissions, what happens?
`zero-stats` just cleans up the in-memory stats; it makes no changes to storage.
We're using sccache v0.7.6. We have a lot of builds running in parallel, and each run invokes sccache, which in turn tries to do an RW check by reading and writing the `.sccache_check` dummy file. While this is fine for reading, for writing we violate the quota on GCS, which allows writes to the same object only once per second.

Is there a way to bypass this check? We can perhaps create an env var to support this use case?