blunderbussatya opened 8 months ago
Yes, we can include an option to bypass the storage check if users verify their setup is accurate.
I'm willing to provide mentorship for this issue.
Implement an option to bypass the storage check if users verify their setup is accurate.
@xuanwo is willing to offer mentorship to help address this issue effectively. @xuanwo promises to respond to your questions regarding the implementation of this feature within 24 hours, excluding weekends.
After finishing this issue, you will
I would be interested!
My contact info is at cceckman.com.
I would be interested!
Thank you very much for your interest! Mails coming in.
Some notes from investigation & discussion with @Xuanwo so far:
The check happens in `Storage::check`; the issue comes in the `impl Storage for opendal::Operator`, where an object is written to test access to the bucket. The task is to bypass that check: avoid the `Storage::check` call entirely, if appropriately configured.
"Appropriately configured" is not just e.g. GCS_RW_MODE
(which is GCS-only anyway), but a new option.
The option will need to explicitly state "RO" or "RW", since the `check` output is explicitly used. Effectively, the option takes the place of the `check` call.
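A minimal sketch of what such an option might look like, assuming the hypothetical `SCCACHE_ASSUME_CACHE_RW_MODE` variable with values `RO`/`RW` floated later in this thread (the name and the `CacheMode` type here are placeholders, not sccache's actual API):

```rust
use std::env;

// Hypothetical variable name; the final spelling was still being bikeshedded.
const ASSUME_MODE_VAR: &str = "SCCACHE_ASSUME_CACHE_RW_MODE";

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CacheMode {
    ReadOnly,
    ReadWrite,
}

// Parse an explicit "RO"/"RW" declaration; anything else means
// "not declared", i.e. fall back to the live storage check.
fn parse_assumed_mode(raw: &str) -> Option<CacheMode> {
    match raw {
        "RO" => Some(CacheMode::ReadOnly),
        "RW" => Some(CacheMode::ReadWrite),
        _ => None,
    }
}

// Read the declaration from the environment, if present.
fn assumed_mode() -> Option<CacheMode> {
    parse_assumed_mode(&env::var(ASSUME_MODE_VAR).ok()?)
}
```

Returning `None` for anything unrecognized keeps the existing `check` behavior as the default, so the option is strictly opt-in.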
The next steps I'm looking at:

- Poke into the configuration code to see how to add a new option. (Seems fairly straightforward, though bikeshedding is always fun: `SCCACHE_ASSUME_CACHE_RW_MODE=(RO,RW)`?)
- Build a reproduction harness that has at least a high probability of hitting the limit. I have GCS access I can use, though I'm not sure what the current test setup is for e.g. CI tests; I see there are some examples in `test/` that I'll look at, to try to make something that can be integrated.
- Get a positive repro, and (several) negative repros with the new option.
`config.rs` defines the `CacheModeConfig` type, which is `Serialize`, `Deserialize`, and `Into<CacheMode>`.

So I think we want an `Option<CacheModeConfig>` somewhere in `Config` that the server pulls on startup, falling back to the result of `.check()` if the option is `None`.
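That fallback is a one-liner; a sketch of the intended resolution logic (the names here are assumptions, not sccache's actual types):

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CacheMode {
    ReadOnly,
    ReadWrite,
}

// If the config declares a mode, trust it and skip the storage probe
// entirely; otherwise run the (possibly rate-limited) check against
// the bucket, exactly as before.
fn resolve_mode(declared: Option<CacheMode>, check: impl FnOnce() -> CacheMode) -> CacheMode {
    declared.unwrap_or_else(check)
}
```

Taking `check` as a closure means the probe's write to `.sccache_check` never happens when a mode is declared, which is the whole point of the option.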
What I'm unclear on is where to land this. The `Config` type has `cache: Option<CacheType>` alongside e.g. `dist: DistConfig`; `CacheType` is an enum over the various possible backends. There's not really a place to hang an "apply to all backends" flag other than the top level...

But I guess if I think of it as something like `server_startup_timeout`, i.e. a setting on server startup rather than on the cache, then it makes sense to put it "at the same level".
> There's not really a place to hang an "apply to all backends" flag other than the top level...
Yes, I think placing it along with `server_startup_timeout` looks better.
OK - I have been able to ~manually reproduce this issue, and confirm that #2133 is a fix. I don't think the test is quite production-quality though.
This test script configures sccache to (a) use my personal GCS bucket for testing, and (b) `ASSUME_RW_MODE=READ_WRITE`. It invokes this test case, which launches 100 sccache servers concurrently and runs `show-stats` on all of them.
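To make the failure mode concrete: GCS enforces roughly one mutation per second on any single object, so 100 servers all writing `.sccache_check` at startup will mostly receive 429s. A toy model of that quota rule (a deterministic simulation, not real GCS behavior):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Toy model of GCS's per-object mutation quota: at most one
// create/update/delete of the same object per wall-clock second.
struct ObjectQuota {
    last_write_sec: AtomicU64,
}

impl ObjectQuota {
    fn new() -> Self {
        // Sentinel meaning "never written".
        Self { last_write_sec: AtomicU64::new(u64::MAX) }
    }

    // Record a write attempt at `now_sec`; reject it if another write
    // already landed in the same second.
    fn try_write(&self, now_sec: u64) -> Result<(), &'static str> {
        let last = self.last_write_sec.swap(now_sec, Ordering::SeqCst);
        if last == now_sec {
            Err("429 rateLimitExceeded")
        } else {
            Ok(())
        }
    }
}
```

Under this model, 100 near-simultaneous startup checks against the same object let exactly one through per second; with the declared-mode option, none of the startups touch the object at all.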
When I ran the tests without my changes patched in, I saw this error:
```
storage write check failed: Unexpected (permanent) at Writer::close => GcsErrorResponse { error: GcsError { code: 429, message: "The object sccache-dev.cceckman.com/.sccache_check exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate. See https://cloud.google.com/storage/docs/gcs429.", errors: [GcsErrorDetail { domain: "usageLimits", location: "", location_type: "", message: "The object sccache-dev.cceckman.com/.sccache_check exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate. See https://cloud.google.com/storage/docs/gcs429.", reason: "rateLimitExceeded" }] } }
Context:
uri: https://storage.googleapis.com/upload/storage/v1/b/sccache-dev.cceckman.com/o?uploadType=media&name=.sccache_check
response: Parts { status: 429, version: HTTP/1.1, headers: {"x-guploader-uploadid": "ABPtcPqbL3v50VAMSqL6_ZjFxj2uDeoynIihcV-rphUg4aUrgdfKA-loIRJpUvbsJk84mL5ysnY", "content-type": "application/json; charset=UTF-8", "date": "Mon, 18 Mar 2024 20:41:27 GMT", "vary": "Origin", "vary": "X-Origin", "cache-control": "no-cache, no-store, max-age=0, must-revalidate", "expires": "Mon, 01 Jan 1990 00:00:00 GMT", "pragma": "no-cache", "content-length": "625", "server": "UploadServer", "alt-svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000"} }
service: gcs
path: .sccache_check
```
This appears to be from the server's stderr, which I have configured to stream to the same output. This message disappeared with my patch and with the `ASSUME_RW_MODE` flag in the environment.
The test itself didn't fail, though, which means this isn't a regression test. I want to:

- Figure out how to detect the failure: does it affect the `show-stats` call? Will we get a nonzero server exit status? ...can we do something other than just reading `stderr`?
- Figure out which bucket for `storage_check_gcs.sh` to use... hopefully there's an existing GCS bucket, service account / workload identity, etc. for CI workflows?

cc @satyajeet104, would you like to give this PR https://github.com/mozilla/sccache/pull/2133 a try and give us some feedback?
Thanks a lot guys, looks good to me!
As far as I can tell from `cd .github; rg SCCACHE_GCS`, there's no CI testing of the GCS integration. My tests so far are against a personal bucket; if there's a Mozilla org, bucket, etc. with appropriate custodianship, then I'd integrate with that. @Xuanwo @sylvestre Do you know if any such thing exists?
Otherwise, the test strategy will have to be based on "some other storage" which can force the error. "Force the error" would be... error if `.sccache_check` is accessed but allow other ops through normally? (Getting really fancy with FUSE....)
Or, I guess, check that it assumes "write" mode on a read-only filesystem storage, and check that it fails when it "needs" to do a write for something. My next test: if I point `--zero-stats` at a storage where the user doesn't have write permissions, what happens? Would that let us distinguish the `ASSUME_RW` case ~cheaply (without a complicated harness)?
I'm admittedly getting a little lost in working out how to reliably test this >.<
> Otherwise, the test strategy will have to be based on "some other storage" which can force the error. "Force the error" would be... error if `.sccache_check` is accessed but allow other ops through normally? (Getting really fancy with FUSE....)
We can test against S3 storage services. They should work the same.
> My next test: if I point `--zero-stats` at a storage where the user doesn't have write permissions, what happens?
`zero-stats` just cleans up the in-memory stats; it makes no changes to storage.
We're using sccache v0.7.6. We have a lot of builds running in parallel, and each run invokes sccache, which in turn tries to do an RW check by reading and writing the `.sccache_check` dummy file. While this is fine for reading, for writing we violate the quota on GCS, which allows writes to the same object only once per second.

Is there a way to bypass this check? We can perhaps create an env var to support this use case?