mozilla / sccache

Sccache is a ccache-like tool. It is used as a compiler wrapper and avoids compilation when possible. Sccache has the capability to utilize caching in remote storage environments, including various cloud storage options, or alternatively, in local storage.
Apache License 2.0
5.85k stars 552 forks source link

Hitting rate limiting issues due to .sccache_check write checks on GCS #2132

Open blunderbussatya opened 8 months ago

blunderbussatya commented 8 months ago

We're using sccache v0.7.6, we have a lot of builds running in parallel and each run invokes sccache which in turn tries to do a sccache RW check by reading and writing .sccache_check dummy file. While this is fine for reading but for writing we violate the quota on GCS which allows write on the same object once per second.

Is there a way to bypass this check? We can perhaps create an env var to support this use case?

Xuanwo commented 8 months ago

Yes, we can include an option to bypass the storage check if users verify their setup is accurate.

Xuanwo commented 8 months ago

I'm willing to provide mentorship for this issue.

Task

Implement an option to bypass the storage check if users verify their setup is accurate.

Mentorship

@xuanwo is willing to offer mentorship to help address this issue effectively. @xuanwo promises to respond to your questions regarding the implementation of this feature within 24 hours, excluding weekends.

After finishing this issue, you will

Notes

cceckman commented 8 months ago

I would be interested!

My contact info is at cceckman.com.

Xuanwo commented 8 months ago

I would be interested!

Thank you very much for your interest! Mails coming in.

cceckman commented 8 months ago

Some notes from investigation & discussion with @Xuanwo so far:

The next steps I'm looking at:

cceckman commented 8 months ago

config.rs defines the CacheModeConfig type, which is Serialize, Deserialize, and Into<CacheMode>.

So I think we want an Option<CacheModeConfig> somewhere in Config, that the server pulls on startup; falling back to the result of .check() if the option is None.

I'm unclear on is where to land this. The Config type has cache: Option<CacheType> alongside e.g. dist: DistConfig; CacheType is an enum over the various possible backends. There's not really a place to hang an "apply to all backends" flag other than the top-level....

But I guess if I think of it as something like server_startup_timeout, i.e. a setting on server startup rather than on the cache, then it makes sense to put it "at the same level".

Xuanwo commented 8 months ago

There's not really a place to hang an "apply to all backends" flag other than the top-level....

Yes, I think placing it along with server_startup_timeout looks better.

cceckman commented 8 months ago

OK - I have been able to ~manually reproduce this issue, and confirm that #2133 is a fix. I don't think the test is quite production-quality though.

This test script configures sccache to (a) use my personal GCS bucket for testing, and (b) ASSUME_RW_MODE=READ_WRITE. It invokes this test case, which launches 100 sccache servers concurrently and runs show-stats on all of them.

When I ran the tests without my changes patched in, I saw this error:

storage write check failed: Unexpected (permanent) at Writer::close => GcsErrorResponse { error: GcsError { code: 429, message: "The object sccache-dev.cceckman.com/.sccache_check exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate. See https://cloud.google.com/storage/docs/gcs429.", errors: [GcsErrorDetail { domain: "usageLimits", location: "", location_type: "", message: "The object sccache-dev.cceckman.com/.sccache_check exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate. See https://cloud.google.com/storage/docs/gcs429.", reason: "rateLimitExceeded" }] } }

Context:
   uri: https://storage.googleapis.com/upload/storage/v1/b/sccache-dev.cceckman.com/o?uploadType=media&name=.sccache_check
   response: Parts { status: 429, version: HTTP/1.1, headers: {"x-guploader-uploadid": "ABPtcPqbL3v50VAMSqL6_ZjFxj2uDeoynIihcV-rphUg4aUrgdfKA-loIRJpUvbsJk84mL5ysnY", "content-type": "application/json; charset=UTF-8", "date": "Mon, 18 Mar 2024 20:41:27 GMT", "vary": "Origin", "vary": "X-Origin", "cache-control": "no-cache, no-store, max-age=0, must-revalidate", "expires": "Mon, 01 Jan 1990 00:00:00 GMT", "pragma": "no-cache", "content-length": "625", "server": "UploadServer", "alt-svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000"} }
   service: gcs
   path: .sccache_check

This appears to be from the server's stderr, which I have configured to stream to the same output. This message disappeared with my patch, and with the ASSUME_RW_MODE flag in the environment.

The test itself didn't fail, though, which means this isn't a regression test. I want to:

Xuanwo commented 8 months ago

cc @satyajeet104, would you like to give this PR https://github.com/mozilla/sccache/pull/2133 a try and give us some feedbacks?

blunderbussatya commented 8 months ago

Thanks a lot guys, look good to me!

cceckman commented 7 months ago

As far as I can tell from cd .github; rg SCCACHE_GCS, there's no CI testing of the GCS integration. My tests so far are against a personal bucket; if there's a Mozilla org, bucket, etc. with appropriate custodianship, then I'd integrate with that. @Xuanwo @sylvestre Do you know if any such thing exists?

Otherwise, the test strategy will have to be based on "some other storage" which can force the error. "force the error" would be.... error if .sccache_check is accessed but allow other ops through normally? (Getting really fancy with FUSE....)

Or, I guess, check that it assumes "write" mode on a read-only filesystem storage- and check that it fails when it "needs" to do a write for something. My next test: if I point --zero-stats at a storage where the user doesn't have write permissions, what happens? Would that let us distinguish the ASSUME_RW case ~cheaply (without a complicated harness)?

I'm admittedly getting a little lost in working out how to reliably test this >.<

Xuanwo commented 7 months ago

Otherwise, the test strategy will have to be based on "some other storage" which can force the error. "force the error" would be.... error if .sccache_check is accessed but allow other ops through normally? (Getting really fancy with FUSE....)

We can test against s3 storage services. They should work like the same.

My next test: if I point --zero-stats at a storage where the user doesn't have write permissions, what happens?

zero-stats just clean up the in-memory stats, no changes to storage.