pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0

GitHub Actions Cache backend easily hits rate limit errors #20133

Open huonw opened 1 year ago

huonw commented 1 year ago

Describe the bug

The new experimental remote cache backend for GitHub Actions (#19831) seems to hit rate limits easily in practice, presumably because the GHA cache is designed for coarse-grained caching of larger artefacts rather than the very fine-grained caching of individual items that Pants uses it for.

For instance (full messages below):

[WARN] Failed to read from remote cache (1 occurrences so far): failed to read ...: RateLimited (persistent) at read, context: ... {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}

[WARN] Failed to write to remote cache (1 occurrences so far): failed to write bytes to ...: RateLimited (persistent) at write, context: ... {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}

Click to expand for example full error messages

read:

```
09:55:54.25 [WARN] Failed to read from remote cache (1 occurrences so far): failed to read huge/action-cache/2a/50/2a505b2fedb6b55150a17913680431218ad998fce87fd6e91007ac2cb520247a: RateLimited (persistent) at read, context: { uri: https://acghubeus1.actions.githubusercontent.com/swYrhypG2MJmMQxwjZTKUlTTHNgLiRmyMXAqEruwvhFVEC0MZe/_apis/artifactcache/cache?keys=huge/action-cache/2a/50/2a505b2fedb6b55150a17913680431218ad998fce87fd6e91007ac2cb520247a&version=pants-1, response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json", "date": "Wed, 01 Nov 2023 09:55:53 GMT", "server": "Kestrel", "retry-after": "60", "transfer-encoding": "chunked", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "176cbe80-c272-43ec-99d8-58b95bda99a0", "activityid": "0f8fe3f9-9df0-4fec-abdd-f6d8e254b015", "x-tfs-session": "0f8fe3f9-9df0-4fec-abdd-f6d8e254b015", "x-vss-e2eid": "0f8fe3f9-9df0-4fec-abdd-f6d8e254b015", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Short", "x-ratelimit-limit": "1623", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1698832717", "x-tfs-serviceerror": "Request+was+blocked+due+to+exceeding+usage+of+resource+%27Count%27+in+namespace+%27%27."} }, service: ghac, path: huge/action-cache/2a/50/2a505b2fedb6b55150a17913680431218ad998fce87fd6e91007ac2cb520247a, range: 0- } => {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
```

write:

```
09:56:01.76 [WARN] Failed to write to remote cache (1 occurrences so far): failed to write bytes to huge/byte-store/a7/22/a7227d8f74ae3772c6e65f9cd738c5fb8abf6ad5e97126eb7783e4ab6a79224b: RateLimited (persistent) at write, context: { uri: https://acghubeus1.actions.githubusercontent.com/swYrhypG2MJmMQxwjZTKUlTTHNgLiRmyMXAqEruwvhFVEC0MZe/_apis/artifactcache/caches, response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json", "date": "Wed, 01 Nov 2023 09:56:01 GMT", "server": "Kestrel", "retry-after": "286", "transfer-encoding": "chunked", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "1fb111d5-08ce-442f-ad9a-1c9c1c64f68f", "activityid": "bcaa7044-b379-48aa-aeb5-3917514ee9be", "x-tfs-session": "bcaa7044-b379-48aa-aeb5-3917514ee9be", "x-vss-e2eid": "bcaa7044-b379-48aa-aeb5-3917514ee9be", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Short", "x-ratelimit-limit": "734", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1698832717", "x-tfs-serviceerror": "Request+was+blocked+due+to+exceeding+usage+of+resource+%27Count%27+in+namespace+%27%27."} }, called: Backend::ghac_reserve, service: ghac, path: huge/byte-store/a7/22/a7227d8f74ae3772c6e65f9cd738c5fb8abf6ad5e97126eb7783e4ab6a79224b } => {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
```

This means the backend seems to work well for small examples but falls over as builds become larger, so mitigations are likely required.

For instance, https://github.com/huonw/pants-cache-github-actions-poc/tree/cf5c53b368f0b5229f103465f25005b6db465988 (on the huge branch) sets up 100 codegen targets that generate 100 files each, plus a corresponding test so that `pants test ::` forces the codegen targets to do something. This means 200 process executions and 10k files need to be uploaded to or downloaded from the GHA cache in quick succession.
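As a rough sanity check of that scale (a sketch; it assumes one request per generated file, which is an underestimate since existence checks and action-cache entries add more), the generated files alone span several of the rate-limit windows seen in the errors above:

```python
import math

# Scale of the PoC repo described above: 100 codegen targets that
# each generate 100 files, plus a test run per target.
codegen_targets = 100
files_per_target = 100

process_executions = 2 * codegen_targets           # codegen runs + test runs
generated_files = codegen_targets * files_per_target

# "x-ratelimit-limit": "1623" from the 429 read response above; even
# at one request per file, the files span multiple full windows.
short_window_limit = 1623
windows_needed = math.ceil(generated_files / short_window_limit)

print(process_executions, generated_files, windows_needed)
```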

A build that has a (partially) seeded remote cache quickly hits errors: https://github.com/huonw/pants-cache-github-actions-poc/actions/runs/6718179818/job/18257496697

Setting `remote_store_rpc_concurrency = 1` (down from the default of 128) didn't stop this occurring in my work repo, i.e. even maxed-out sequential interactions seemed to be enough to hit the rate limits. (It did appear to help with the reduced example above.)
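For reference, a minimal sketch of the relevant `pants.toml` fragment for that experiment (only the options named in this thread; the concurrency value is the experiment described above, not a recommendation):

```toml
[GLOBAL]
remote_cache_read = true
remote_cache_write = true
# Down from the default of 128, as described above; this appeared to
# help the reduced example but not the larger work repo.
remote_store_rpc_concurrency = 1
```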

Picking out headers from those messages that might seem relevant:
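These values are copied verbatim from the 429 read response above; as a small sketch, the interesting ones are easy to pull out of the header map:

```python
# Rate-limit headers copied verbatim from the 429 read response above.
headers = {
    "retry-after": "60",
    "x-ratelimit-resource": "ArtifactCache/Short",
    "x-ratelimit-limit": "1623",
    "x-ratelimit-remaining": "0",
    "x-ratelimit-reset": "1698832717",
}

# x-ratelimit-reset is a unix timestamp; remaining hitting 0 is what
# triggers the RequestBlockedException responses.
limit = int(headers["x-ratelimit-limit"])
remaining = int(headers["x-ratelimit-remaining"])
print(headers["x-ratelimit-resource"], limit, remaining)
```

Note the write response hit the same `ArtifactCache/Short` resource with a lower limit (734).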

Info about rate limits and other similar tools' experience:

Pants version 2.19.0.dev4

OS GHA ubuntu runner

Additional info N/A

huonw commented 1 year ago

Brainstorming some potential ways to improve this:

  1. use the "GET API" somehow (no idea how, yet)
  2. reduce the number of requests by aggregating more blobs, particularly in the byte store:
    1. directories: upload whole Trees instead of directories recursively referencing more directories (Stu's idea in https://pantsbuild.slack.com/archives/C0D7TNJHL/p1684812802859609?thread_ts=1684809167.631159&cid=C0D7TNJHL)
    2. file blobs: batch blobs into a single upload somehow, e.g. all (small-enough) descendants of a single directory, keyed by some aggregate digest of the directory and their parent (NB. this has to be an obvious/reproducible algorithm, so that the corresponding digest can be computed when reading, with less information available than when writing)
  3. Magically increase GitHub's rate limits
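Idea 2.2 hinges on a reproducible aggregate key. A minimal sketch of one way to derive such a key, assuming SHA-256 hex digests like the ones in the paths above (illustrative only, not Pants's actual digest scheme):

```python
import hashlib

def aggregate_key(directory_digest: str, child_digests: list[str]) -> str:
    """Derive a reproducible batch key from a directory's digest and the
    digests of its (small-enough) descendant file blobs.

    Sorting the children makes the key independent of traversal order,
    so a reader that knows the same directory structure can recompute
    the key with less information than the writer had.
    """
    h = hashlib.sha256()
    h.update(directory_digest.encode())
    for child in sorted(child_digests):
        h.update(child.encode())
    return h.hexdigest()

# Same inputs always produce the same key, regardless of order:
a = aggregate_key("d1", ["aaa", "bbb", "ccc"])
b = aggregate_key("d1", ["ccc", "aaa", "bbb"])
assert a == b
```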
huonw commented 11 months ago

I've asked support and they said:

huonw commented 11 months ago

The #20138 metrics look like this in https://github.com/huonw/pants-cache-github-actions-poc/tree/huge. In particular, there are a lot of `remote_store_exists_attempts` that could potentially be skipped in favour of just optimistic reads/writes? I suspect that won't be enough, though.

  remote_cache_read_errors: 0
  remote_cache_request_timeouts: 0
  remote_cache_requests: 200
  remote_cache_requests_cached: 55
  remote_cache_requests_uncached: 138
  remote_cache_write_attempts: 145
  remote_cache_write_errors: 0
  remote_cache_write_successes: 15
  remote_store_exists_attempts: 7171
  remote_store_exists_errors: 0
  remote_store_exists_successes: 4444
  remote_store_missing_digest: 0
  remote_store_read_attempts: 2929
  remote_store_read_cached: 2315
  remote_store_read_errors: 0
  remote_store_read_uncached: 0
  remote_store_request_timeouts: 0
  remote_store_write_attempts: 3811
  remote_store_write_errors: 0
  remote_store_write_successes: 1222
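To put those metrics against the limits from the earlier 429 responses: summing just the store-level attempts (a rough sketch; it ignores the action-cache requests), existence checks make up over half the traffic in a single run:

```python
# Store-level attempt counts copied from the metrics above.
exists_attempts = 7171
read_attempts = 2929
write_attempts = 3811

total = exists_attempts + read_attempts + write_attempts
exists_share = exists_attempts / total

print(total)                      # total store requests in one run
print(round(exists_share, 2))     # fraction that are existence checks
```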
danikv commented 5 months ago

Any updates? We are getting the rate limit response quite often:

```
Failed to write to remote cache (1 occurrences so far): failed to write bytes to byte-store/f7/96/f796d90683a5d2eb70012e696ad1cc6f5abbc0cd45055a885f8c0869bec45c19: RateLimited (temporary) at Writer::close, context: { uri: https://acghubeus1.actions.githubusercontent.com/XOBKJTEeAYy204PqRtoOovZrR7W6T39UkmYry2K2ZH3xteFzM8/_apis/artifactcache/caches/283150, response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json", "date": "Mon, 20 May 2024 08:30:54 GMT", "server": "Kestrel", "retry-after": "60", "transfer-encoding": "chunked", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "6b4507dd-bb87-48e4-9172-1cc478e2d4e8", "activityid": "19275ed8-68b6-492f-8c72-23792dc4fa87", "x-tfs-session": "19275ed8-68b6-492f-8c72-23792dc4fa87", "x-vss-e2eid": "19275ed8-68b6-492f-8c72-23792dc4fa87", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Long", "x-ratel
```

huonw commented 5 months ago

No change that I know of.