**huonw** opened this issue 1 year ago
Brainstorming some potential ways to improve this:
- Trees instead of directories recursively referencing more directories (Stu's idea in https://pantsbuild.slack.com/archives/C0D7TNJHL/p1684812802859609?thread_ts=1684809167.631159&cid=C0D7TNJHL)

I've asked support, and they said to see `actions/cache` for more info: https://github.com/actions/cache/issues/1280

The #20138 metrics look like this in https://github.com/huonw/pants-cache-github-actions-poc/tree/huge. In particular, there's a lot of `remote_store_exists_attempts` that could potentially be skipped in favour of just optimistic reads/writes? I suspect it won't be enough, though.
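The "optimistic reads/writes" idea could look roughly like the sketch below (the store interface and names here are hypothetical, not the actual Pants remote-store API): instead of issuing an exists-check before every write, just attempt the write and treat an already-exists conflict as success, so a blob that's already cached costs one round trip instead of two.

```python
# Hypothetical sketch of "optimistic write": skip the exists-check and
# treat an already-exists conflict as success. Names are illustrative,
# not the real Pants/opendal API.

class AlreadyExists(Exception):
    """Raised by the fake store when a key is already present."""

class FakeStore:
    def __init__(self):
        self.blobs = {}
        self.requests = 0  # every round trip counts against the rate limit

    def exists(self, key):
        self.requests += 1
        return key in self.blobs

    def write(self, key, data):
        self.requests += 1
        if key in self.blobs:
            raise AlreadyExists(key)
        self.blobs[key] = data

def upload_check_then_write(store, key, data):
    # Current-style pattern: exists-check, then write.
    # A cache miss costs 2 requests (exists + write).
    if not store.exists(key):
        store.write(key, data)

def upload_optimistic(store, key, data):
    # Optimistic pattern: always exactly 1 request; a conflict means the
    # content-addressed blob is already there, which is fine.
    try:
        store.write(key, data)
    except AlreadyExists:
        pass
```

For a blob that isn't cached yet, the check-then-write pattern spends two requests where the optimistic one spends one, which is where the rate-limit budget would be recovered.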
```
remote_cache_read_errors: 0
remote_cache_request_timeouts: 0
remote_cache_requests: 200
remote_cache_requests_cached: 55
remote_cache_requests_uncached: 138
remote_cache_write_attempts: 145
remote_cache_write_errors: 0
remote_cache_write_successes: 15
remote_store_exists_attempts: 7171
remote_store_exists_errors: 0
remote_store_exists_successes: 4444
remote_store_missing_digest: 0
remote_store_read_attempts: 2929
remote_store_read_cached: 2315
remote_store_read_errors: 0
remote_store_read_uncached: 0
remote_store_request_timeouts: 0
remote_store_write_attempts: 3811
remote_store_write_errors: 0
remote_store_write_successes: 1222
```
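Summing the attempt counters above makes the point about the exists-checks concrete (figures copied from the metrics; the arithmetic is just back-of-the-envelope):

```python
# Back-of-the-envelope arithmetic on the remote-store metrics above:
# exists-checks account for roughly half of all store round trips.
metrics = {
    "remote_store_exists_attempts": 7171,
    "remote_store_read_attempts": 2929,
    "remote_store_write_attempts": 3811,
}
total = sum(metrics.values())  # 13911 round trips in total
exists_share = metrics["remote_store_exists_attempts"] / total
print(f"{total} store requests, {exists_share:.0%} are exists-checks")
```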
Any updates? We are getting the rate limit response quite often:
```
Failed to write to remote cache (1 occurrences so far): failed to write bytes to byte-store/f7/96/f796d90683a5d2eb70012e696ad1cc6f5abbc0cd45055a885f8c0869bec45c19: RateLimited (temporary) at Writer::close, context: { uri: https://acghubeus1.actions.githubusercontent.com/XOBKJTEeAYy204PqRtoOovZrR7W6T39UkmYry2K2ZH3xteFzM8/_apis/artifactcache/caches/283150, response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json", "date": "Mon, 20 May 2024 08:30:54 GMT", "server": "Kestrel", "retry-after": "60", "transfer-encoding": "chunked", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "6b4507dd-bb87-48e4-9172-1cc478e2d4e8", "activityid": "19275ed8-68b6-492f-8c72-23792dc4fa87", "x-tfs-session": "19275ed8-68b6-492f-8c72-23792dc4fa87", "x-vss-e2eid": "19275ed8-68b6-492f-8c72-23792dc4fa87", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Long", "x-ratel
```
No change that I know of.
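One client-side mitigation (a sketch of a common pattern, not something Pants currently does as far as I know) would be to honour the `retry-after` header on 429 responses before retrying, falling back to exponential backoff when the header is absent. The helper name and signature here are hypothetical:

```python
# Sketch of a backoff policy for 429 responses: honour the server's
# retry-after header (seconds) when present, otherwise back off
# exponentially. Hypothetical helper, not part of Pants or opendal.

def backoff_delay(headers: dict, attempt: int,
                  base: float = 1.0, cap: float = 300.0) -> float:
    retry_after = headers.get("retry-after")
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass  # e.g. an HTTP-date value; fall through to exponential
    return min(base * (2 ** attempt), cap)
```

The `retry-after` values in the logs here (60s, 286s) are well above typical retry loops' fixed delays, which may be why simple retries keep hitting the limit.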
**Describe the bug**
The new experimental remote cache backend for GitHub Actions (#19831) seems to easily hit rate limits in practice, presumably because it's designed for coarse-grained caching of larger artefacts rather than the very fine-grained caching of individual items that Pants uses it for.
For instance (full messages below):
```
[WARN] Failed to read from remote cache (1 occurrences so far): failed to read ...: RateLimited (persistent) at read, context: ... {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
[WARN] Failed to write to remote cache (1 occurrences so far): failed to write bytes to ...: RateLimited (persistent) at write, context: ... {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}
```
Click to expand for example full error messages
read: ```09:55:54.25 [WARN] Failed to read from remote cache (1 occurrences so far): failed to read huge/action-cache/2a/50/2a505b2fedb6b55150a17913680431218ad998fce87fd6e91007ac2cb520247a: RateLimited (persistent) at read, context: { uri: https://acghubeus1.actions.githubusercontent.com/swYrhypG2MJmMQxwjZTKUlTTHNgLiRmyMXAqEruwvhFVEC0MZe/_apis/artifactcache/cache?keys=huge/action-cache/2a/50/2a505b2fedb6b55150a17913680431218ad998fce87fd6e91007ac2cb520247a&version=pants-1, response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json", "date": "Wed, 01 Nov 2023 09:55:53 GMT", "server": "Kestrel", "retry-after": "60", "transfer-encoding": "chunked", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "176cbe80-c272-43ec-99d8-58b95bda99a0", "activityid": "0f8fe3f9-9df0-4fec-abdd-f6d8e254b015", "x-tfs-session": "0f8fe3f9-9df0-4fec-abdd-f6d8e254b015", "x-vss-e2eid": "0f8fe3f9-9df0-4fec-abdd-f6d8e254b015", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Short", "x-ratelimit-limit": "1623", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1698832717", "x-tfs-serviceerror": "Request+was+blocked+due+to+exceeding+usage+of+resource+%27Count%27+in+namespace+%27%27."} }, service: ghac, path: huge/action-cache/2a/50/2a505b2fedb6b55150a17913680431218ad998fce87fd6e91007ac2cb520247a, range: 0- } => {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}``` write: ```09:56:01.76 [WARN] Failed to write to remote cache (1 occurrences so far): failed to write bytes to huge/byte-store/a7/22/a7227d8f74ae3772c6e65f9cd738c5fb8abf6ad5e97126eb7783e4ab6a79224b: RateLimited (persistent) at write, 
context: { uri: https://acghubeus1.actions.githubusercontent.com/swYrhypG2MJmMQxwjZTKUlTTHNgLiRmyMXAqEruwvhFVEC0MZe/_apis/artifactcache/caches, response: Parts { status: 429, version: HTTP/1.1, headers: {"content-type": "application/json", "date": "Wed, 01 Nov 2023 09:56:01 GMT", "server": "Kestrel", "retry-after": "286", "transfer-encoding": "chunked", "strict-transport-security": "max-age=2592000", "x-tfs-processid": "1fb111d5-08ce-442f-ad9a-1c9c1c64f68f", "activityid": "bcaa7044-b379-48aa-aeb5-3917514ee9be", "x-tfs-session": "bcaa7044-b379-48aa-aeb5-3917514ee9be", "x-vss-e2eid": "bcaa7044-b379-48aa-aeb5-3917514ee9be", "x-vss-senderdeploymentid": "e3de3c8c-fa12-318e-3301-e1d3e326b2e8", "x-frame-options": "SAMEORIGIN", "x-ratelimit-resource": "ArtifactCache/Short", "x-ratelimit-limit": "734", "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1698832717", "x-tfs-serviceerror": "Request+was+blocked+due+to+exceeding+usage+of+resource+%27Count%27+in+namespace+%27%27."} }, called: Backend::ghac_reserve, service: ghac, path: huge/byte-store/a7/22/a7227d8f74ae3772c6e65f9cd738c5fb8abf6ad5e97126eb7783e4ab6a79224b } => {"$id":"1","innerException":null,"message":"Request was blocked due to exceeding usage of resource 'Count' in namespace ''.","typeName":"Microsoft.TeamFoundation.Framework.Server.RequestBlockedException, Microsoft.TeamFoundation.Framework.Server","typeKey":"RequestBlockedException","errorCode":0,"eventId":3000}``` (NB. inline code formatting to get better line breaking)

This means it seems to work well for small examples, but falls over as they become larger. Thus, mitigations are likely required.
For instance, https://github.com/huonw/pants-cache-github-actions-poc/tree/cf5c53b368f0b5229f103465f25005b6db465988 (on the `huge` branch) sets up 100 codegen targets that generate 100 files each, and a corresponding test, so that `pants test ::` will force the codegen targets to do something. This means 200 process executions and 10k files need to be uploaded to or downloaded from the GHA cache in quick succession.

A build that has a (partially) seeded remote cache quickly hits errors: https://github.com/huonw/pants-cache-github-actions-poc/actions/runs/6718179818/job/18257496697
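The arithmetic suggests why this falls over: the blob count alone exceeds the per-window `x-ratelimit-limit` values visible in the 429 response headers. A quick sanity check (figures copied from the repro and the error headers):

```python
# Request-budget arithmetic for the reproduction repo: even a single
# request per generated file exceeds the observed per-window limits.
codegen_targets = 100
files_per_target = 100
blobs = codegen_targets * files_per_target   # 10_000 files to move
observed_limits = (1623, 734)                # x-ratelimit-limit values seen
assert all(blobs > limit for limit in observed_limits)
print(f"{blobs} blobs vs per-window limits {observed_limits}")
```

That is before counting exists-checks, action-cache lookups, and retries, each of which multiplies the request count further.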
Setting `remote_store_rpc_concurrency = 1` (down from the default of 128) didn't stop this occurring in my work repo, i.e. maxed-out sequential interactions seemed to be enough to hit the rate limits too. (It did appear to help with the reduced example above.)

Picking out headers from those messages that seem relevant:
- The `x-ratelimit-limit` header has values 1623 and 734. I've seen it as high as ~3000 in my work repo.
- The `x-ratelimit-remaining` header unsurprisingly always has value 0.
- The `x-ratelimit-reset` (unix timestamp?) and `retry-after` (seconds to wait?) headers don't exactly align, but they're both on the order of minutes to wait. E.g. for the first one, the request ran at 2023-11-01T09:55:53+00:00 and the rate limiting says:
  - `x-ratelimit-reset`: 2023-11-01T09:58:37+00:00 (`datetime.fromtimestamp(1698832717).astimezone(timezone.utc).isoformat()`)
  - `date` + `retry-after`: 2023-11-01T09:56:53+00:00 (`(datetime(2023, 11, 1, 9, 55, 53, tzinfo=timezone.utc) + timedelta(seconds=60)).isoformat()`)

Info about rate limits and other similar tools experience:
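The two header timestamp conversions above can be checked with a self-contained snippet (values copied straight from the first 429 response's headers):

```python
# Verify the header timestamp arithmetic from the first 429 response.
from datetime import datetime, timedelta, timezone

# x-ratelimit-reset is a unix timestamp:
reset = datetime.fromtimestamp(1698832717, tz=timezone.utc)

# date + retry-after (60 s) gives the earliest retry time:
retry_at = datetime(2023, 11, 1, 9, 55, 53, tzinfo=timezone.utc) \
    + timedelta(seconds=60)

print(reset.isoformat())     # 2023-11-01T09:58:37+00:00
print(retry_at.isoformat())  # 2023-11-01T09:56:53+00:00
```

So the server tells the client to retry after about a minute, while the window itself resets almost three minutes later; both are far longer than a typical in-process retry would wait.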
**Pants version**
2.19.0.dev4

**OS**
GHA ubuntu runner

**Additional info**
N/A