pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0
3.32k stars 636 forks source link

remote cache ResourceExhausted/OutOfRange errors #20674

Open jasonwbarnett opened 8 months ago

jasonwbarnett commented 8 months ago

Describe the bug

When remote caching is enabled and you use adhoc_tool that outputs a directory with a ton of files (i.e. 180K), pants attempts to call FindMissingBlobs (/build.bazel.remote.execution.v2.ContentAddressableStorage/FindMissingBlobs) and the message exceeds 4MB which leads to a ResourceExhausted or OutOfRange error depending on the configuration of the remote caching service. In my case I am using bazel-remote. I started by modifying bazel-remote's MaxRecvMsgSize which revealed a different error (OutOfRange).

Before increasing MaxRecvMsgSize in bazel-remote

[WARN] Failed to write to remote cache (1 occurrences so far): ResourceExhausted: "grpc: received message larger than max (5790951 vs. 4194304)"

This error message is coming from bazel-remote.

After increasing MaxRecvMsgSize in bazel-remote

[WARN] Failed to write to remote cache (1 occurrences so far): OutOfRange: "Error, message length too large: found 5790600 bytes, the limit is: 4194304 bytes"

This error is coming from tonic. I validated this by modifying tonic and updating pants to use a branch on my tonic fork. I ended up submitting this PR: https://github.com/hyperium/tonic/pull/1658.

I took this one step further. I updated tonic's DEFAULT_MAX_RECV_MESSAGE_SIZE (jwb/set-max-message-size-very-high) and validated that everything functions correctly after I do that. Which to be clear, both the client (i.e. pants via tonic) and the server (i.e. bazel-remote) needed to be modified to support the large FindMissingBlobs message.

I see at least two possible solutions:

  1. Add a configuration option to pants which controls tonic's max_decoding_message_size. To be clear, I'm not sure this there is a relationship between max_decoding_message_size and DEFAULT_MAX_RECV_MESSAGE_SIZE, but I'm hoping so 🤞
  2. Update pants so that chunks or streams the FindMissingBlobs calls to the server.

Pants version

2.19.0

OS

Both Linux and Mac OS.

jasonwbarnett commented 7 months ago

I just tested this and confirmed it works instead of modifying tonic's DEFAULT_MAX_RECV_MESSAGE_SIZE constant.

diff --git a/src/rust/engine/remote_provider/remote_provider_reapi/src/byte_store.rs b/src/rust/engine/remote_provider/remote_provider>
index 839ee07d26..e9f7bb6955 100644
--- a/src/rust/engine/remote_provider/remote_provider_reapi/src/byte_store.rs
+++ b/src/rust/engine/remote_provider/remote_provider_reapi/src/byte_store.rs
@@ -92,9 +92,10 @@ impl Provider {
             Some((options.timeout, Metric::RemoteStoreRequestTimeouts)),
         );

+        let limit = 25 * 1024 * 1024;
         let byte_stream_client = Arc::new(ByteStreamClient::new(channel.clone()));

-        let cas_client = Arc::new(ContentAddressableStorageClient::new(channel.clone()));
+        let cas_client = Arc::new(ContentAddressableStorageClient::new(channel.clone()).max_decoding_message_size(limit));

         let capabilities_client = Arc::new(CapabilitiesClient::new(channel));
benjyw commented 7 months ago

Thanks for the legwork @jasonwbarnett . So do we want to add an option to control max_decoding_message_size? That wouldn't be too hard to do.

jasonwbarnett commented 7 months ago

@benjyw Yes, please. That would be amazing!

tgolsson commented 7 months ago

Wouldn't it be better to stick to defaults and do multiple requests if the payload is too large? It's my preferred solution when possible, since it ensures compatibility with all servers.

jasonwbarnett commented 7 months ago

@tgolsson 100% -- But if that is not going to happen because it's say 10x the effort to implement, I'd rather have something that allows me to be unblocked than to continue in a position where I have no viable options. Plus I still think there are legitimate use cases to increase the maximum message size. I believe it's more efficient if you have a lot of large messages.

That said, my vote would be to add the functionality that allows tuning the maximum message size because I think it's valuable both tactically and strategically for certain situations. I also think it's of even more value to update the code paths to chunk messages into appropriate sizes based on the maximum message size set which would make this compatible with every server, as mentioned.

tgolsson commented 7 months ago

Depending on content, it's fairly trivial to chunk a too-big request. Might be more efficient too, since the remote server could process all smaller chunks in parallel, and you can decode responses in parallel. In the case of FindMissingBlobs, since the digest is fixed size we know exactly how many digests can fit in 4MB, and can chunk it appropriately.

tgolsson commented 7 months ago

@jasonwbarnett if you have a repro, can you validate https://github.com/pantsbuild/pants/pull/20708 fixes it?

jasonwbarnett commented 7 months ago

@jasonwbarnett if you have a repro, can you validate https://github.com/pantsbuild/pants/pull/20708 fixes it?

I totally missed this. Let me test tomorrow and let you know.

jasonwbarnett commented 6 months ago

@tgolsson I tested https://github.com/pantsbuild/pants/pull/20708 and never saw any errors! 🎉

tgolsson commented 6 months ago

Thanks! I've been terribly busy lately but hopefully I'll get around to merging it soon.