google / tensorstore

Library for reading and writing large multi-dimensional arrays.
https://google.github.io/tensorstore/
Other
1.34k stars 120 forks source link

Tensorstore key value store driver not Retrying on 429 Responses with S3 API #194

Open AdityaCohere opened 1 week ago

AdityaCohere commented 1 week ago

I'm using Tensorstore's key value store driver with OCI object storage using their S3 compatibility layer and am seeing that when the OCI Object Storage service sends 429s Tensorstore fails immediately without any retries. This was tested with debug logging enabled which also didn't show any retries happening either with the default value of 32 or with s3_request_retries set to a custom number. I also tried testing this @sjperkins 's PR here to see if that may enable retries to work here however I'm hitting bazel build issues on a M1 Mac of:

configurable attribute "content" in @com_github_aws_c_common//:write_config_h doesn't match this configuration. Would a default condition help?

I was wondering if there was anything I could do to debug the lack of retries further or any modifications I could make to force retries.

laramiel commented 1 week ago

The retry decision happens here (or one of the other methods in the file): https://github.com/google/tensorstore/blob/2a44dc7d9b08f6f46e597c01457efdf1e18e54db/tensorstore/kvstore/s3/s3_metadata.cc#L218

You could turn on logging. TENSORSTORE_VERBOSE_LOGGING=s3=1.

sjperkins commented 1 week ago

Unfortunately #149 doesn't integrate with any kvstore at present.

AdityaCohere commented 1 week ago

The retry decision happens here (or one of the other methods in the file):

https://github.com/google/tensorstore/blob/2a44dc7d9b08f6f46e597c01457efdf1e18e54db/tensorstore/kvstore/s3/s3_metadata.cc#L218

You could turn on logging. TENSORSTORE_VERBOSE_LOGGING=s3=1.

Yea, I've tried turning on Logging as well with no output specifying any retries. I see that the 429 response is received and nothing beyond that other than a failure. I will repro this and get the debug logs

AdityaCohere commented 1 week ago

Unfortunately #149 doesn't integrate with any kvstore at present.

Ah, this is unfortunate to hear, would it non-trivial to integrate?

sjperkins commented 1 week ago

Unfortunately #149 doesn't integrate with any kvstore at present.

Ah, this is unfortunate to hear, would it non-trivial to integrate?

It'd take some work, I guess the amount of code required would be smaller than the current s3 kvstore as it could defer to the C++ S3 SDK.

However, the C++ SDK is a fairly heavy dependency: the S3 C CRT looks much lighter. That's next, when I find the time.

laramiel commented 1 week ago

You could turn on logging. TENSORSTORE_VERBOSE_LOGGING=s3=1.

Yea, I've tried turning on Logging as well with no output specifying any retries. I see that the 429 response is received and nothing beyond that other than a failure. I will repro this and get the debug logs

With s3=1 you should see a line from s3_key_value_store.cc:435 ReadTask or s3_key_value_store.cc:717 WriteTask or such. Could you attach those lines?

Also, you could attempt to run localstack_test against OCI object storage.

git clone https://github.com/google/tensorstore.git
cd tenstorstore
bazelisk.py build tensorstore/kvstore/s3:localstack_test
./bazel-bin/tensorstore/kvstore/s3/localstack_test --tensorstore_verbose_logging=all=2 --localstack_endpoint=...  <other flags>
AdityaCohere commented 6 days ago

Hey, I've tried turning on logging and I don't seem to see any logs regarding s3 being printed. I get a rate limited error but no read/write tasks. I've tried TENSORSTORE_VERBOSE_LOGGING=s3=1 as well as TENSORSTORE_VERBOSE_LOGGING=all Is there anything else I can try?