Open raulchen opened 3 weeks ago
Another option is we retry ACCESS_DENIED for read tasks, but don't retry it for metadata fetching tasks. Because if it's a real authentication issue, the metadata fetching tasks will first raise the error. But this can still cause confusion if the user has permission for the directory, but not for a file in the directory.
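A minimal sketch of that split policy, assuming errors are matched by substring against the exception message. The names `should_retry`, `READ_ONLY_RETRIED_PATTERNS`, and the `task_kind` values are hypothetical and are not part of Ray Data's API:

```python
# Hypothetical sketch, not Ray Data's actual retry logic.
# Patterns retried only for read tasks, never for metadata fetching.
READ_ONLY_RETRIED_PATTERNS = ("AWS Error ACCESS_DENIED",)


def should_retry(error: Exception, task_kind: str) -> bool:
    """Return True if `error` should be retried for the given task kind.

    `task_kind` is an assumed label: "read" for read tasks,
    "metadata" for metadata fetching tasks.
    """
    if task_kind != "read":
        # Metadata fetching runs first, so a real authentication problem
        # surfaces here without being masked by retries.
        return False
    message = str(error)
    return any(pattern in message for pattern in READ_ONLY_RETRIED_PATTERNS)
```

As noted above, this still retries pointlessly when the user can list the directory but lacks permission for one file inside it.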
A more detailed explanation from ChatGPT of why this error can be mistakenly raised:
The “ACCESS_DENIED” error message during network operations like HeadObject can sometimes be misleading when dealing with intermittent network issues. Here’s why this might happen:
1. Network Interruption Leading to Misinterpretation
• Inconsistent Connectivity: If there’s a brief network glitch or packet loss, the request might not reach the AWS server correctly, or the response might not be received properly. When the request is incomplete or garbled due to the network issue, the server might not be able to authenticate it correctly, causing it to respond with an “ACCESS_DENIED” error.
• Fallback Errors: In some cases, if the AWS service doesn’t receive enough information to determine the actual error, it might default to an “ACCESS_DENIED” response. This is a conservative approach to avoid accidentally exposing resources.
2. Timeouts Misinterpreted as Access Issues
• Timeout Handling: When a request times out due to network issues, the client library (pyarrow in this case) might interpret the lack of a proper response as an “ACCESS_DENIED” error because it didn’t receive the expected authorization confirmation from the server.
• Partial Responses: Sometimes, the client might receive a partial response before the connection drops. If the response lacks the necessary authentication data, it could be interpreted as an access denial rather than a network failure.
3. Boto3 or Pyarrow Error Handling
• Error Mapping: The boto3 or pyarrow library might map certain low-level network errors to higher-level errors like “ACCESS_DENIED” if the error occurs during a critical authentication step. This can be due to how the libraries abstract away the complexity of handling AWS responses.
• Inconsistent Error Messages: The error handling mechanism in these libraries may not always distinguish clearly between an access denial and a network issue, especially if the error occurs at a point where access checks are involved.
4. Load Balancer or CDN Issues
• AWS Infrastructure: If AWS’s load balancers or edge nodes experience brief issues, the requests might be routed in ways that cause them to fail. In such cases, the error might be incorrectly flagged as an access issue when it’s actually a transient infrastructure problem.
5. DNS Resolution Problems
• DNS Resolution Failures: If there’s an intermittent DNS resolution failure, the request might not reach the correct endpoint, leading to an incorrect “ACCESS_DENIED” response due to a failure in resolving the proper S3 bucket URL.
linking a related issue https://github.com/ray-project/ray/issues/42153
The current theory behind the root cause is that the original credentials become unavailable in the middle of execution, possibly due to a pyarrow.fs bug. The suggested workaround is to explicitly define a filesystem using credentials generated with boto3 and pass it to the read method you are using. For example:
import ray

def get_s3fs_with_boto_creds():
    import boto3
    from pyarrow import fs

    # Resolve credentials through boto3 rather than pyarrow's own lookup.
    credentials = boto3.Session().get_credentials()
    s3fs = fs.S3FileSystem(
        access_key=credentials.access_key,
        secret_key=credentials.secret_key,
        session_token=credentials.token,
        region=...,
    )
    return s3fs

# Pass the explicitly constructed filesystem to the read call.
fs = get_s3fs_with_boto_creds()
ds = ray.data.read_images(..., filesystem=fs)
Potential downsides for this workaround are:
Sometimes we get ACCESS_DENIED errors when read task concurrency is high or when the network is unstable.
This can happen even when credentials are properly set. E.g. in some cases, if the AWS service doesn’t receive enough information to determine the actual error, it might default to an “ACCESS_DENIED” response.
Currently we don't retry on ACCESS_DENIED errors because we cannot distinguish transient errors from real authentication errors. In both cases, we get the same error:
OSError: When getting information for key '...' in bucket '...': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
When this happens, reducing concurrency may help. If you are sure about your credential setup, another solution is to manually add the ACCESS_DENIED error to the retry list. See comment below for a potential workaround.
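To illustrate what adding ACCESS_DENIED to a retry list amounts to, here is a rough, self-contained sketch of substring-based retries with exponential backoff. `RETRIED_ERROR_PATTERNS` and `call_with_retries` are hypothetical names for illustration only; they are not Ray Data or pyarrow APIs, and the attempt count and delay are arbitrary assumptions:

```python
import time

# Hypothetical retry list: substrings of error messages treated as transient.
RETRIED_ERROR_PATTERNS = ["AWS Error ACCESS_DENIED"]


def call_with_retries(fn, patterns=RETRIED_ERROR_PATTERNS,
                      max_attempts=3, base_delay_s=1.0):
    """Call `fn()`, retrying OSErrors whose message matches a pattern."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except OSError as e:
            retryable = any(p in str(e) for p in patterns)
            if not retryable or attempt == max_attempts:
                # Unknown error, or retries exhausted: surface it unchanged.
                raise
            # Exponential backoff: base, 2x base, 4x base, ...
            time.sleep(base_delay_s * 2 ** (attempt - 1))
```

Only do this if you are sure about your credential setup: with a real permission problem, every attempt fails and the retries only delay the error.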