ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[data] AWS ACCESS_DENIED errors due to transient network issues #47230

Open raulchen opened 3 weeks ago

raulchen commented 3 weeks ago

Sometimes we get ACCESS_DENIED errors when read task concurrency is high or when the network is unstable.
This can happen even when credentials are properly set. For example, in some cases, if the AWS service doesn’t receive enough information to determine the actual error, it might default to an “ACCESS_DENIED” response.

Currently we don't retry on ACCESS_DENIED errors because we cannot distinguish transient errors from real authentication errors. In both cases, we get the same error: OSError: When getting information for key '...' in bucket '...': AWS Error ACCESS_DENIED during HeadObject operation: No response body.

When this happens, reducing concurrency may help. If you are sure about your credential setup, another solution is to manually add the ACCESS_DENIED error to the retry list.
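To illustrate what retrying on ACCESS_DENIED means in practice, here is a minimal, library-independent sketch. It does not use Ray's own retry configuration. The names `RETRIED_ERROR_SUBSTRINGS` and `call_with_retries`, the list contents other than the ACCESS_DENIED entry, and the backoff values are all assumptions made for illustration.

```python
import time

# Illustrative list of error-message substrings treated as transient.
# The ACCESS_DENIED entry is the manual addition discussed above; only add it
# if you are sure your credentials are set up correctly.
RETRIED_ERROR_SUBSTRINGS = [
    "AWS Error NETWORK_CONNECTION",
    "AWS Error SLOW_DOWN",
    "AWS Error ACCESS_DENIED",
]


def call_with_retries(
    fn,
    retried_substrings=RETRIED_ERROR_SUBSTRINGS,
    max_attempts=4,
    base_delay_s=1.0,
    sleep=time.sleep,
):
    """Call fn(), retrying with exponential backoff on matching OSErrors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except OSError as e:
            retryable = any(s in str(e) for s in retried_substrings)
            # Re-raise unknown errors immediately, and matching errors once
            # the attempt budget is exhausted.
            if not retryable or attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
```

Because the match is on the message text alone, a genuine permission problem is retried too and only surfaces after the last attempt, which is the trade-off described above.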

See comment below for a potential workaround.

raulchen commented 3 weeks ago

Another option is to retry ACCESS_DENIED for read tasks but not for metadata fetching tasks: if it's a real authentication issue, the metadata fetching tasks will raise the error first. However, this can still cause confusion if the user has permission for the directory but not for a file in it.
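A rough sketch of that stage-dependent policy, assuming the retry decision is made by matching substrings of the error message. The stage names, the two tuples, and `should_retry` are hypothetical and are not part of Ray's API.

```python
# Hypothetical substrings retried regardless of stage.
ALWAYS_RETRIED = ("AWS Error NETWORK_CONNECTION", "AWS Error SLOW_DOWN")

# Hypothetical substrings retried only once metadata fetching has succeeded.
READ_ONLY_RETRIED = ("AWS Error ACCESS_DENIED",)


def should_retry(stage: str, error_message: str) -> bool:
    """Decide whether an I/O error is retried for a 'metadata' or 'read' task."""
    if stage not in ("metadata", "read"):
        raise ValueError(f"unknown stage: {stage!r}")
    if any(s in error_message for s in ALWAYS_RETRIED):
        return True
    # ACCESS_DENIED during metadata fetching is treated as a real
    # authentication error and surfaces immediately.
    if stage == "read":
        return any(s in error_message for s in READ_ONLY_RETRIED)
    return False
```

With this policy, a file that is genuinely unreadable inside an otherwise readable directory would still be retried during the read stage before failing, which is the source of confusion mentioned above.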

raulchen commented 3 weeks ago

A more detailed explanation from ChatGPT of why this error can be mistakenly raised

The “ACCESS_DENIED” error message during network operations like HeadObject can sometimes be misleading when dealing with intermittent network issues. Here’s why this might happen:

1. Network Interruption Leading to Misinterpretation

    •   Inconsistent Connectivity: If there’s a brief network glitch or packet loss, the request might not reach the AWS server correctly, or the response might not be received properly. When the request is incomplete or garbled due to the network issue, the server might not be able to authenticate it correctly, causing it to respond with an “ACCESS_DENIED” error.
    •   Fallback Errors: In some cases, if the AWS service doesn’t receive enough information to determine the actual error, it might default to an “ACCESS_DENIED” response. This is a conservative approach to avoid accidentally exposing resources.

2. Timeouts Misinterpreted as Access Issues

    •   Timeout Handling: When a request times out due to network issues, the client library (pyarrow in this case) might interpret the lack of a proper response as an “ACCESS_DENIED” error because it didn’t receive the expected authorization confirmation from the server.
    •   Partial Responses: Sometimes, the client might receive a partial response before the connection drops. If the response lacks the necessary authentication data, it could be interpreted as an access denial rather than a network failure.

3. Boto3 or Pyarrow Error Handling

    •   Error Mapping: The boto3 or pyarrow library might map certain low-level network errors to higher-level errors like “ACCESS_DENIED” if the error occurs during a critical authentication step. This can be due to how the libraries abstract away the complexity of handling AWS responses.
    •   Inconsistent Error Messages: The error handling mechanism in these libraries may not always distinguish clearly between an access denial and a network issue, especially if the error occurs at a point where access checks are involved.

4. Load Balancer or CDN Issues

    •   AWS Infrastructure: If AWS’s load balancers or edge nodes experience brief issues, the requests might be routed in ways that cause them to fail. In such cases, the error might be incorrectly flagged as an access issue when it’s actually a transient infrastructure problem.

5. DNS Resolution Problems

    •   DNS Resolution Failures: If there’s an intermittent DNS resolution failure, the request might not reach the correct endpoint, leading to an incorrect “ACCESS_DENIED” response due to a failure in resolving the proper S3 bucket URL.

raulchen commented 3 weeks ago

Linking a related issue: https://github.com/ray-project/ray/issues/42153

scottjlee commented 2 weeks ago

The current theory behind the root cause is that the original credentials become unavailable in the middle of execution, possibly due to a pyarrow.fs bug. The suggested workaround is to explicitly define a filesystem using credentials generated with boto3, and pass it to the read method you are using. For example:

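import ray


def get_s3fs_with_boto_creds():
    import boto3
    from pyarrow import fs

    # Resolve credentials through boto3's default credential chain.
    credentials = boto3.Session().get_credentials()

    # Pass the credentials explicitly instead of letting pyarrow resolve them.
    s3fs = fs.S3FileSystem(
        access_key=credentials.access_key,
        secret_key=credentials.secret_key,
        session_token=credentials.token,
        region=...,
    )
    return s3fs


# Build the filesystem once and pass it to the read method.
fs = get_s3fs_with_boto_creds()
ds = ray.data.read_images(..., filesystem=fs)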

Potential downsides for this workaround are: