mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training
https://streaming.docs.mosaicml.com
Apache License 2.0

Failed to merge index on multiple MDS on Cloudflare R2 #589

Closed by atamano 9 months ago

atamano commented 9 months ago

Environment

To reproduce

  1. Export the R2 environment variables:

    export AWS_SECRET_ACCESS_KEY=MY_R2_SECRET
    export AWS_ACCESS_KEY_ID=MY_R2_KEY
    export S3_ENDPOINT_URL="https://XXX.r2.cloudflarestorage.com"
    export AWS_REGION="enam"
  2. Call the merge function:

from streaming.base.util import merge_index

paths = [
    "s3://my_dataset/batch1/index.json",
    "s3://my_dataset/batch2/index.json",
]
merge_index(paths, out="./out", keep_local=True)

This results in:

botocore.exceptions.ClientError: An error occurred (400) when calling the HeadObject operation: Bad Request

Expected behavior

A merged index.json should be created in ./out.

Additional context

Stacktrace:

Traceback (most recent call last):
  File "XXX/streaming/base/util.py", line 319, in _merge_index_from_list
    download_file(src, dest, download_timeout)
  File "XXX/streaming/base/storage/download.py", line 461, in download_file
    download_from_s3(remote, local, timeout)
  File "XXX/streaming/base/storage/download.py", line 116, in download_from_s3
    _download_file(unsigned=True, extra_args=extra_args)
  File "XXX/streaming/base/storage/download.py", line 75, in _download_file
    s3.download_file(obj.netloc,
  File "XXX/boto3/s3/inject.py", line 192, in download_file
    return transfer.download_file(
  File "XXX/boto3/s3/transfer.py", line 405, in download_file
    future.result()
  File "XXX/s3transfer/futures.py", line 103, in result
    return self._coordinator.result()
  File "XXX/s3transfer/futures.py", line 266, in result
    raise self._exception
  File "XXX/s3transfer/tasks.py", line 269, in _main
    self._submit(transfer_future=transfer_future, **kwargs)
  File "XXX/s3transfer/download.py", line 354, in _submit
    response = client.head_object(
  File "XXX/botocore/client.py", line 553, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "XXX/botocore/client.py", line 1009, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (400) when calling the HeadObject operation: Bad Request

According to Cloudflare's documentation, the boto3 client should be created as follows to work with R2:

import boto3

s3 = boto3.client(
    service_name="s3",
    endpoint_url="https://<accountid>.r2.cloudflarestorage.com",
    aws_access_key_id="<access_key_id>",
    aws_secret_access_key="<access_key_secret>",
    region_name="<location>",  # Must be one of: wnam, enam, weur, eeur, apac, auto
)

https://developers.cloudflare.com/r2/examples/aws/boto3/

That's not the case here:

https://github.com/mosaicml/streaming/blob/bc72659cc84e57721d36776267ab4cfa750a9ed1/streaming/base/storage/download.py#L68
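
For illustration, here is a minimal sketch of how the arguments from Cloudflare's example could be assembled and validated before creating a client. The helper name r2_client_kwargs and the default region "auto" are assumptions made for this example. They are not part of streaming or boto3. The list of valid regions is taken from the comment in Cloudflare's snippet above.

```python
# Hypothetical helper (not part of streaming or boto3): builds the keyword
# arguments that Cloudflare's example passes to boto3.client for R2.

# Region names listed in Cloudflare's boto3 example.
R2_REGIONS = frozenset({"wnam", "enam", "weur", "eeur", "apac", "auto"})


def r2_client_kwargs(endpoint_url, access_key_id, secret_access_key,
                     region_name="auto"):
    """Return kwargs for boto3.client, rejecting unknown R2 region names."""
    if region_name not in R2_REGIONS:
        raise ValueError(
            f"region_name must be one of {sorted(R2_REGIONS)}, got {region_name!r}"
        )
    return {
        "service_name": "s3",
        "endpoint_url": endpoint_url,
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
        "region_name": region_name,
    }


# Usage (requires boto3 to be installed):
#   import boto3
#   s3 = boto3.client(**r2_client_kwargs(
#       "https://<accountid>.r2.cloudflarestorage.com",
#       "<access_key_id>", "<access_key_secret>", region_name="enam"))
```

Validating the region up front turns a typo into a clear local error rather than a request that fails against the endpoint.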

xiaohanzhan-db commented 9 months ago

@atamano Thanks for bringing up the issue. We are working on refactoring the object store. So does R2 require the "region_name" parameter?

atamano commented 9 months ago

Yes, the missing parameter is region_name.

atamano commented 9 months ago

My bad: the environment variable boto3 expects is AWS_DEFAULT_REGION, not the AWS_REGION I exported in the reproduction steps. I'm closing this issue.
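
For anyone hitting the same error, a small pre-flight check along these lines could catch the mix-up before calling merge_index. This is only a sketch: the function name region_from_env is hypothetical, and it inspects environment variables only, ignoring any region set in an AWS config file.

```python
import os
from typing import Mapping, Optional


def region_from_env(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the region set via AWS_DEFAULT_REGION, or raise with a hint.

    Hypothetical helper: it only looks at environment variables and does
    not consult ~/.aws/config or any other boto3 configuration source.
    """
    env = os.environ if env is None else env
    region = env.get("AWS_DEFAULT_REGION")
    if region:
        return region
    if env.get("AWS_REGION"):
        # The situation from this issue: the region was exported under
        # AWS_REGION instead of AWS_DEFAULT_REGION.
        raise RuntimeError(
            "AWS_REGION is set but AWS_DEFAULT_REGION is not; "
            "export AWS_DEFAULT_REGION instead"
        )
    raise RuntimeError("AWS_DEFAULT_REGION is not set")


# Usage: call region_from_env() once at startup, before any download.
```

With the variables from the reproduction steps, the call raises and points at AWS_DEFAULT_REGION. After exporting AWS_DEFAULT_REGION="enam" it returns "enam".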