NoCredentialsError when accessing public S3 bucket for train data #9

Open StefanHangler opened 1 month ago

StefanHangler commented 1 month ago

Hey everyone,

First of all, thank you for the excellent work on this project! I am currently doing some tests and attempting to retrain the model from scratch to compare it with other datasets and benchmarks. Unfortunately, I encountered an issue while trying to download the 350GB dataset from your public S3 bucket. Despite the bucket supposedly being public, I am faced with a NoCredentialsError (full Error Code at the bottom).

I am curious if there have been recent changes to the data's location or the public access settings on the bucket. Any assistance or updates you could provide would be greatly appreciated.

Thank you for your help!

Best regards, Stefan Hangler

alexterray commented 1 month ago

Hey Stefan, thanks for the kind words and giving this a shot! I think I see what's up on the COATI dataset access here.

So I can confirm that the dataset is still configured to be public in our S3, but it looks like the AWS CLI (and presumably boto3) still requires credentials by default to sign all S3 requests programmatically, regardless of the permissions on the bucket/objects. Here's what I tested.

# Confirm this server does not have any AWS credentials
$ aws sts get-caller-identity
Unable to locate credentials. You can configure credentials by running "aws configure".

# And it appears to be unhappy without any credentials present
$ aws s3 cp s3://terray-public/datasets/coati_data/0.pkl .
fatal error: Unable to locate credentials

# Let's verify the bucket contents are indeed public over HTTP
$ wget https://terray-public.s3.us-west-2.amazonaws.com/datasets/coati_data/0.pkl
--2024-06-17 09:40:03--  https://terray-public.s3.us-west-2.amazonaws.com/datasets/coati_data/0.pkl
Resolving terray-public.s3.us-west-2.amazonaws.com (terray-public.s3.us-west-2.amazonaws.com)...,,, ...
Connecting to terray-public.s3.us-west-2.amazonaws.com (terray-public.s3.us-west-2.amazonaws.com)||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 74265858 (71M) [binary/octet-stream]
Saving to: ‘0.pkl’

0.pkl                   100%[=============================>]  70.83M  8.27MB/s    in 8.7s    

2024-06-17 09:40:12 (8.17 MB/s) - ‘0.pkl’ saved [74265858/74265858]
# ...and it seems to be the case.
# I can verify the HTTP URL also works on an incognito browser window
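
The same unsigned HTTPS download can also be scripted without the AWS CLI. Here is a minimal Python sketch that uses only the standard library. The helper names public_s3_url and download_public_object are made up for illustration, and the URL layout simply mirrors the wget command above.

```python
import shutil
import urllib.parse
import urllib.request


def public_s3_url(bucket, region, key):
    """Build the virtual-hosted-style HTTPS URL for a public S3 object."""
    # Percent-encode the key but keep "/" so the prefix structure survives.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{urllib.parse.quote(key)}"


def download_public_object(url, dest_path):
    """Stream a public object to disk without holding it all in memory."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)


url = public_s3_url("terray-public", "us-west-2", "datasets/coati_data/0.pkl")
print(url)
# download_public_object(url, "0.pkl")  # uncomment to fetch the ~71 MB file
```

This only works for objects that allow anonymous reads over HTTP, which the wget test suggests is the case here.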

Here are a couple of options for getting access:

  1. Run aws configure with any valid credentials you have for any AWS account. The permissions of that user should be irrelevant for the public dataset, but the credentials will at least give the client something valid to sign your requests with.
  2. Disable the boto3 signing step; how to do this seems to vary with your boto3 version. This StackOverflow answer has a few different suggestions (which may also differ depending on whether you use boto3.client or boto3.resource): https://stackoverflow.com/questions/34865927/can-i-use-boto3-anonymously/34866092#34866092

StefanHangler commented 1 month ago

Thanks for the quick and detailed response! That was very helpful.

I managed to work around the credential issue with a boto3 setup to bypass the signing step. Here’s the code snippet that ended up working for me with boto3 version 1.9.251:

import boto3

def download_public_file(bucket_name, object_key, download_path):
    # Empty placeholder credentials; the requests are never actually signed with them
    s3_client = boto3.client('s3', aws_access_key_id='', aws_secret_access_key='', region_name='us-west-2')
    # Replace the signer with a no-op so requests go out unsigned (relies on a private attribute)
    s3_client._request_signer.sign = (lambda *args, **kwargs: None)

    try:
        s3_client.download_file(bucket_name, object_key, download_path)
        print(f"Downloaded {object_key} to {download_path}")
    except Exception as e:
        print(f"Failed to download {object_key}: {e}")

# Usage
bucket_name = 'terray-public'
object_key = 'datasets/coati_data/0.pkl'
download_path = 'path/to/store/file/0.pkl'
download_public_file(bucket_name, object_key, download_path)

I also noticed that the file 0.pkl is around 75MB, which seems much smaller than expected. The dataset.py mentions that the dataset should be around 340GB. Could you confirm if 0.pkl is part of a larger set of files or if there's another location where the full dataset is stored?

Thanks again for your assistance!

Best, Stefan