Open StefanHangler opened 1 month ago
Hey Stefan, thanks for the kind words and for giving this a shot! I think I see what's up with the COATI dataset access here.
I can confirm that the dataset is still configured to be public in our S3, but it looks like the CLI (and presumably boto3) still requires credentials to sign all S3 requests programmatically, regardless of the permissions on the bucket/objects. Here's what I tested:
# Confirm this server does not have any AWS credentials
$ aws sts get-caller-identity
Unable to locate credentials. You can configure credentials by running "aws configure".
# And it appears to be unhappy without any credentials present
$ aws s3 cp s3://terray-public/datasets/coati_data/0.pkl .
fatal error: Unable to locate credentials
# Let's verify the bucket contents are indeed public over HTTP
$ wget https://terray-public.s3.us-west-2.amazonaws.com/datasets/coati_data/0.pkl
--2024-06-17 09:40:03-- https://terray-public.s3.us-west-2.amazonaws.com/datasets/coati_data/0.pkl
Resolving terray-public.s3.us-west-2.amazonaws.com (terray-public.s3.us-west-2.amazonaws.com)... 52.92.180.2, 52.218.181.25, 3.5.87.208, ...
Connecting to terray-public.s3.us-west-2.amazonaws.com (terray-public.s3.us-west-2.amazonaws.com)|52.92.180.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 74265858 (71M) [binary/octet-stream]
Saving to: ‘0.pkl’
0.pkl 100%[=============================>] 70.83M 8.27MB/s in 8.7s
2024-06-17 09:40:12 (8.17 MB/s) - ‘0.pkl’ saved [74265858/74265858]
# ...and it seems to be the case.
# I can verify the HTTP URL also works on an incognito browser window
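Since the objects are reachable over plain HTTPS, the download can also be scripted with just the Python standard library, with no AWS SDK or credentials involved. A minimal sketch, assuming the same bucket URL as the wget test above (the helper names here are illustrative, not part of any repo):

```python
from urllib.request import urlretrieve

# Public bucket endpoint, taken from the wget test above.
BASE_URL = "https://terray-public.s3.us-west-2.amazonaws.com"

def public_url(object_key: str) -> str:
    """Build the unauthenticated HTTPS URL for a public S3 object."""
    return f"{BASE_URL}/{object_key}"

def download_via_https(object_key: str, dest: str) -> None:
    # Plain HTTPS GET -- no request signing, so no credentials needed.
    urlretrieve(public_url(object_key), dest)

# Example (network access required):
# download_via_https("datasets/coati_data/0.pkl", "0.pkl")
```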
Here are a couple of options for getting access. You can run
aws configure
with any valid credentials you may have on an AWS tenant. The permissions of that user should be irrelevant for the public dataset, but they will at least provide valid data to sign your requests with.

Thanks for the quick and detailed response! That was very helpful.
I managed to work around the credential issue with a boto3 setup that bypasses the signing step. Here's the snippet that ended up working for me with boto3 version 1.9.251:
import boto3

def download_public_file(bucket_name, object_key, download_path):
    # Empty credentials plus a no-op signer make the requests anonymous.
    s3_client = boto3.client(
        's3',
        aws_access_key_id='',
        aws_secret_access_key='',
        region_name='us-west-2',
    )
    s3_client._request_signer.sign = (lambda *args, **kwargs: None)
    try:
        s3_client.download_file(bucket_name, object_key, download_path)
        print(f"Downloaded {object_key} to {download_path}")
    except Exception as e:
        print(f"Failed to download {object_key}: {e}")

# Usage
bucket_name = 'terray-public'
object_key = 'datasets/coati_data/0.pkl'
download_path = 'path/to/store/file/0.pkl'
download_public_file(bucket_name, object_key, download_path)
I also noticed that the file 0.pkl is around 75MB, which seems much smaller than expected. The dataset.py mentions that the dataset should be around 340GB. Could you confirm whether 0.pkl is part of a larger set of files, or whether there's another location where the full dataset is stored?
Thanks again for your assistance!
Best, Stefan
Hey everyone,
First of all, thank you for the excellent work on this project! I am currently running some tests and attempting to retrain the model from scratch to compare it against other datasets and benchmarks. Unfortunately, I encountered an issue while trying to download the 350GB dataset from your public S3 bucket. Despite the bucket supposedly being public, I am faced with a
NoCredentialsError
(full error output at the bottom). I am curious whether there have been recent changes to the data's location or to the public access settings on the bucket. Any assistance or updates you could provide would be greatly appreciated.
Thank you for your help!
Best regards, Stefan Hangler