terraytherapeutics / COATI

COATI: multi-modal contrastive pre-training for representing and traversing chemical space
Apache License 2.0

NoCredentialsError when accessing public S3 bucket for train data #9

Open StefanHangler opened 1 month ago

StefanHangler commented 1 month ago

Hey everyone,

First of all, thank you for the excellent work on this project! I am currently running some tests and trying to retrain the model from scratch so I can compare it with other datasets and benchmarks. Unfortunately, I ran into an issue while trying to download the 350GB dataset from your public S3 bucket. Although the bucket is supposed to be public, I get a NoCredentialsError (full traceback at the bottom).

I am curious if there have been recent changes to the data's location or the public access settings on the bucket. Any assistance or updates you could provide would be greatly appreciated.

Thank you for your help!

Best regards, Stefan Hangler

Cell In[7], line 12
      8 print(bucket.objects.all())
     10 bucket_dir = "datasets/coati/"
---> 12 nfiles = len(list(bucket.objects.filter(Prefix=bucket_dir)))
     14 print(nfiles)
     16 download_from_s3("s3://terray-public/datasets/coati_data/")

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/boto3/resources/collection.py:81, in ResourceCollection.__iter__(self)
     78 limit = self._params.get('limit', None)
     80 count = 0
---> 81 for page in self.pages():
     82     for item in page:
     83         yield item

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/boto3/resources/collection.py:171, in ResourceCollection.pages(self)
    168 # Now that we have a page iterator or single page of results
    169 # we start processing and yielding individual items.
    170 count = 0
--> 171 for page in pages:
    172     page_items = []
    173     for item in self._handler(self._parent, params, page):

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/botocore/paginate.py:269, in PageIterator.__iter__(self)
    267 self._inject_starting_params(current_kwargs)
    268 while True:
--> 269     response = self._make_request(current_kwargs)
    270     parsed = self._extract_parsed_response(response)
    271     if first_request:
    272         # The first request is handled differently.  We could
    273         # possibly have a resume/starting token that tells us where
    274         # to index into the retrieved page.

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/botocore/paginate.py:357, in PageIterator._make_request(self, current_kwargs)
    356 def _make_request(self, current_kwargs):
--> 357     return self._method(**current_kwargs)

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/botocore/client.py:565, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
    561     raise TypeError(
    562         f"{py_operation_name}() only accepts keyword arguments."
    563     )
    564 # The "self" in this scope is referring to the BaseClient.
--> 565 return self._make_api_call(operation_name, kwargs)

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/botocore/client.py:1001, in BaseClient._make_api_call(self, operation_name, api_params)
    997     maybe_compress_request(
    998         self.meta.config, request_dict, operation_model
    999     )
   1000     apply_request_checksum(request_dict)
-> 1001     http, parsed_response = self._make_request(
   1002         operation_model, request_dict, request_context
   1003     )
   1005 self.meta.events.emit(
   1006     'after-call.{service_id}.{operation_name}'.format(
   1007         service_id=service_id, operation_name=operation_name
   (...)
   1012     context=request_context,
   1013 )
   1015 if http.status_code >= 300:

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/botocore/client.py:1027, in BaseClient._make_request(self, operation_model, request_dict, request_context)
   1025 def _make_request(self, operation_model, request_dict, request_context):
   1026     try:
-> 1027         return self._endpoint.make_request(operation_model, request_dict)
   1028     except Exception as e:
   1029         self.meta.events.emit(
   1030             'after-call-error.{service_id}.{operation_name}'.format(
   1031                 service_id=self._service_model.service_id.hyphenize(),
   (...)
   1035             context=request_context,
   1036         )

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/botocore/endpoint.py:119, in Endpoint.make_request(self, operation_model, request_dict)
    113 def make_request(self, operation_model, request_dict):
    114     logger.debug(
    115         "Making request for %s with params: %s",
    116         operation_model,
    117         request_dict,
    118     )
--> 119     return self._send_request(request_dict, operation_model)

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/botocore/endpoint.py:198, in Endpoint._send_request(self, request_dict, operation_model)
    196 context = request_dict['context']
    197 self._update_retries_context(context, attempts)
--> 198 request = self.create_request(request_dict, operation_model)
    199 success_response, exception = self._get_response(
    200     request, operation_model, context
    201 )
    202 while self._needs_retry(
    203     attempts,
    204     operation_model,
   (...)
    207     exception,
    208 ):

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/botocore/endpoint.py:134, in Endpoint.create_request(self, params, operation_model)
    130     service_id = operation_model.service_model.service_id.hyphenize()
    131     event_name = 'request-created.{service_id}.{op_name}'.format(
    132         service_id=service_id, op_name=operation_model.name
    133     )
--> 134     self._event_emitter.emit(
    135         event_name,
    136         request=request,
    137         operation_name=operation_model.name,
    138     )
    139 prepared_request = self.prepare_request(request)
    140 return prepared_request

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/botocore/hooks.py:412, in EventAliaser.emit(self, event_name, **kwargs)
    410 def emit(self, event_name, **kwargs):
    411     aliased_event_name = self._alias_event_name(event_name)
--> 412     return self._emitter.emit(aliased_event_name, **kwargs)

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/botocore/hooks.py:256, in HierarchicalEmitter.emit(self, event_name, **kwargs)
    245 def emit(self, event_name, **kwargs):
    246     """
    247     Emit an event by name with arguments passed as keyword args.
    248 
   (...)
    254              handlers.
    255     """
--> 256     return self._emit(event_name, kwargs)

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/botocore/hooks.py:239, in HierarchicalEmitter._emit(self, event_name, kwargs, stop_on_response)
    237 for handler in handlers_to_call:
    238     logger.debug('Event %s: calling handler %s', event_name, handler)
--> 239     response = handler(**kwargs)
    240     responses.append((handler, response))
    241     if stop_on_response and response is not None:

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/botocore/signers.py:105, in RequestSigner.handler(self, operation_name, request, **kwargs)
    100 def handler(self, operation_name=None, request=None, **kwargs):
    101     # This is typically hooked up to the "request-created" event
    102     # from a client's event emitter.  When a new request is created
    103     # this method is invoked to sign the request.
    104     # Don't call this method directly.
--> 105     return self.sign(operation_name, request)

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/botocore/signers.py:199, in RequestSigner.sign(self, operation_name, request, region_name, signing_type, expires_in, signing_name)
    196     else:
    197         raise e
--> 199 auth.add_auth(request)

File ~/opt/anaconda3/envs/coatiEnv/lib/python3.9/site-packages/botocore/auth.py:418, in SigV4Auth.add_auth(self, request)
    416 def add_auth(self, request):
    417     if self.credentials is None:
--> 418         raise NoCredentialsError()
    419     datetime_now = datetime.datetime.utcnow()
    420     request.context['timestamp'] = datetime_now.strftime(SIGV4_TIMESTAMP)

NoCredentialsError: Unable to locate credentials

alexterray commented 1 month ago

Hey Stefan, thanks for the kind words and giving this a shot! I think I see what's up on the COATI dataset access here.

So I can confirm that the dataset is still configured to be public in our S3, but it looks like the AWS CLI (and presumably boto3) still require credentials by default to sign all S3 requests made programmatically, regardless of the permissions on the bucket/objects. Here's what I tested.

# Confirm this server does not have any AWS credentials
$ aws sts get-caller-identity
Unable to locate credentials. You can configure credentials by running "aws configure".

# And it appears to be unhappy without any credentials present
$ aws s3 cp s3://terray-public/datasets/coati_data/0.pkl .
fatal error: Unable to locate credentials

# Let's verify the bucket contents are indeed public over HTTP
$ wget https://terray-public.s3.us-west-2.amazonaws.com/datasets/coati_data/0.pkl
--2024-06-17 09:40:03--  https://terray-public.s3.us-west-2.amazonaws.com/datasets/coati_data/0.pkl
Resolving terray-public.s3.us-west-2.amazonaws.com (terray-public.s3.us-west-2.amazonaws.com)... 52.92.180.2, 52.218.181.25, 3.5.87.208, ...
Connecting to terray-public.s3.us-west-2.amazonaws.com (terray-public.s3.us-west-2.amazonaws.com)|52.92.180.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 74265858 (71M) [binary/octet-stream]
Saving to: ‘0.pkl’

0.pkl                   100%[=============================>]  70.83M  8.27MB/s    in 8.7s    

2024-06-17 09:40:12 (8.17 MB/s) - ‘0.pkl’ saved [74265858/74265858]
# ...and it seems to be the case.
# I can verify the HTTP URL also works on an incognito browser window
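
The wget check above can also be reproduced from Python using only the standard library, since the object is served over plain HTTPS. This is a minimal sketch, not part of the COATI code base: the helper names `public_s3_url` and `download_over_https` are hypothetical, and the URL layout is simply the one used in the wget command (bucket name, then `s3.us-west-2.amazonaws.com`, then the object key).

```python
import shutil
from urllib.parse import quote
from urllib.request import urlopen


def public_s3_url(bucket, key, region="us-west-2"):
    """Build the HTTPS URL for a public object (hypothetical helper)."""
    # quote() leaves "/" untouched, so the key's path structure is preserved.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


def download_over_https(bucket, key, dest, region="us-west-2"):
    """Stream a public object to a local file without any AWS credentials."""
    url = public_s3_url(bucket, key, region)
    with urlopen(url) as response, open(dest, "wb") as out:
        # Copy in chunks so large files are not held in memory.
        shutil.copyfileobj(response, out)
    return dest
```

For example, `download_over_https("terray-public", "datasets/coati_data/0.pkl", "0.pkl")` would request the same URL as the wget command.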

Here are a couple of options for getting access:

  1. Run aws configure with any valid credentials you have for an AWS account. That user's permissions should be irrelevant for the public dataset, but the credentials will at least give the client something valid to sign your requests with.
  2. Disable the boto3 signing step; how to do this seems to vary depending on your version. This StackOverflow answer has a few different suggestions (which may also differ depending on whether you use boto3.client or boto3.resource): https://stackoverflow.com/questions/34865927/can-i-use-boto3-anonymously/34866092#34866092

StefanHangler commented 1 month ago

Thanks for the quick and detailed response! That was very helpful.

I managed to work around the credential issue with a boto3 setup to bypass the signing step. Here’s the code snippet that ended up working for me with boto3 version 1.9.251:

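import boto3

def download_public_file(bucket_name, object_key, download_path):
    """Download one object from a public bucket without signing the request."""
    # Placeholder (empty) credentials; they are never used because signing is disabled below.
    s3_client = boto3.client('s3', aws_access_key_id='', aws_secret_access_key='', region_name='us-west-2')
    # Workaround: replace the signer with a no-op. This relies on a private attribute,
    # so it may stop working in other boto3 versions.
    s3_client._request_signer.sign = (lambda *args, **kwargs: None)

    try:
        s3_client.download_file(bucket_name, object_key, download_path)
        print(f"Downloaded {object_key} to {download_path}")
    except Exception as e:
        print(f"Failed to download {object_key}: {e}")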

# Usage
bucket_name = 'terray-public'
object_key = 'datasets/coati_data/0.pkl'
download_path = 'path/to/store/file/0.pkl'
download_public_file(bucket_name, object_key, download_path)

I also noticed that the file 0.pkl is around 75MB, which seems much smaller than expected. The dataset.py mentions that the dataset should be around 340GB. Could you confirm if 0.pkl is part of a larger set of files or if there's another location where the full dataset is stored?

Thanks again for your assistance!

Best, Stefan
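
The size question above could be checked by listing everything under the `datasets/coati_data/` prefix and summing the object sizes. The following is a sketch under assumptions: the helper names are hypothetical, and anonymous listing only works if the bucket policy allows public list access, which this thread does not confirm.

```python
def summarize_listing(pages):
    """Return (object_count, total_bytes) for ListObjectsV2 response pages."""
    count = 0
    total = 0
    for page in pages:
        # Pages with no matching objects have no "Contents" key.
        for obj in page.get("Contents", []):
            count += 1
            total += obj["Size"]
    return count, total


def iter_public_pages(bucket_name, prefix, region_name="us-west-2"):
    """Yield ListObjectsV2 pages for a prefix using unsigned requests."""
    # Imported here so summarize_listing can be used without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    client = boto3.client(
        "s3",
        region_name=region_name,
        config=Config(signature_version=UNSIGNED),
    )
    paginator = client.get_paginator("list_objects_v2")
    yield from paginator.paginate(Bucket=bucket_name, Prefix=prefix)
```

Running `summarize_listing(iter_public_pages("terray-public", "datasets/coati_data/"))` would return the number of files and their combined size in bytes, which can be compared against the roughly 340GB mentioned in dataset.py.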