liormizr / s3path

s3path is a pathlib extension for AWS S3 Service
Apache License 2.0

Glob-related error appeared on 0.3.3 #97

Closed. avibrazil closed this issue 2 years ago

avibrazil commented 2 years ago

0.3.2 was working fine until I upgraded to 0.3.3, and then the following error started to happen. I don't know why or where the problem is, but going back to 0.3.2 fixes it.

  File "/home/sagemaker-user/Kroft/my_avm/avm/avm.py", line 1063, in load
    available=[f for f in list(resolved_path.glob(filename)) if '.dvc' not in str(f)]
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 706, in glob
    yield from super().glob(pattern)
  File "/home/sagemaker-user/.pyenv/versions/3.9.9/lib/python3.9/pathlib.py", line 1177, in glob
    for p in selector.select_from(self):
  File "/home/sagemaker-user/.pyenv/versions/3.9.9/lib/python3.9/pathlib.py", line 523, in select_from
    if not is_dir(parent_path):
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 678, in is_dir
    return self._accessor.is_dir(self)
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 169, in is_dir
    resource, _ = self.configuration_map.get_configuration(path)
  File "/home/sagemaker-user/.local/lib/python3.9/site-packages/s3path.py", line 90, in get_configuration
    if resources is None and path in self.resources:
TypeError: argument of type 'NoneType' is not iterable

The filename variable contains something like random • * • *.pkl*. I don't think this is relevant, but this code is a static method of a class and was running twice, in two parallel threads; the first thread succeeded and the second failed with the above error.
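
For context, the failing check in the traceback (`if resources is None and path in self.resources:`) points at a lazy-initialisation race: `self.resources` starts out as None and is only populated on first use, so a thread that reaches the membership test before another thread's initialisation has finished sees None. A hypothetical sketch of that kind of race (not s3path's actual code) that raises the same TypeError:

import threading
import time

class ConfigurationMap:
    """Hypothetical illustration only -- not s3path's actual code."""

    def __init__(self):
        self.arguments = None   # both populated together on first use
        self.resources = None

    def _set_defaults(self):
        self.arguments = {}
        time.sleep(0.01)        # widen the race window for the demo
        self.resources = {}

    def get_configuration(self, path):
        # Thread A passes this guard and starts _set_defaults(). Once A has
        # set self.arguments, thread B sees the guard as False, skips the
        # setup, and hits `path in self.resources` while it is still None:
        #   TypeError: argument of type 'NoneType' is not iterable
        if self.arguments is None and self.resources is None:
            self._set_defaults()
        return path in self.resources

cfg = ConfigurationMap()
threads = [threading.Thread(target=cfg.get_configuration, args=('s3://x',)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()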

liormizr commented 2 years ago

@avibrazil thanks for the issue

How can I reproduce it? Can you please give me a line to run or a small script?

sbrandtb commented 2 years ago

Same issue here, also threading-related. Relevant stack trace:

  File "/usr/local/lib/python3.9/pathlib.py", line 1264, in write_bytes
    with self.open(mode='wb') as f: 
  File "/usr/local/lib/python3.9/site-packages/s3path.py", line 721, in open
    return self._accessor.open(         
  File "/usr/local/lib/python3.9/site-packages/s3path.py", line 205, in open
    resource, config = self.configuration_map.get_configuration(path)
  File "/usr/local/lib/python3.9/site-packages/s3path.py", line 90, in get_configuration
    if resources is None and path in self.resources:                                                      
TypeError: argument of type 'NoneType' is not iterable   

The simplified code goes like this:

from pathlib import Path
from s3path import S3Path

src = Path('some-local')
dst = S3Path.from_uri('s3://something') / 'foo.csv'
dst.write_bytes(src.read_bytes())

And run this massively in parallel. Today I found out that creating boto3 clients is not thread-safe, and I changed my own code accordingly. s3path creates its own boto3 resource in the class _S3Accessor; maybe that is the culprit.
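
For reference, boto3's guidance for multi-threaded code is to give each thread its own Session rather than creating clients or resources concurrently from a shared one. A minimal sketch of that workaround (the helper name is hypothetical, and this is independent of s3path):

import threading

import boto3

_local = threading.local()

def thread_local_s3():
    # Each thread builds its own Session and S3 resource the first time it
    # needs one, so no two threads ever construct boto3 objects concurrently
    # from the same session.
    if not hasattr(_local, 's3'):
        _local.s3 = boto3.session.Session().resource('s3')
    return _local.s3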

sbrandtb commented 2 years ago

I submitted a fix. Since I understand the issue better now, here is a workaround until a new version is released: use any feature of the library, without threads, at the beginning of your application. That makes the library initialise before the concurrent access that causes the exception. For example:

S3Path.from_uri('s3://some-bucket-you-have-access-to/').exists()

There are probably other methods that are even shorter, but this one works.

avibrazil commented 2 years ago

I’m not sure I got it.

You mean, call the lib somehow before any threading and then the lib can be used later in threading contexts?

Is this the temporary fix?

sbrandtb commented 2 years ago

@avibrazil Sorry, I realise I was not clear :disappointed:

Take the example from the pull request:

from multiprocessing.pool import ThreadPool

from s3path import S3Path

def do(i):
    dst = S3Path.from_uri('s3://something') / f'hello_{i}.txt'
    dst.write_bytes(b'hello')

# This line will "fix" the issue because it does ultimately call
# _S3ConfigurationMap.get_configuration()
S3Path.from_uri('s3://some-bucket-you-have-access-to/').exists()

# Now, do your threaded stuff.
ThreadPool(processes=20).map(do, range(100))

The thing is, it needs to call _S3ConfigurationMap.get_configuration(). Pardon me for not compiling a list of all the functions that do this; my time is limited :slightly_smiling_face: With the code above you can experiment and find one that also works if the provided one is not suitable for you, e.g. because it requires a permission you do not have.

EDIT: The important part is calling exists(). How you construct S3Path does not matter.

liormizr commented 2 years ago

Hi all,

I merged the fix. I'll update this issue when I create a new release.

Thanks!

sbrandtb commented 2 years ago

@liormizr Anything we can help with? This issue is blocking my upgrade to Python 3.10, so I am eager to see the new release :slightly_smiling_face:

liormizr commented 2 years ago

@sbrandtb will it help if I deploy a new version now with only this fix?

sbrandtb commented 2 years ago

@liormizr yes :slightly_smiling_face:

liormizr commented 2 years ago

Version 0.3.4 is deployed to PyPI with only this fix. Sorry for the delay.

I'll update tomorrow when I deploy to Conda.

liormizr commented 2 years ago

Version deployed to Conda

avibrazil commented 2 years ago

Oh, that's great! I'll test it later. But now that it's published, maybe it's time to close this bug.

sbrandtb commented 2 years ago

Update: Unrelated. This issue was related to this workaround

After updating and running my project in our CI, I am seemingly randomly getting

ValueError: the bucket 'xxx does not exist, or is forbidden for access (ClientError('An error occurred (InvalidAccessKeyId) when calling the CreateMultipartUpload operation: The AWS Access Key Id you provided does not exist in our records.'))

I did not observe this while testing the change locally, and I am not sure whether it is connected to another change made while migrating to Python 3.10 and updating packages, but I want to drop this message here so that anyone else who encounters it can +1. After all, it looks like some other non-deterministic issue.