liormizr / s3path

s3path is a pathlib extension for AWS S3 Service
Apache License 2.0

glob is inefficient as it iterates a dir that was already scanned #85

Closed LeoQuote closed 1 year ago

LeoQuote commented 3 years ago

I tried the glob method and found it is too slow when there are millions of files in the directory.

It turns out that the glob method first calls the list_objects_v2 API to get all entries (every single one, folders and files alike), checks each entry to see whether it is a folder, and then scans the folders.

The algorithm is correct on a traditional filesystem but inefficient on S3: S3 already returns every object when the list_objects_v2 API is called, so iterating subfolders is unnecessary.

Is it possible to fix this in s3path, or can it only be fixed in pathlib?
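
For a sense of why the flat listing wins, here is a minimal sketch (hypothetical bucket and prefix names; pagination is omitted from the recursive variant for brevity) contrasting the two access patterns: a pathlib-style walk issues one list_objects_v2 request per "directory", while a single flat listing already returns every key.

    import boto3

    client = boto3.client('s3')

    def walk_style(bucket, prefix=''):
        """One API request per folder, recursing into each CommonPrefix."""
        resp = client.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter='/')
        for obj in resp.get('Contents', []):
            yield obj['Key']
        for sub in resp.get('CommonPrefixes', []):
            yield from walk_style(bucket, sub['Prefix'])  # one more request per folder

    def flat_style(bucket, prefix=''):
        """A single paginated listing covers the whole subtree."""
        paginator = client.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get('Contents', []):
                yield obj['Key']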

LeoQuote commented 3 years ago

Hmmm, I checked the list_objects_v2 API again and found that it only produces folders when scanning a folder for the first time. This is a long-standing problem....

LeoQuote commented 3 years ago

I found that if I set Delimiter to '', all contents of the folder will be returned. Maybe we can use this to make it more efficient.

https://github.com/liormizr/s3path/blob/a20403d49450987d0dc3955df1c4db2d2968bb7e/s3path.py#L115-L117
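
For illustration, a hedged sketch of what the two Delimiter settings return (the bucket layout below is hypothetical):

    import boto3

    client = boto3.client('s3')

    # With Delimiter='/', keys in subfolders are rolled up into CommonPrefixes:
    resp = client.list_objects_v2(Bucket='my-bucket', Prefix='logs/', Delimiter='/')
    # resp.get('Contents')       -> objects directly under logs/
    # resp.get('CommonPrefixes') -> e.g. [{'Prefix': 'logs/2021/'}, ...]

    # With Delimiter='' (or omitted), every key in the subtree comes back flat:
    resp = client.list_objects_v2(Bucket='my-bucket', Prefix='logs/', Delimiter='')
    # resp.get('Contents')       -> every object under logs/, at any depth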

LeoQuote commented 3 years ago

I replaced the glob method with:

    # Note: needs `from fnmatch import fnmatch` at module level.
    def glob(self, pattern):
        bucket_name = self.bucket
        resource, _ = self._accessor.configuration_map.get_configuration(self)
        if not bucket_name:
            # No bucket in the path: glob over bucket names instead of keys.
            for bucket in resource.buckets.filter(Prefix=str(self)):
                yield S3Path(bucket)
            return
        bucket = resource.Bucket(bucket_name)

        kwargs = {
            'Bucket': bucket_name,
            'Prefix': self._accessor.generate_prefix(self),
            # Empty Delimiter: one flat listing of the whole subtree,
            # no per-folder recursion.
            'Delimiter': ''}
        continuation_token = None
        while True:
            if continuation_token:
                kwargs['ContinuationToken'] = continuation_token
            response = bucket.meta.client.list_objects_v2(**kwargs)
            # 'Contents' is absent when the prefix matches nothing.
            for file in response.get('Contents', []):
                file_path = S3Path(f"/{bucket_name}/{file['Key']}")
                if fnmatch(str(file_path.relative_to(self)), pattern):
                    yield file_path
            if not response.get('IsTruncated'):
                break
            continuation_token = response.get('NextContinuationToken')

and it worked! But only for files, not for folders.
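
Since S3 has no real directories (a "folder" is just a key prefix), one way to make a flat listing also yield folder matches is to synthesize every parent prefix from the returned keys. A rough sketch with a hypothetical helper name; the synthetic prefixes could then go through the same fnmatch check as the file keys:

    def keys_with_parent_prefixes(keys):
        """Yield each key, plus every synthetic parent 'folder' prefix, deduplicated."""
        seen = set()
        for key in keys:
            yield key
            parts = key.split('/')[:-1]
            for depth in range(1, len(parts) + 1):
                prefix = '/'.join(parts[:depth]) + '/'
                if prefix not in seen:
                    seen.add(prefix)
                    yield prefix

    # keys_with_parent_prefixes(['a/b/c.html']) yields:
    #   'a/b/c.html', 'a/', 'a/b/'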

liormizr commented 3 years ago

Hi @LeoQuote, Sorry for the delay ...

Indeed, the glob method isn't optimized at this time. Right now the S3Path glob method uses the default pathlib algorithm, so we have room to optimize.

Now, about your suggestion: the pathlib glob is able to find folders. For example:

from pathlib import Path

for path in Path.home().glob('*/'):
    print(path)

We need to come up with an implementation that keeps the pathlib behaviour (finds folders and files...).

Do you want to keep working on it, or should I take it from here?

LeoQuote commented 3 years ago

Hmmm, I don't know. For me it's just that I need all files like "*.html" in all subdirectories. If I used the regular pathlib style and walked through all directories, it would be expensive.

I think we can add some S3-only method, like s3path.s3glob, to indicate that this method does not return exactly what pathlib would.

You can take it from here.

liormizr commented 3 years ago

OK I'll work on something and update with progress

four43 commented 3 years ago

Does it suck up an entire file list before globbing?

This is from pathlib.py in the Python stdlib:

def _select_from(self, parent_path, is_dir, exists, scandir):
    try:
        with scandir(parent_path) as scandir_it:
            entries = list(scandir_it)  # the whole directory listing is materialized here
        for entry in entries:
liormizr commented 3 years ago

@four43 yes, you are right. One of the optimizations that I want to do is to remove this list creation in the S3 implementation.
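
For illustration only (not the actual s3path internals), a generator version of such a listing streams entries page by page instead of holding everything in memory:

    import boto3

    def iter_keys(bucket, prefix=''):
        """Stream keys lazily, one page of list_objects_v2 results at a time."""
        paginator = boto3.client('s3').get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get('Contents', []):
                yield obj['Key']  # consumed as soon as its page arrives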

four43 commented 3 years ago

Oh man, python why!? That's a mega bummer!

liormizr commented 3 years ago

If you want to contribute to stdlib... ;-)

https://docs.python.org/3/library/os.html#os.scandir.close Maybe in a filesystem context they need to clean up? Not sure ...

four43 commented 3 years ago

Yeah, they're using a context manager, which is fine, so it's closing and stuff. I'm guessing it's an optimization to open it once, rip everything into memory quickly, then release it? That also seems like it would be bad for remote-mounted file systems like NFS and friends.

Someone was trying to make stuff faster here: https://www.python.org/dev/peps/pep-0471/

liormizr commented 1 year ago

Released in version 0.4.0.
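
For anyone landing here later, the use case from this thread should now work through the standard glob call (bucket name below is hypothetical):

    from s3path import S3Path

    # Recursively match all .html objects in the bucket
    for path in S3Path('/my-bucket').glob('**/*.html'):
        print(path)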

LeoQuote commented 1 year ago

Thanks for @liormizr's work!