Closed LeoQuote closed 1 year ago
hmmm I checked the list_object_v2 api again and found that it only produce folders when scan folder for the first time, this is a long last problem....
I found that if I set Delimiter to '', all contents of the folder will be returned. maybe we can use this to make it more efficient.
https://github.com/liormizr/s3path/blob/a20403d49450987d0dc3955df1c4db2d2968bb7e/s3path.py#L115-L117
I replaced glob method to
def glob(self, pattern):
bucket_name = self.bucket
resource, _ = self._accessor.configuration_map.get_configuration(self)
if not bucket_name:
for bucket in resource.buckets.filter(Prefix=str(self)):
yield S3Path(bucket)
return
bucket = resource.Bucket(bucket_name)
kwargs = {
'Bucket': bucket_name,
'Prefix': self._accessor.generate_prefix(self),
'Delimiter': ''}
continuation_token = None
while True:
if continuation_token:
kwargs['ContinuationToken'] = continuation_token
response = bucket.meta.client.list_objects_v2(**kwargs)
for file in response['Contents']:
file_path = S3Path(f"/{bucket_name}/{file['Key']}")
if fnmatch(str(file_path.relative_to(self)), pattern):
yield file_path
if not response.get('IsTruncated'):
break
continuation_token = response.get('NextContinuationToken')
and it worked! but only for files not for folders.
Hi @LeoQuote, Sorry for the delay ...
Indeed the glod method isn't optimized at this time. Right now S3Path glob method is using the default pathlib algorithm. So we have room to optimize.
now about your suggestion, the pathlib glob is able to find folders. For example:
for path in Path.home().glob('*/'):
print(path)
we need to come up with an implementation that will keep the pathlib behaviour (find folders and files...)
Do you want to keep working on it? os should I take it from here
hmmm, I dont know, it's just for me I need all files like "*.html" in all subdirectories. If I use regular pathlib style and walk through all directories, it would be expensive.
I think we can add some s3-only method like, s3path.s3glob
to indicate that this method does not return exactly like pahtlib
.
You can take it from here.
OK I'll work on something and update with progress
Does it suck up an entire file list before globbing?
This is from pathlib.py
in the python stdlib:
def _select_from(self, parent_path, is_dir, exists, scandir):
try:
with scandir(parent_path) as scandir_it:
entries = list(scandir_it)
for entry in entries:
@four43 yes, you are right One of the optimizations that I want to do is remove this list creation in the s3 implementation
Oh man, python why!? That's a mega bummer!
If you want to contribute to stdlib... ;-)
https://docs.python.org/3/library/os.html#os.scandir.close Maybe in filesystem context thy need to cleanup? Not sure ...
Yeah they're using a context manager, which is fine, so it's closing and stuff. I'm guessing it's an optimization to open it once, rip everything into memory quick, then release it? That also seems like it would be bad for remote mounted file systems like NFS and friends.
Someone was trying to make stuff faster here: https://www.python.org/dev/peps/pep-0471/
Release in Version 0.4.0
Thanks for @liormizr 's work !
I tried glob method and found it is too slow when there're millions of files in the directory.
turns out that the glob method will first call
list_objects_v2
api first, get all files (every single file including folders and files), identify all files to see if they are folders. and then scan the folders.The algorighm is corret in traditional fs, while inefficient in s3, s3 will return every object when requesting
list_objects_v2
api, iterating subfolders are unneccessary.Is that possible to fix it in s3path or it can only be fixed in pathlib ?