liormizr / s3path

s3path is a pathlib extension for AWS S3 Service
Apache License 2.0
206 stars 39 forks source link

New Glob Algorithm Doesn't Work on Non-Top-Level Directories #115

Closed aperiodic closed 1 year ago

aperiodic commented 1 year ago

It appears that the new glob algorithm in S3Path 0.4.0 doesn't work if the directory on which .glob() is being called is nested within a bucket, rather than being at the top level of a bucket. Depending on how much nesting is involved, either an error is raised, or the glob incorrectly returns no results.

Steps to Reproduce

  1. Create a file in an S3 bucket contained within a nested directory. For example, I used this file:
    % aws s3 ls s3://my-bucket/s3path-test/nested/test.txt
    2023-01-05 15:00:46         13 test.txt
  2. Use the following code to create an S3Path instance representing the directory containing the file:
    
    import s3path

nested_dir = s3path.S3Path("/my-bucket/s3path-test/nested")

3. Call `nested_dir.glob("*")`

### Expected Result
The `.glob()` method returns an iterator that produces exactly one S3Path instance, representing the enclosed `test.txt` file.

### Actual Result
An `IndexError: pop from an empty deque` error is raised, with the following stacktrace:

```python
File venv/lib/python3.8/site-packages/s3path.py:905, in S3Path.glob(self, pattern)
    903     yield from super().glob(pattern)
    904     return
--> 905 yield from self._glob(pattern)

File venv/lib/python3.8/site-packages/s3path.py:920, in S3Path._glob(self, pattern)
    918         raise ValueError("Invalid pattern: '**' can only be an entire path component")
    919 selector = _Selector(self, pattern=pattern)
--> 920 yield from selector.select()

File venv/lib/python3.8/site-packages/s3path.py:632, in _Selector.select(self)
    631 def select(self):
--> 632     for target in self._deep_cached_dir_scan():
    633         target = self._path._flavour.sep.join(('', self._path.bucket, target))
    634         if self.match(target):

File venv/lib/python3.8/site-packages/s3path.py:694, in _Selector._deep_cached_dir_scan(self)
    692 if self._target_level is None or self._target_level == index:
    693     yield target_path
--> 694 cache.add(target_path_parts, target_path)

File venv/lib/python3.8/site-packages/s3path.py:718, in _DeepDirCache.add(self, directory_parts, directory)
    716         tree.clear()
    717         for _ in range(deep_count):
--> 718             self._queue.pop()
    719     tree[part] = {}
    720 self._queue.append(directory)

IndexError: pop from an empty deque

If you repeat the reproduction steps using another level of nesting, e.g. after creating the S3 file s3://my-bucket/s3path-test/nested/further/test.txt, then instead of a stack trace being raised, the iterator is incorrectly empty:

doubly_nested_dir = s3path.S3Path("/my-bucket/s3path-test/nested/further")
list(doubly_nested_dir.glob("*"))
>>> []

Workaround

This can be avoided by configuring s3path to use the old glob algorithm, before creating any S3Path instances. If I import s3path, call:

s3path.register_configuration_parameter(s3path.PureS3Path("/"), glob_new_algorithm=False)

and then follow the reproduction steps above, I get the correct result. This makes me confident that the problem is entirely specific to the new globbing algorithm.

I hope this is useful! Please let me know if you'd like any more information from me about this.

Finally, thank you for creating & maintaining S3Path! It's been very nice for me and my team at work, by allowing us to write code that deals only with the Path abstraction, and be able to use that code both for local paths or S3 ones.

liormizr commented 1 year ago

interesting We have a test that check it and pass I see know that if I add only nested folder it works :man_facepalming: OK checking ...

aperiodic commented 1 year ago

Yeah, before I opened this issue I took a look at the tests and observed that glob test only calls .glob() on a top-level directory S3Path instance, and the rest of the calls are on instances corresponding to the bucket itself, so the test doesn't cover this case.

liormizr commented 1 year ago

Version 0.4.1 deployed with a fix and another test to check this scenario

aperiodic commented 1 year ago

Thank you! We've confirmed that 0.4.1 fixes this issue for us. I'll mark this issue as resolved.