Problem with the cache in `GDriveFileSystem.find`

simone-viozzi commented 2 years ago

While working on https://github.com/iterative/PyDrive2/pull/222, I discovered that find has a bug with the cache.

Let assume self.path=root/tmp/ and a folder structure like:

root/tmp/
        └── fo1/
           ├── file2.pdf
           └── fo2/
              ├── file3.pdf
              └── fo3/
                 └── file4.pdf

now let's do some tests:

f = fs.find('root/tmp/fo1/')
print(f)
> ['root/tmp/fo1/file2.pdf', 'root/tmp/fo1/fo2/file3.pdf', 'root/tmp/fo1/fo2/fo3/file4.pdf']

f = fs.find('root/tmp/fo1/fo2')
print(f)
> ['root/tmp/fo1/fo2/fo3/file4.pdf', 'root/tmp/fo1/fo2/file3.pdf']

and that is correct, but if we do only find('root/tmp/fo1/fo2'):

f = fs.find('root/tmp/fo1/fo2')
print(f)
>[]

This happens because find relay on the cache, and at the start the cache is only populated with ids from one level down self.path

so in the last example, the content of the cache is:

{
'1IETDYYj23PgGaInZofa9MyANyBlOoiyh': 'tmp', 
'1k6u2-FStB6rOlq6hmDXlRl2aLES1l6vp': 'tmp/fo1', 
}

I think, because there is no tmp/fo1/fo2 (the starting path of find), query_ids stays empty and the method return an empty list.

The lines of code involved are: https://github.com/iterative/PyDrive2/blob/27bbf4c8b4ef7fc3cce97f645b0e79d6795bf73c/pydrive2/fs/spec.py#L469-L483

shcheklein commented 2 years ago

@simone-viozzi yes, thanks for creating the ticket. find is deeply broken, unfortunately. It was originally implemented to serve DVC needs only and we never had time to get back and fix it properly :(

simone-viozzi commented 2 years ago

Any idea of a roadmap to fix it? With this broken https://github.com/iterative/PyDrive2/pull/222 is pretty much useless, it will work unexpectedly 90% of the time.

I could try, but I still don't understand how the cache works right now, and how it should work. If I had to implement the cache mechanism from scratch, I would most likely use something like this

shcheklein commented 2 years ago

With this broken https://github.com/iterative/PyDrive2/pull/222 is pretty much useless, it will work unexpectedly 90% of the time.

Hmm, I think cp doesn't depend on find. Cache itself is not broken, it's find implementation is bad (not general). I think tbh, that we could move forward and merge w/o cache. Especially the cp one.

simone-viozzi commented 2 years ago

copy uses expand_path, which uses find. https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L877 https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L908

cp_file is fine, though, but works for just files.

shcheklein commented 2 years ago

okay, problems that I see:

[ ] maxdepth is not supported
[ ] withdirs is not supported (must have for expand path)
[ ] doesn't check bucket name find('.') and find('etasderwer') return the same result
[ ] it keeps appending ids to cache every time even for already known
[ ] cache is not locked (multithreading support) - this is a bit more advanced and less critical to start with

simone-viozzi commented 2 years ago

Hey @shcheklein, did you know that find is already implemented by ffspec and it uses walk: https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L423-L453

And also walk is already implemented by ffspec using just ls: https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L364-L421

shcheklein commented 2 years ago

@simone-viozzi yes, I know. The default implementation was not good enough for the DVC that is the major driver / consumer of this fsspec for now.

iterative / PyDrive2

Problem with the cache in `GDriveFileSystem.find` #229