iterative / PyDrive2

Google Drive API Python wrapper library. Maintained fork of PyDrive.
https://docs.iterative.ai/PyDrive2
Other
580 stars 69 forks source link

Problem with the cache in `GDriveFileSystem.find` #229

Closed simone-viozzi closed 10 months ago

simone-viozzi commented 2 years ago

While working on https://github.com/iterative/PyDrive2/pull/222, I discovered that find has a bug with the cache.

Let assume self.path=root/tmp/ and a folder structure like:

root/tmp/
        └── fo1/
           ├── file2.pdf
           └── fo2/
              ├── file3.pdf
              └── fo3/
                 └── file4.pdf

now let's do some tests:

f = fs.find('root/tmp/fo1/')
print(f)
> ['root/tmp/fo1/file2.pdf', 'root/tmp/fo1/fo2/file3.pdf', 'root/tmp/fo1/fo2/fo3/file4.pdf']

f = fs.find('root/tmp/fo1/fo2')
print(f)
> ['root/tmp/fo1/fo2/fo3/file4.pdf', 'root/tmp/fo1/fo2/file3.pdf']

and that is correct, but if we do only find('root/tmp/fo1/fo2'):

f = fs.find('root/tmp/fo1/fo2')
print(f)
>[]

This happens because find relay on the cache, and at the start the cache is only populated with ids from one level down self.path

so in the last example, the content of the cache is:

{
'1IETDYYj23PgGaInZofa9MyANyBlOoiyh': 'tmp', 
'1k6u2-FStB6rOlq6hmDXlRl2aLES1l6vp': 'tmp/fo1', 
}

I think, because there is no tmp/fo1/fo2 (the starting path of find), query_ids stays empty and the method return an empty list.

The lines of code involved are: https://github.com/iterative/PyDrive2/blob/27bbf4c8b4ef7fc3cce97f645b0e79d6795bf73c/pydrive2/fs/spec.py#L469-L483

shcheklein commented 2 years ago

@simone-viozzi yes, thanks for creating the ticket. find is deeply broken, unfortunately. It was originally implemented to serve DVC needs only and we never had time to get back and fix it properly :(

simone-viozzi commented 2 years ago

Any idea of a roadmap to fix it? With this broken https://github.com/iterative/PyDrive2/pull/222 is pretty much useless, it will work unexpectedly 90% of the time.

I could try, but I still don't understand how the cache works right now, and how it should work. If I had to implement the cache mechanism from scratch, I would most likely use something like this

shcheklein commented 2 years ago

With this broken https://github.com/iterative/PyDrive2/pull/222 is pretty much useless, it will work unexpectedly 90% of the time.

Hmm, I think cp doesn't depend on find. Cache itself is not broken, it's find implementation is bad (not general). I think tbh, that we could move forward and merge w/o cache. Especially the cp one.

simone-viozzi commented 2 years ago

copy uses expand_path, which uses find. https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L877 https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L908

cp_file is fine, though, but works for just files.

shcheklein commented 2 years ago

okay, problems that I see:

simone-viozzi commented 2 years ago

Hey @shcheklein, did you know that find is already implemented by ffspec and it uses walk: https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L423-L453

And also walk is already implemented by ffspec using just ls: https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L364-L421

shcheklein commented 2 years ago

@simone-viozzi yes, I know. The default implementation was not good enough for the DVC that is the major driver / consumer of this fsspec for now.