Closed simone-viozzi closed 10 months ago
@simone-viozzi yes, thanks for creating the ticket. find
is deeply broken, unfortunately. It was originally implemented to serve DVC needs only and we never had time to get back and fix it properly :(
Any idea of a roadmap to fix it? With this broken https://github.com/iterative/PyDrive2/pull/222 is pretty much useless, it will work unexpectedly 90% of the time.
I could try, but I still don't understand how the cache works right now, and how it should work. If I had to implement the cache mechanism from scratch, I would most likely use something like this
With this broken https://github.com/iterative/PyDrive2/pull/222 is pretty much useless, it will work unexpectedly 90% of the time.
Hmm, I think cp
doesn't depend on find
. Cache itself is not broken, it's find
implementation is bad (not general). I think tbh, that we could move forward and merge w/o cache. Especially the cp
one.
copy
uses expand_path
, which uses find
.
https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L877
https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L908
cp_file
is fine, though, but works for just files.
okay, problems that I see:
maxdepth
is not supportedwithdirs
is not supported (must have for expand path)find('.')
and find('etasderwer')
return the same resultHey @shcheklein, did you know that find
is already implemented by ffspec
and it uses walk
:
https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L423-L453
And also walk
is already implemented by ffspec
using just ls
:
https://github.com/fsspec/filesystem_spec/blob/2633445fc54797c79a7ac96d213bd24dfbdfcdc2/fsspec/spec.py#L364-L421
@simone-viozzi yes, I know. The default implementation was not good enough for the DVC that is the major driver / consumer of this fsspec for now.
While working on https://github.com/iterative/PyDrive2/pull/222, I discovered that
find
has a bug with the cache.Let assume
self.path=root/tmp/
and a folder structure like:now let's do some tests:
and that is correct, but if we do only
find('root/tmp/fo1/fo2')
:This happens because find relay on the cache, and at the start the cache is only populated with ids from one level down
self.path
so in the last example, the content of the cache is:
I think, because there is no
tmp/fo1/fo2
(the starting path of find),query_ids
stays empty and the method return an empty list.The lines of code involved are: https://github.com/iterative/PyDrive2/blob/27bbf4c8b4ef7fc3cce97f645b0e79d6795bf73c/pydrive2/fs/spec.py#L469-L483