Closed skshetry closed 1 year ago
@skshetry this implementation is most likely needed (it most likely creates a cache of top-level folders and their IDs). Btw, I think we need to update it to handle the new cache format. It might be broken, not sure.
Shouldn't that happen as part of `ls()`, which should be caching too?
Since `ls` was there before as well, I would be careful here. I can try to find some time to restore some context. Overall, the whole implementation is pretty bad and broken from the fsspec perspective; it was made optimal for DVC at that time.
yup, agreed. `find` only worked for the root of the remote, and this worked fine for us in dvc when the remote cache was at the root level, but now that we have moved to `<remote>/files/md5`, `find` no longer works. Although I'd prefer we merge this, as all the tests pass for dvc in https://github.com/iterative/dvc-gdrive/pull/28, we can generalize the filesystem at a later time.
Again, all the changes to cache structure can be and should be done without implementing `find`. `ls` and `info` are the appropriate place for the change.
(We will only need to implement `find` if there's a cheaper way to recursively list files in gdrive).
> Although I'd prefer we merge this, as all the tests pass for dvc in https://github.com/iterative/dvc-gdrive/pull/28, we can generalize the filesystem at a later time.

I don't understand the implications of this. If you have time to research it (compare the number of API requests, performance, MT safety), that would be great. Or at least some explanation of why we had this. W/o that, let's not merge this please.
We need to test it on the new cache format and fix it if needed.
Some context that I remember. In Google drive:
Thus it was important to do:
`00`, `01`, ..... `ff`
and get their IDs
`find` might have been related to all of that.
And, again, we need to review it for the new structure, otherwise it can indeed be quite broken for DVC.
I see that in `find`, it does:
which should be implemented in `ls()` too.
https://github.com/iterative/PyDrive2/blob/a5dc1d9a4da73f8b4172c1020726bd457fb62213/pydrive2/fs/spec.py#L440-L444. Will look into it.
The patch now caches the ID of folders on `ls` too. The previous implementation of `find` only used one query id per directory, which was reused from `_ids_cache`. But I don't think we can make an assumption that all the dir ids that exist are in the cache (and hence we have to do a `list` anyway?).
Regarding multithreading, it should be safe (thanks to GIL).
> otherwise it can indeed be quite broken for DVC.

Note that it is broken for DVC right now.
The failure in Python 3.7 on macOS is unrelated, see https://github.com/actions/setup-python/issues/682.
It is also fixed in fsspec, see https://github.com/fsspec/filesystem_spec/pull/1295. Anyway, we can pin Python to "3.7.16" for now (in a separate PR): https://github.com/iterative/PyDrive2/blob/a5dc1d9a4da73f8b4172c1020726bd457fb62213/.github/workflows/test.yml#L27
Just to be on the same page. Even with this patch it's broken for the new DVC. E.g. here when we initialize we cache root ids, we don't cache all of them now. Upstream we should be also creating them once, etc.
For this patch. How do we use / used `find` upstream, in DVC? Could you point me to the code please?
> Even with this patch it's broken for the new DVC

I got confirmation from a user that it's broken for legacy ODB. I need to investigate that.

> E.g. here when we initialize we cache root ids, we don't cache all of them now.

It gets cached when `_ids_cache` is invoked, no?

> For this patch. How do we use / used `find` upstream, in DVC? Could you point me to the code please?

This is where we use `find`.
Seems like as always gdrive is very involved and the state of pydrive2 and related stuff is not the best. Along with pydrive2 using legacy API version that might get dropped in the future, I just want to say, for the record, that dropping gdrive is also an option that we should consider here if we can't fix it quickly. It is great for onboarding, but probably doesn't contribute anything to the customers and we need to carefully consider if we can spend time on this. Not saying we should drop it now, but that we should keep that as a possible solution.
> The patch now caches the ID of folders on `ls` too. The previous implementation of `find` only used one query id per directory, which was reused from `_ids_cache`. But I don't think we can make an assumption that all the dir ids that exist are in the cache (and hence we have to do a `list` anyway?).

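This `find` implementation is very DVC-specific and GDrive-specific. It optimizes the number of API calls we make.
1. As soon as DVC starts it caches all `00`, `01`, .... `ff` (on create dir we update it in a thread-safe way). It's a single API call.
2. When we run `find` we take only those (we don't care about files under root, etc.) and run a _single_ query to start fetching, w/o listing the root again (each time an API call) and then making potentially multiple calls per each directory inside.

Things that are potentially broken / not optimal: we don't pre-cache the `files/md5/00` ... `files/md5/ff` roots now (unless we initialize two different FS instances with different roots?)
Otherwise I don't see why it would be broken. Were you able to reproduce it? I can run and probably fix it quickly then.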
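To make the "single query" idea in the points above concrete, here is a minimal sketch of how the cached folder IDs could be folded into one Drive search expression. The helper name `build_parents_query` and the sample IDs are hypothetical; only the `'<id>' in parents` and `trashed=false` terms are standard Drive query syntax, and this is not necessarily how PyDrive2 builds its query.

```python
from typing import Iterable


def build_parents_query(dir_ids: Iterable[str]) -> str:
    """Build one Drive `q` expression matching children of any given folder."""
    clauses = [f"'{dir_id}' in parents" for dir_id in dir_ids]
    if not clauses:
        # An empty cache would otherwise produce an invalid "()" expression.
        raise ValueError("no cached folder IDs to query")
    return f"({' or '.join(clauses)}) and trashed=false"


# Hypothetical cache: prefix folder name -> Drive folder ID.
ids_cache = {"00": "id-00", "01": "id-01", "ff": "id-ff"}
query = build_parents_query(ids_cache.values())
print(query)
```

With such a query, listing the files of all cached prefix folders costs one paginated request series instead of one listing per folder. It also shows the failure mode discussed here: if the cache only holds IDs resolved from the root, nothing under `files/md5` ever enters the query.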
> > The patch now caches the ID of folders on `ls` too. The previous implementation of `find` only used one query id per directory, which was reused from `_ids_cache`. But I don't think we can make an assumption that all the dir ids that exist are in the cache (and hence we have to do a `list` anyway?).
>
> This `find` implementation is very DVC-specific and GDrive-specific. It optimizes the number of API calls we make.
>
> 1. As soon as DVC starts it caches all `00`, `01`, .... `ff` (on create dir we update it in a thread-safe way). It's a single API call.

This does cache ids lazily now during `ls`, which should have a similar effect.

> 2. When we run `find` we take only those (we don't care about files under root, etc.) and run a _single_ query to start fetching, w/o listing the root again (each time an API call) and then making potentially multiple calls per each directory inside.

This is the one that is broken at this time, and this PR avoids that for simplicity.

> Otherwise I don't see why it would be broken. Were you able to reproduce it? I can run and probably fix it quickly then.

I have an alternative patch that caches the path provided to `fs.find()`. Will create a PR.
@skshetry was the user problem reproduced? Can we start with that please? If there is a way to reproduce it, there should probably be a simple fix.
The number of API calls is critical for GDrive in DVC, thus this optimization. Unless we really have a good reason for this, let's not try to simplify it for now please?
> @skshetry was the user problem reproduced? Can we start with that please? If there is a way to reproduce it, there should probably be a simple fix.

@shcheklein, there were two issues:
1. `fs.ls("files/md5")` was returning `None` when the directory did not exist (instead of raising `FileNotFoundError`), which is fixed in #283.
2. `fs.find("files/md5")` returns an empty list, because it is only using `dir_ids` cached from root, and does not populate `dir_ids` from `files/md5`.

Closing in favour of #286.
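For illustration, the two issues can be reproduced against a small in-memory stand-in. Everything here (`FakeDriveFS`, its `tree` and `ids` arguments, the sample IDs) is hypothetical and only mimics the behaviour described; it is not the PyDrive2 code. The sketch raises `FileNotFoundError` for a missing directory and fills `dir_ids` lazily during `ls`, so paths below the root get cached as well.

```python
from typing import Dict, List


class FakeDriveFS:
    """In-memory stand-in for a Drive-backed filesystem (hypothetical)."""

    def __init__(self, tree: Dict[str, List[str]], ids: Dict[str, str]):
        self._tree = tree  # directory path -> child names
        self._ids = ids    # simulated remote lookup: folder path -> folder ID
        self.dir_ids: Dict[str, str] = {"": ids[""]}  # only root is pre-cached

    def _resolve_id(self, path: str) -> str:
        if path not in self.dir_ids:
            if path not in self._ids:
                # Raise instead of silently returning None.
                raise FileNotFoundError(path)
            self.dir_ids[path] = self._ids[path]  # would be an API call
        return self.dir_ids[path]

    def ls(self, path: str) -> List[str]:
        path = path.strip("/")
        self._resolve_id(path)
        children = []
        for name in self._tree.get(path, []):
            child = f"{path}/{name}" if path else name
            if child in self._ids:
                # Child is a folder: remember its ID for later queries.
                self.dir_ids[child] = self._ids[child]
            children.append(child)
        return children


fs = FakeDriveFS(
    tree={"": ["files"], "files": ["md5"], "files/md5": ["00"], "files/md5/00": ["abc"]},
    ids={"": "root", "files": "id1", "files/md5": "id2", "files/md5/00": "id3"},
)
listing = fs.ls("files/md5")
print(listing, sorted(fs.dir_ids))
```

After the call, `dir_ids` holds entries for `files/md5` and `files/md5/00` in addition to the root, which is what a `find` starting at `files/md5` would need.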
fsspec by default provides a `find` implementation that is built on top of `ls` and `info`. Looking at the current implementation, it did not seem like it was recursive (maybe I am wrong here?), and it looked complicated enough that I'd rather simplify and default to fsspec, and specialize only if needed.