remove `find` implementation

skshetry commented 1 year ago

fsspec by default provides a find implementation that is built on top of ls and info. Looking at the current implementation, it did not seem like it was recursive (maybe I am wrong here?), and looked complicated that I'd rather simplify and default to fsspec, and specialize only if needed.

shcheklein commented 1 year ago

@skshetry this is implementation most likely needed (it creates a cache of top level folders most likely and their IDs). I think btw we need to update it to handle the new cache format - it might be broken, not sure.

skshetry commented 1 year ago

@skshetry this is implementation most likely needed (it creates a cache of top level folders most likely and their IDs). I think btw we need to update it to handle the new cache format - it might be broken, not sure.

Shouldn't that happen as part of ls(), which should be caching too?

https://github.com/iterative/PyDrive2/blob/b6c83b20396e4aa7e96838437cbc6feca917280b/pydrive2/fs/spec.py#L465

shcheklein commented 1 year ago

Since ls was there before as well, I would be careful here. I can try to find some time to restore some context. Overall, the whole implementation is pretty bad and broken from the fsspec perspective, it was made optimal for DVC at that time.

skshetry commented 1 year ago

Since ls was there before as well, I would be careful here. I can try to find some time to restore some context. Overall, the whole implementation is pretty bad and broken from the fsspec perspective, it was made optimal for DVC at that time.

yup, agreed. find only worked for the root of the remote, and this worked fine for us in dvc when remote cache was at the root level, but now as we moved to <remote>/files/md5, find no longer works. Although I'd prefer we merge this, as all the tests passes for dvc in https://github.com/iterative/dvc-gdrive/pull/28, we can generalize the filesystem at a later time.

skshetry commented 1 year ago

Again, all the changes to cache structure can be and should be done without implementing find. ls and info are appropriate place for change.

(We will only need to implement find if there's a cheaper way to recursively list files in gdrive).

shcheklein commented 1 year ago

Although I'd prefer we merge this, as all the tests passes for dvc in https://github.com/iterative/dvc-gdrive/pull/28, we can generalize the filesystem at a later time.

I don't understand the implications of this, If have time to research - compare the number of API requests, performance, MT safety that would be great. Or at least some explanation why did we have this. W/o that let's not merge this please.

We need to test it on the new cache format and fix it if needed.

shcheklein commented 1 year ago

Some context that I remember. In Google drive:

names are not unique
you need an ID to access any directory (e.g. to list files inside)
API calls are expensive (rate-limited) - can become easily a bottleneck

Thus it was important to do:

create once (and only once!) all the 00, 01, ..... ff and get their IDs
make sure that we have them always in memory and don't hit APIs multiple times for that

find might have been related to all of that.

And, again, we need to review it for the new structure, otherwise it's indeed can be quite broken for DVC.

skshetry commented 1 year ago

I see that in find, it does:

https://github.com/iterative/PyDrive2/blob/a5dc1d9a4da73f8b4172c1020726bd457fb62213/pydrive2/fs/spec.py#L495

which should be implemented in ls() too.

https://github.com/iterative/PyDrive2/blob/a5dc1d9a4da73f8b4172c1020726bd457fb62213/pydrive2/fs/spec.py#L440-L444. Will look into it.

skshetry commented 1 year ago

The patch now caches the ID of folders on ls too. Previous implementation of find only used one query id per directory which was reused from _ids_cache. But I don't think we can make an assumption that all the dir ids that exists are in the cache, (and hence we have to do a list anyway?).

Regarding multithreading, it should be safe (thanks to GIL).

skshetry commented 1 year ago

otherwise it's indeed can be quite broken for DVC.

Note that it is broken for DVC right now.

skshetry commented 1 year ago

The failure in Python 3.7 on macOS is unrelated, see https://github.com/actions/setup-python/issues/682.

It is also fixed in fsspec, see https://github.com/fsspec/filesystem_spec/pull/1295. Anyway, we can pin Python to "3.7.16" for now (in a separate PR): https://github.com/iterative/PyDrive2/blob/a5dc1d9a4da73f8b4172c1020726bd457fb62213/.github/workflows/test.yml#L27

shcheklein commented 1 year ago

Just to be on the same page. Even with this patch it's broken for the new DVC. E.g. here when we initialize we cache root ids, we don't cache all of the now. Upstream we should be also creating them once, etc.

For this patch. How do we use / used find upstream, in DVC - could point me to the code please?

skshetry commented 1 year ago

Even with this patch it's broken for the new DVC

I got confirmation from user that it's broken for legacy ODB from a user. I need to investigate that.

E.g. here when we initialize we cache root ids, we don't cache all of the now.

It gets cached when _ids_cache is invoked, no?

https://github.com/iterative/PyDrive2/blob/f976876dd3a5673ddc8ca4fda8100a71b228ffa5/pydrive2/fs/spec.py#L430

For this patch. How do we use / used find upstream, in DVC - could point me to the code please?

This is where we use find.

https://github.com/iterative/dvc-objects/blob/7e07ccf42cf85f9fbea6b142a0609d796cba0539/src/dvc_objects/db.py#L216

efiop commented 1 year ago

Seems like as always gdrive is very involved and the state of pydrive2 and related stuff is not the best. Along with pydrive2 using legacy API version that might get dropped in the future, I just want to say, for the record, that dropping gdrive is also an option that we should consider here if we can't fix it quickly. It is great for onboarding, but probably doesn't contribute anything to the customers and we need to carefully consider if we can spend time on this. Not saying we should drop it now, but that we should keep that as a possible solution.

shcheklein commented 1 year ago

The patch now caches the ID of folders on ls too. Previous implementation of find only used one query id per directory which was reused from _ids_cache. But I don't think we can make an assumption that all the dir ids that exists are in the cache, (and hence we have to do a list anyway?).

This find implementation is very DVC-specific and GDrive-specific. It optimizes the number of API calls we make.

As soon as DVC starts it caches all 00, 01, .... ff. (on create dir we update it in thread-safety way). It's a single API calls.
When we run find we take only those (we don't care about file under root, etc). And run a single query to start fetching w/o listing the root again (each time an API call) + then making potentially multiple calls per each directory inside.

Things that are potentially broken / not optimal is that we don't pre-cache file/md5/00 ... file/md5/ff roots now (unless we initialize two different FSs instances with different roots?)

Otherwise I don't see why it would be broken. Were you able to reproduce it? I can run and probably fix it quickly then.

skshetry commented 1 year ago

The patch now caches the ID of folders on ls too. Previous implementation of find only used one query id per directory which was reused from _ids_cache. But I don't think we can make an assumption that all the dir ids that exists are in the cache, (and hence we have to do a list anyway?).

This find implementation is very DVC-specific and GDrive-specific. It optimizes the number of API calls we make.
1. As soon as DVC starts it caches all `00`, `01`, .... `ff`. (on create dir we update it in thread-safety way). It's a single API calls.

This does cache ids lazily now during ls which should have similar effect

2. When we run find we take only those (we don't care about file under root, etc). And run a _single_ query to start fetching w/o listing the root again (each time an API call) + then making potentially multiple calls per each directory inside.

This is the one that is broken at this time, and this PR avoids that for simplicity.

Otherwise I don't see why it would be broken. Were you able to reproduce it? I can run and probably fix it quickly then.

I have an alternative patch that caches path provided to fs.find(). Will create a PR.

shcheklein commented 1 year ago

@skshetry was the user problem reproduced? Can we start with that please. If there is way to reproduce there should be probably a simple fix?

Number of API calls is critical for GDrive DVC- thus this optimization. Unless we really have a food reason for this, let's not try to simplify it for now please?

skshetry commented 1 year ago

@skshetry was the user problem reproduced? Can we start with that please. If there is way to reproduce there should be probably a simple fix?

@shcheklein, there were two issues:

fs.ls("files/md5") was returning None when the directory did not exist (instead of raising FileNotFoundError), which is fixed in #283.
fs.find("files/md5") returns an empty list, because it is only using dir_ids cached from root, and does not populate dir_ids from files/md5.

skshetry commented 1 year ago

Closing in favour of #286.

iterative / PyDrive2

remove `find` implementation #285