fs.find: cache path ids

skshetry commented 1 year ago

GDriveFileSystem was previously caching dir ids of root, and was using those on fs.find(). This worked well when the remote cache was at the root, but now since dvc uses /files/md5/ by default, the dir ids are no longer in the cache and find ends up returning an empty list.

This PR checks if the path is cached, and if not, it caches the ID of the path.

Tests passes for dvc in https://github.com/iterative/dvc-gdrive/pull/28

skshetry commented 1 year ago

Waiting for the tests to pass in https://github.com/iterative/dvc-gdrive/pull/28.

skshetry commented 1 year ago

@shcheklein, GDriveFileSystem does cache files id by default.

On fs.find("/root/<>/files/md5"), the base here will be /files/md5. find() checks for path starting with /files/md5 which does not exist.

https://github.com/iterative/PyDrive2/blob/489734477f767288418694d998ea4de95b11a04b/pydrive2/fs/spec.py#L484-L489

So it uses base here instead of self.base. I don't think it's worth it to save one API call here (which usually gets cached anyway).

shcheklein commented 1 year ago

okay, one more consideration here - it's still not optimal for DVC I think. It will be running an extra query on each find to fetch roots, right? may be even two?

shcheklein commented 1 year ago

okay, one more consideration here - it's still not optimal for DVC I think. It will be running an extra query on each find to fetch roots, right? may be even two?

may be also not an issue, depends on how we feed things to it ... do we ever ask find('files/md5')? w/o any prefix? (that would mean an extra cost of getting to the children 00 .... ff every time).

skshetry commented 1 year ago

may be also not an issue, depends on how we feed things to it ... do we ever ask find('files/md5')? w/o any prefix?

yes, we do find('files/md5'), and is what was broken. It's just one extra API call though, right? Which gets cached?

skshetry commented 1 year ago

It will be running an extra query on each find to fetch roots, right? may be even two?

For files/md5, that'll be just one. files/ is cached, so only an id for files/md5 will be fetched (which will be cached).

that would mean an extra cost of getting to the children 00 .... ff every time)

So, similarly here, fetching id for files/md5 will take one query, and a listing on files/md5 will take another. Both of them are cached, and won't be fetched again.

shcheklein commented 1 year ago

One cached initial call is fine. What happens next is when we are getting an extra call every single time. We need to run a query to get all chidren of the 'files/md'. That will be happening again and again unless i'm missing something (?).

shcheklein commented 1 year ago

We cache only id to name (path) and back. We don't cache query results (like the list of subdirectories) afaiu.

skshetry commented 1 year ago

We cache only id to name (path) and back. We don't cache query results (like the list of subdirectories) afaiu.

It should be just one extra query because on subsequent find() call, it'll use query id of all dir_ids matching files/md5 (so all prefixes should include that).

https://github.com/iterative/PyDrive2/blob/489734477f767288418694d998ea4de95b11a04b/pydrive2/fs/spec.py#L484-L489

shcheklein commented 1 year ago

It should be just one extra query because on subsequent find() call,

yep. which is not that bad - but still the same query again and again. I don't remember by now if we do that in parallel and rapidly (don't see a reason from the top of my head). The situation we want to avoid where that leads to 2x queries per second - that would be bad for us. If that's not the case- that's fine.

skshetry commented 1 year ago

I am not sure I understand. That was the same case before too. We used to query for union of all prefixes over and over again.

We can think of using dircache in the future, similar to which is implemented in s3fs/gcsfs/adlfs.

shcheklein commented 1 year ago

I am not sure I understand. That was the same case before too. We used to query for union of all prefixes over and over again.

no, here we are making an extra call now to get first the list of 00 ... ff under the files/md5. Before we were starting from 00 ... ff. That's the difference. At least from what I see, May be there is something else.

skshetry commented 1 year ago

I am not sure I understand. That was the same case before too. We used to query for union of all prefixes over and over again.

no, here we are making an extra call now to get first the list of 00 ... ff under the files/md5. Before we were starting from 00 ... ff. That's the difference. At least from what I see, May be there is something else.

@shcheklein, on first find('files/md5') call, three things happen:

There is no id for files/md5. It is fetched.
Then, it tries to get dir id for path starting with files/md5. Only one exists which was recently fetched. So this is essentially only a files/md5 listing.
Listing of prefixes gets cached, and it recurses.

On subsequent find('files/md5') call, id for files/md5 is already cached, and the query id for path matching files/md5 does return all paths with prefixes.

skshetry commented 1 year ago

So at the end, subsequent find('files/md5') is just one query, same as before.

shcheklein commented 1 year ago

LGTM, @skshetry ! Let's merge it and release.

shcheklein commented 1 year ago

Thanks!

iterative / PyDrive2

fs.find: cache path ids #286