Closed skshetry closed 1 year ago
Waiting for the tests to pass in https://github.com/iterative/dvc-gdrive/pull/28.
@shcheklein, GDriveFileSystem does cache files id by default. On fs.find("/root/<>/files/md5"), the base here will be /files/md5. find() checks for path starting with /files/md5 which does not exist. So it uses base here instead of self.base. I don't think it's worth it to save one API call here (which usually gets cached anyway).
okay, one more consideration here - it's still not optimal for DVC I think. It will be running an extra query on each find to fetch roots, right? may be even two?
may be also not an issue, depends on how we feed things to it ... do we ever ask find('files/md5')? w/o any prefix? (that would mean an extra cost of getting to the children 00 .... ff every time).
may be also not an issue, depends on how we feed things to it ... do we ever ask find('files/md5')? w/o any prefix?
yes, we do find('files/md5'), and that is what was broken. It's just one extra API call though, right? Which gets cached?
It will be running an extra query on each find to fetch roots, right? may be even two?
For files/md5, that'll be just one. files/ is cached, so only an id for files/md5 will be fetched (which will be cached).
that would mean an extra cost of getting to the children 00 .... ff every time)
So, similarly here, fetching id for files/md5 will take one query, and a listing on files/md5 will take another. Both of them are cached, and won't be fetched again.
One cached initial call is fine. What happens next is that we are getting an extra call every single time. We need to run a query to get all children of 'files/md5'. That will be happening again and again unless I'm missing something (?).
We cache only id to name (path) and back. We don't cache query results (like the list of subdirectories) afaiu.
It should be just one extra query because on subsequent find() call, it'll use query id of all dir_ids matching files/md5 (so all prefixes should include that).
It should be just one extra query because on subsequent find() call,
yep, which is not that bad, but it's still the same query again and again. I don't remember by now if we do that in parallel and rapidly (I don't see a reason off the top of my head). The situation we want to avoid is one where that leads to 2x queries per second; that would be bad for us. If that's not the case, that's fine.
I am not sure I understand. That was the same case before too. We used to query for union of all prefixes over and over again.
We can think of using dircache in the future, similar to what is implemented in s3fs/gcsfs/adlfs.
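For illustration, a listing cache of that kind could look roughly like the sketch below. This is not the s3fs/gcsfs/adlfs implementation; ToyRemote, CachingFS, list_children, and the directory layout are hypothetical names made up for the example, and each call to the toy remote stands in for one API query.

```python
class ToyRemote:
    """Hypothetical stand-in for a remote; counts listing queries."""

    def __init__(self, tree):
        self.tree = tree  # maps directory path -> list of child names
        self.queries = 0

    def list_children(self, path):
        self.queries += 1
        return list(self.tree.get(path, []))


class CachingFS:
    """Caches directory listings so repeated ls() calls skip the remote."""

    def __init__(self, remote):
        self.remote = remote
        self.dircache = {}  # directory path -> cached listing

    def ls(self, path):
        if path not in self.dircache:
            self.dircache[path] = self.remote.list_children(path)
        return self.dircache[path]

    def invalidate(self, path=None):
        # Drop one entry, or everything when no path is given.
        if path is None:
            self.dircache.clear()
        else:
            self.dircache.pop(path, None)


remote = ToyRemote({"files/md5": ["00", "01", "ff"]})
fs = CachingFS(remote)
fs.ls("files/md5")
fs.ls("files/md5")
print(remote.queries)  # 1
```

The trade-off with such a cache is staleness: anything that writes under a directory would need to call something like invalidate() for that path.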
I am not sure I understand. That was the same case before too. We used to query for union of all prefixes over and over again.
no, here we are making an extra call now to first get the list of 00 ... ff under files/md5. Before, we were starting from 00 ... ff. That's the difference. At least from what I see. Maybe there is something else.
@shcheklein, on first find('files/md5') call, three things happen:
files/md5. It is fetched.
files/md5. Only one exists which was recently fetched. So this is essentially only a files/md5 listing.
On subsequent find('files/md5') call, id for files/md5 is already cached, and the query id for path matching files/md5 does return all paths with prefixes.
So at the end, subsequent find('files/md5') is just one query, same as before.
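The query counts described above can be checked with a toy model. This is a simplified sketch, not the actual GDriveFileSystem code: ToyDrive, ToyFS, fetch_id, list_under, and the ids are hypothetical, and every method call on the toy drive is counted as one API query.

```python
class ToyDrive:
    """Hypothetical remote: every method call counts as one API query."""

    def __init__(self):
        self.queries = 0
        self.ids = {"files/md5": "id-md5"}
        self.children = {"id-md5": ["files/md5/00", "files/md5/ff"]}

    def fetch_id(self, path):
        self.queries += 1
        return self.ids[path]

    def list_under(self, dir_ids):
        self.queries += 1
        return [p for d in dir_ids for p in self.children.get(d, [])]


class ToyFS:
    def __init__(self, drive):
        self.drive = drive
        self.id_cache = {}  # path -> id; listings are NOT cached

    def find(self, path):
        if path not in self.id_cache:  # only true on the first call
            self.id_cache[path] = self.drive.fetch_id(path)
        # Collect every cached directory id at or below the prefix.
        dir_ids = [i for p, i in self.id_cache.items() if p.startswith(path)]
        return self.drive.list_under(dir_ids)


drive = ToyDrive()
fs = ToyFS(drive)
fs.find("files/md5")
first = drive.queries
fs.find("files/md5")
second = drive.queries - first
print(first, second)  # 2 1
```

Under these assumptions the first call costs two queries (id fetch plus listing) and every later call costs one listing query, since only the id is cached.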
LGTM, @skshetry ! Let's merge it and release.
Thanks!
GDriveFileSystem was previously caching dir ids of root, and was using those on fs.find(). This worked well when the remote cache was at the root, but now since dvc uses /files/md5/ by default, the dir ids are no longer in the cache and find ends up returning an empty list.
This PR checks if the path is cached, and if not, it caches the ID of the path.
Tests pass for dvc in https://github.com/iterative/dvc-gdrive/pull/28
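The before/after behaviour described here can be sketched with plain dictionaries. This is an illustration under assumptions, not the code from the PR: find_before, find_after, fetch_id, and all ids and paths below are hypothetical.

```python
def find_before(id_cache, listing, path):
    """Old behaviour (as described): rely only on ids already cached."""
    dir_ids = [i for p, i in id_cache.items() if p.startswith(path)]
    return [f for d in dir_ids for f in listing.get(d, [])]


def find_after(id_cache, listing, path, fetch_id):
    """New behaviour (as described): cache the id of `path` if missing."""
    if path not in id_cache:
        id_cache[path] = fetch_id(path)
    return find_before(id_cache, listing, path)


# Hypothetical remote state: ids are made up for the example.
remote_ids = {"files/md5": "id-md5"}
listing = {"id-md5": ["files/md5/00/abc", "files/md5/ff/def"]}

# Only root-level directories were cached up front.
id_cache = {"files": "id-files"}

print(find_before(id_cache, listing, "files/md5"))  # []
print(find_after(id_cache, listing, "files/md5", remote_ids.__getitem__))
```

With only root-level ids cached, no cached path starts with files/md5, so the old lookup has nothing to list; once the id of the requested path is cached, the same lookup returns the files under it.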