iterative / PyDrive2

Google Drive API Python wrapper library. Maintained fork of PyDrive.
https://docs.iterative.ai/PyDrive2

Error in pull/push "The query is too complex." #353

Closed ermolaev94 closed 1 month ago

ermolaev94 commented 1 month ago

Context

My project is huge and covers several different ML directions. Each direction has several pipelines described in dvc.yaml. There are several different datasets, some of them larger than 10 TB.

The issue appeared when I tried to push the data of a particular experiment. The exp push command tries to process all .dvc files in the repo, and I have several hundred of them. Some of them are stored on one remote, while others are stored on all remotes; the logic differs from task to task. Some of the .dvc files are missing on all remotes, which is fine for me for now, because they are still being pushed (25 TB on S3).

The main remote is gdrive. It's a folder on a team drive. Its current size is ~5 TB and ~150k files, so we are not very close to the limits.

Error

Pushing the experiment raises the following error:

$ dvc exp push origin rf_sgm_v6.01 -v                 
2024-07-11 17:23:36,273 DEBUG: v3.51.2 (pip), CPython 3.10.12 on Linux-6.8.0-35-generic-x86_64-with-glibc2.35                                               
2024-07-11 17:23:36,273 DEBUG: command: /home/ermolaev/projects/radml/venv/bin/dvc exp push origin rf_sgm_v6.01 -v --no-run-cache                           
2024-07-11 17:23:36,477 DEBUG: git push experiment ['refs/exps/f6/0bc58f3f1ee9bbe7842f164d360270fc677032/rf_sgm_v6.01:refs/exps/f6/0bc58f3f1ee9bbe7842f164d360270fc677032/rf_sgm_v6.01'] -> 'origin'
2024-07-11 17:23:38,032 DEBUG: dvc push experiment '[ExpRefInfo(baseline_sha='f60bc58f3f1ee9bbe7842f164d360270fc677032', name='rf_sgm_v6.01')]'             
Collecting                                                                                                                        |0.00 [00:00,    ?entry/s]
<...>
FileNotFoundError: [Errno 2] No such file or directory: '/data/projects/radml/DVC_CACHE//1c/bddf98ac99bbf716816b36811e3fb6.dir'                             

2024-07-11 17:29:57,551 DEBUG: Preparing to transfer data from '/data/projects/radml/DVC_CACHE/' to 'gdrive://1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL'            
2024-07-11 17:29:57,551 DEBUG: Preparing to collect status from '1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL'                                                         
2024-07-11 17:29:57,552 DEBUG: Collecting status from '1GkI-Tgwdi1r2oat1E0wxFG3sXc6oL2UL'                                                                   
2024-07-11 17:29:57,555 DEBUG: Querying 1 oids via object_exists                                                                                            
2024-07-11 17:29:58,629 DEBUG: Querying 12 oids via object_exists                                                                                           
2024-07-11 17:30:01,965 DEBUG: Estimated remote size: 256 files                                                                                             
2024-07-11 17:30:01,967 DEBUG: Querying 45 oids via traverse                                                                                                
Pushing                                                                                                                                                     
2024-07-11 17:30:39,931 DEBUG: Studio token not found.                                                                                                      
Experiment rf_sgm_v6.01 is up to date on Git remote 'origin'.                                                                                               
2024-07-11 17:30:39,987 ERROR: failed to push cache: <HttpError 400 when requesting https://www.googleapis.com/drive/v2/files returned "The query is too complex.". Details: "[{'message': 'The query is too complex.', 'domain': 'global', 'reason': 'queryTooComplex', 'location': 'q', 'locationType': 'parameter'}]">
Traceback (most recent call last):                                                                                                                          
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/pydrive2/files.py", line 84, in _GetList                                            
    self.auth.service.files()                                                                                                                               
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper                       
    return wrapped(*args, **kwargs)                                                                                                                         
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/googleapiclient/http.py", line 938, in execute                                      
    raise HttpError(resp, content, uri=self.uri)                                                                                                            
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://www.googleapis.com/drive/v2/files returned "The query is too complex.". Details: "[{'message': 'The query is too complex.', 'domain': 'global', 'reason': 'queryTooComplex', 'location': 'q', 'locationType': 'parameter'}]">

During handling of the above exception, another exception occurred:
Traceback (most recent call last):                                                                                                                          
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/repo/experiments/push.py", line 126, in push                                    
    result["uploaded"] = _push_cache(repo, pushed_refs_info, **kwargs)                                                                                      
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/repo/experiments/push.py", line 182, in _push_cache                             
    return repo.push(                                                                                                                                       
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/repo/__init__.py", line 58, in wrapper                                          
    return f(repo, *args, **kwargs)                                                                                                                         
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/repo/push.py", line 147, in push                                                
    push_transferred, push_failed = ipush(                                                                                                                  
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_data/index/push.py", line 76, in push                                           
    result = transfer(                                                                                                                                      
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_data/hashfile/transfer.py", line 203, in transfer                               
    status = compare_status(                                                                                                                                
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_data/hashfile/status.py", line 179, in compare_status                           
    dest_exists, dest_missing = status(                                                                                                                     
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_data/hashfile/status.py", line 151, in status                                   
    exists.update(odb.oids_exist(hashes, jobs=jobs, progress=pbar.callback))                                                                                
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_objects/db.py", line 454, in oids_exist                                         
    return list(oids & set(wrap_iter(remote_oids, callback)))                                                                                               
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_objects/db.py", line 35, in wrap_iter                                           
    for index, item in enumerate(iterable, start=1):                                                                                                        
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_objects/db.py", line 346, in _list_oids_traverse                                
    yield from self._list_oids(prefixes=traverse_prefixes, jobs=jobs)                                                                                       
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_objects/db.py", line 250, in _list_oids                                         
    for path in self._list_prefixes(prefixes=prefixes, jobs=jobs):                                                                                          
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_objects/db.py", line 225, in _list_prefixes                                     
    yield from self.fs.find(paths, batch_size=jobs, prefix=prefix)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 529, in find
    yield from self.fs.find(path)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/pydrive2/fs/spec.py", line 490, in find
    for item in self._gdrive_list_ids(query_ids):
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/funcy/flow.py", line 99, in retry
    return call()
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/pydrive2/fs/spec.py", line 308, in <lambda>
    get_list = _gdrive_retry(lambda: next(file_list, None))
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/pydrive2/apiattr.py", line 150, in __next__
    result = self._GetList()
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/pydrive2/auth.py", line 85, in _decorated
    return decoratee(self, *args, **kwargs)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/pydrive2/files.py", line 89, in _GetList
    raise ApiRequestError(error)
pydrive2.files.ApiRequestError: <HttpError 400 when requesting https://www.googleapis.com/drive/v2/files returned "The query is too complex.". Details: "[{'message': 'The query is too complex.', 'domain': 'global', 'reason': 'queryTooComplex', 'location': 'q', 'locationType': 'parameter'}]">

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/cli/__init__.py", line 211, in main
    ret = cmd.do_run()
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/cli/command.py", line 27, in do_run
    return self.run()
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/commands/experiments/push.py", line 55, in run
    result = self.repo.experiments.push(
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/repo/experiments/__init__.py", line 364, in push
    return push(self.repo, *args, **kwargs)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/repo/scm_context.py", line 143, in run
    return method(repo, *args, **kw)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/repo/experiments/push.py", line 134, in push
    raise UploadError("failed to push cache", result) from e
dvc.repo.experiments.push.UploadError: failed to push cache

2024-07-11 17:30:39,990 DEBUG: Analytics is enabled.
2024-07-11 17:30:40,043 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpp2q20wf1', '-v']
2024-07-11 17:30:40,049 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpp2q20wf1', '-v'] with pid 8052

Interestingly, since then I've started getting the same error even with simple commands that used to work earlier:

(venv) ermolaev@df783b0a927d:~/projects/radml/cvl-cvisionrad-ml/ribs/data/small_datasets/fractures_0124_seg$ ls                                             
h5  h5-corrected  h5-processed  props  raw  raw.zip  raw.zip.dvc  README.md                                                                                 
(venv) ermolaev@df783b0a927d:~/projects/radml/cvl-cvisionrad-ml/ribs/data/small_datasets/fractures_0124_seg$ dvc pull raw.zip.dvc -v                        
2024-07-12 12:27:10,175 DEBUG: v3.51.2 (pip), CPython 3.10.12 on Linux-6.8.0-35-generic-x86_64-with-glibc2.35                                               
2024-07-12 12:27:10,175 DEBUG: command: /home/ermolaev/projects/radml/venv/bin/dvc pull raw.zip.dvc -v                                                      
2024-07-12 12:27:19,044 ERROR: unexpected error - <HttpError 400 when requesting https://www.googleapis.com/drive/v2/files returned "The query is too complex.". Details: "[{'message': 'The query is too complex.', 'domain': 'global', 'reason': 'queryTooComplex', 'location': 'q', 'locationType': 'parameter'}]">
Traceback (most recent call last):                                                                                                                          
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/pydrive2/files.py", line 84, in _GetList                                            
    self.auth.service.files()                                                                                                                               
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper                       
    return wrapped(*args, **kwargs)                                                                                                                         
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/googleapiclient/http.py", line 938, in execute                                      
    raise HttpError(resp, content, uri=self.uri)                                                                                                            
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://www.googleapis.com/drive/v2/files returned "The query is too complex.". Details: "[{'message': 'The query is too complex.', 'domain': 'global', 'reason': 'queryTooComplex', 'location': 'q', 'locationType': 'parameter'}]">

During handling of the above exception, another exception occurred:                                                                                         

Traceback (most recent call last):                                                                                                                          
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/cli/__init__.py", line 211, in main                                             
    ret = cmd.do_run()                                                                                                                                      
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/cli/command.py", line 27, in do_run                                             
    return self.run()                                                                                                                                       
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/commands/data_sync.py", line 35, in run                                         
    stats = self.repo.pull(                                                                                                                                 
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/repo/__init__.py", line 58, in wrapper                                          
    return f(repo, *args, **kwargs)                                                                                                                         
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/repo/pull.py", line 30, in pull
    processed_files_count = self.fetch( 
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/repo/__init__.py", line 58, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/repo/fetch.py", line 139, in fetch
    self.stage_cache.pull(remote)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/stage/cache.py", line 290, in pull
    return self.transfer(odb, self.repo.cache.legacy)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/stage/cache.py", line 254, in transfer
    for src in from_fs.find(runs):
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 529, in find
    yield from self.fs.find(path)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/pydrive2/fs/spec.py", line 490, in find
    for item in self._gdrive_list_ids(query_ids):
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/funcy/flow.py", line 99, in retry
    return call()
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/pydrive2/fs/spec.py", line 308, in <lambda>
    get_list = _gdrive_retry(lambda: next(file_list, None))
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/pydrive2/apiattr.py", line 150, in __next__
    result = self._GetList()
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/pydrive2/auth.py", line 85, in _decorated
    return decoratee(self, *args, **kwargs)
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/pydrive2/files.py", line 89, in _GetList
    raise ApiRequestError(error)
pydrive2.files.ApiRequestError: <HttpError 400 when requesting https://www.googleapis.com/drive/v2/files returned "The query is too complex.". Details: "[{'message': 'The query is too complex.', 'domain': 'global', 'reason': 'queryTooComplex', 'location': 'q', 'locationType': 'parameter'}]">

2024-07-12 12:27:19,090 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2024-07-12 12:27:19,091 DEBUG: Removing '/home/ermolaev/projects/radml/.DFPsTQSNqOlOCuhay0gqrw.tmp'
2024-07-12 12:27:19,091 DEBUG: link type hardlink is not available ([Errno 95] no more link types left to try out)
2024-07-12 12:27:19,091 DEBUG: Removing '/home/ermolaev/projects/radml/.DFPsTQSNqOlOCuhay0gqrw.tmp'
2024-07-12 12:27:19,091 DEBUG: Removing '/home/ermolaev/projects/radml/.DFPsTQSNqOlOCuhay0gqrw.tmp'
2024-07-12 12:27:19,091 DEBUG: Removing '/data/projects/radml/DVC_CACHE/files/md5/.J4V_N7Jsrp5gB8I3Ply02A.tmp'
2024-07-12 12:27:19,095 DEBUG: Version info for developers:
DVC version: 3.51.2 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.8.0-35-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.1
        dvc_task = 0.4.0
        scmrepo = 3.3.6
Supports:
        gdrive (pydrive2 = 1.19.0),
        http (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.6.1, boto3 = 1.34.131)
Config:
        Global: /home/ermolaev/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: ext4 on /dev/sdc1
Caches: local
Remotes: gdrive, gdrive, gdrive, s3
Workspace directory: ext4 on /dev/sdb1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/7205a6ce3131e59a2db7211a94dd5faa

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2024-07-12 12:27:19,097 DEBUG: Analytics is enabled.
2024-07-12 12:27:19,156 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpd639or13', '-v']
2024-07-12 12:27:19,165 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpd639or13', '-v'] with pid 16730

I've tried removing .cache/pydrive2 to re-authenticate. It didn't help.

Additional Details

I asked a question first, so other details can be found here: https://discuss.dvc.org/t/error-in-pushing-experiments/2138/10

shcheklein commented 1 month ago

@ermolaev94 is it the same error if you run with dvc push --no-run-cache (i.e. split dvc exp push into git push + dvc push, as Dave suggested)?

File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc/stage/cache.py", line 254, in transfer
    for src in from_fs.find(runs):
  File "/home/ermolaev/projects/radml/venv/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 529, in find
    yield from self.fs.find(path)

Here it is actually trying to push the run cache (which, AFAIU, you don't need), so it's most likely related to https://github.com/iterative/dvc/issues/10449

Also, an improvement can be made in PyDrive2 to split the files into batches when it runs a query to check their existence. I'll look into this.
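A minimal sketch of what such batching could look like, in plain Python and without calling the Drive API. It assumes the listing query is built by joining one `'<id>' in parents` clause per folder ID with `or`, which is what produces a very long `q` parameter when many folders are listed at once. The helper names `chunked` and `build_parent_queries` and the batch size of 50 are hypothetical, chosen for illustration; the actual complexity limit is not stated in this thread, and this is not the PyDrive2 implementation.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists holding at most `size` IDs each."""
    if size < 1:
        raise ValueError("size must be positive")
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_parent_queries(folder_ids: Iterable[str], batch_size: int = 50) -> List[str]:
    """Build one query string per batch instead of a single huge one.

    The batch size is an assumed value, not a documented Drive limit.
    """
    queries = []
    for batch in chunked(folder_ids, batch_size):
        clause = " or ".join(f"'{fid}' in parents" for fid in batch)
        queries.append(f"({clause}) and trashed=false")
    return queries


if __name__ == "__main__":
    # Five hypothetical folder IDs, two per query, give three queries.
    for query in build_parent_queries([f"id{i}" for i in range(5)], batch_size=2):
        print(query)
```

Each resulting query would then be sent as a separate list request and the results concatenated, trading a few extra round trips for queries that stay short.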

shcheklein commented 1 month ago

@ermolaev94 could you also check the <google_drive>/<dvc_cache>/runs directory? Does it exist? How many subdirectories are there inside each of its subdirectories?

shcheklein commented 1 month ago

@ermolaev94 also, could you try this https://github.com/iterative/dvc/issues/10449#issuecomment-2229242286 please?

shcheklein commented 1 month ago

@ermolaev94 btw, could you please share some details about your use case (you can DM ivan @ dvc.ai)? ~5 TB and ~150k files is quite impressive. I never thought it was possible with Google Drive tbh :). It would be great to catch up and learn more. Let me know, or share your contact details so that I can reach out.

ermolaev94 commented 1 month ago

Yes, dvc push --no-run-cache works! Thank you! You are right, it looks like it's related to https://github.com/iterative/dvc/issues/10449

As for the runs directory in the DVC cache on Google Drive: it contains 702 files and 942 directories in total.

shcheklein commented 1 month ago

Closing as a duplicate of that issue for now. Thanks @ermolaev94 for confirming. We are looking into it.

ermolaev94 commented 1 month ago

ok, thank you!

btw, I sent an email about our use case to the address you mentioned.