iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

`dvc pull`: Config field `jobs` in `.dvc/config` not taken into account as expected #7730

Open stefan-hartmann-lgs opened 2 years ago

stefan-hartmann-lgs commented 2 years ago

Bug Report

dvc pull: Config field jobs in .dvc/config not taken into account as expected

Description

dvc pull is still using cpu_count() * 4 download jobs even if jobs=4 is defined under the remote in use in .dvc/config.

However, jobs=4 does seem to have some effect: without it, dvc pull often fails with the error in this screenshot:

[image: screenshot of the error]

With jobs=4 in the config, that failure no longer occurs, so the setting is evidently being read somewhere.

Maybe the documentation is just not clear about this?

Reproduce

  1. dvc init
  2. Edit .dvc/config file and add a remote - for instance:
[core]
    remote = artifactory
['remote "artifactory"']
    url = https://my.artifactory.com/...
    auth = basic
    method = PUT
    jobs = 4
  3. Run dvc pull
  4. See that more than 4 download jobs are running in parallel

Expected

It was expected that the jobs=4 config in .dvc/config file does the same as running dvc pull --jobs 4 which is not the case.
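To make the expectation concrete, here is a minimal Python sketch of the precedence the report assumes: an explicit --jobs value wins, then the remote's jobs setting, then the cpu_count() * 4 default mentioned in the description. This is not DVC's implementation. read_remote_jobs and resolve_jobs are hypothetical helpers, and the standard-library configparser is only a stand-in for whatever parser DVC actually uses.

```python
import configparser
import os

# Config from the reproduction steps (the URL is elided in the report).
SAMPLE_CONFIG = """
[core]
    remote = artifactory
['remote "artifactory"']
    url = https://my.artifactory.com/...
    auth = basic
    method = PUT
    jobs = 4
"""


def read_remote_jobs(config_text, remote_name):
    """Return the `jobs` value set for a remote, or None (hypothetical helper)."""
    parser = configparser.ConfigParser(interpolation=None)
    # Strip indentation so every key is read as its own option.
    stripped = "\n".join(line.strip() for line in config_text.splitlines())
    parser.read_string(stripped)
    section = "'remote \"{}\"'".format(remote_name)
    if parser.has_section(section) and parser.has_option(section, "jobs"):
        return parser.getint(section, "jobs")
    return None


def resolve_jobs(cli_jobs, config_jobs):
    """Assumed precedence: CLI flag, then config, then default (hypothetical helper)."""
    if cli_jobs is not None:
        return cli_jobs
    if config_jobs is not None:
        return config_jobs
    # Default described in the report; os.cpu_count() may return None.
    return (os.cpu_count() or 1) * 4


config_jobs = read_remote_jobs(SAMPLE_CONFIG, "artifactory")
print(resolve_jobs(None, config_jobs))  # value taken from the config
print(resolve_jobs(8, config_jobs))     # explicit CLI value wins
```

Under this model, dvc pull with jobs = 4 in the config and dvc pull --jobs 4 would use the same number of workers, which is the behaviour the report expected.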

Environment information

DVC 2.10.2 on Windows.

Output of dvc doctor:

$ dvc doctor

DVC version: 2.10.2 (exe)
---------------------------------
Platform: Python 3.8.10 on Windows-10-10.0.19042-SP0
Supports:
        azure (adlfs = 2022.4.0, knack = 0.9.0, azure-identity = 1.10.0),
        gdrive (pydrive2 = 1.10.1),
        gs (gcsfs = 2022.3.0),
        hdfs (fsspec = 2022.3.0, pyarrow = 8.0.0),
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.3.0, boto3 = 1.21.21),
        ssh (sshfs = 2022.3.1),
        oss (ossfs = 2021.8.0),
        webdav (webdav4 = 0.9.7),
        webdavs (webdav4 = 0.9.7)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: https
Workspace directory: NTFS on C:\
Repo: dvc, git

daavoo commented 2 years ago

This is not currently supported (see available config fields for remote section in https://dvc.org/doc/command-reference/config#remote).

Is there any particular part of the documentation that led you to:

> It was expected that the jobs=4 config in .dvc/config file does the same as running dvc pull --jobs 4 which is not the case.

dberenbaum commented 2 years ago

@daavoo That links to https://dvc.org/doc/command-reference/remote/modify#available-parameters-for-all-remotes, which does show jobs as an option. Am I misunderstanding?

daavoo commented 2 years ago

🙏 Ignore my previous comment.

stefan-hartmann-lgs commented 2 years ago

> @daavoo That links to https://dvc.org/doc/command-reference/remote/modify#available-parameters-for-all-remotes, which does show jobs as an option. Am I misunderstanding?

Yes, exactly. The jobs setting in .dvc/config does something, but not the same thing as --jobs, which is what I expected based on the documentation.

daavoo commented 2 years ago

Hi @stefan-hartmann-lgs! How are you checking the "See that more than 4 download jobs are running in parallel" step?

I tried it, and the jobs config option is honored when the filesystem is instantiated and is passed down to the transfer task:

https://github.com/iterative/dvc/blob/f23d31af644ab4ad4492b9cfa1000d58420c238d/dvc/fs/__init__.py#L94-L110

Could you try sharing the output profile from:

pip install viztracer
dvc pull --viztracer-depth 8

stefan-hartmann-lgs commented 2 years ago

Hi @daavoo

Sorry for the late reply. I can see in my console that more than 4 jobs are running (even with jobs=4 in .dvc/config).

As stated above, I expected only 4 jobs to be running, but I see 32 (see the screenshot below).

[image: screenshot of the console]

karajan1001 commented 2 years ago

It's better to use dvc config -l instead of running type .dvc/config, because there might be additional settings hidden in .dvc/config.local.
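As an illustration of why the merged view matters, the sketch below layers a local config over the repo config, with the local values taking precedence. It is a simplified model, not DVC's code: merge_configs is a hypothetical helper, the jobs = 16 override is invented for the example (the issue does not show any .dvc/config.local), and configparser is only a stand-in parser.

```python
import configparser

# Repo-level setting, as in .dvc/config from the report.
REPO_CONFIG = """
['remote "artifactory"']
jobs = 4
"""

# Hypothetical override; the issue does not show a .dvc/config.local file.
LOCAL_CONFIG = """
['remote "artifactory"']
jobs = 16
"""


def merge_configs(*layers):
    """Merge config texts in order; later layers override earlier ones."""
    parser = configparser.ConfigParser(interpolation=None)
    for text in layers:
        parser.read_string(text)
    return parser


section = "'remote \"artifactory\"'"
merged = merge_configs(REPO_CONFIG, LOCAL_CONFIG)
# Reading only .dvc/config would show 4; the merged view shows the override.
print(merged.get(section, "jobs"))
```

Inspecting a single file can therefore be misleading when more than one config level defines the same key.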

stefan-hartmann-lgs commented 2 years ago

@daavoo any update on this one?

dberenbaum commented 2 years ago

Ping @daavoo