iterative / dvc

🦉 ML Experiments and Data Management with Git
https://dvc.org
Apache License 2.0

dvc queue status doesn't report active workers #10427

Open RaW-Git opened 1 month ago

RaW-Git commented 1 month ago

Bug Report

Description

I have two experiments queued up in my dvc queue:

Screenshot 2024-05-14 at 12 31 33

Now I run dvc queue start -j 2. The two experiments are running (I can see that from the CPU and GPU usage), and dvc queue status also shows them as running. But the workers that dvc queue status should report don't show up:

Screenshot 2024-05-14 at 12 31 40

Reproduce

  1. dvc exp run --queue --name exp-1
  2. dvc exp run --queue --name exp-2
  3. dvc queue start -j 2
  4. dvc queue status

Expected

See the active running workers.
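The mismatch described above (tasks listed as running while no worker is reported as active) could be checked automatically with a small parser. This is only a sketch: it assumes the last line of dvc queue status has the form "Worker status: N active, M idle" and that each task row ends with its status. The helper names and the sample output are made up for illustration and are not copied from the screenshots.

```python
import re

# Assumed shape of the summary line printed by `dvc queue status`.
WORKER_LINE = re.compile(r"Worker status:\s*(\d+)\s+active,\s*(\d+)\s+idle")


def worker_counts(status_output):
    """Return (active, idle) from the summary line, or None if it is absent."""
    match = WORKER_LINE.search(status_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def running_tasks(status_output):
    """Count table rows whose last column is 'Running'."""
    count = 0
    for line in status_output.splitlines():
        columns = line.split()
        if columns and columns[-1] == "Running":
            count += 1
    return count


def shows_bug(status_output):
    """True when tasks are listed as running but no worker is reported active."""
    counts = worker_counts(status_output)
    active = counts[0] if counts else 0
    return running_tasks(status_output) > 0 and active == 0


# Hypothetical output, invented for illustration only.
SAMPLE = """\
Task     Name   Created   Status
1a2b3c4  exp-1  12:31 PM  Running
5d6e7f8  exp-2  12:31 PM  Running

Worker status: 0 active, 0 idle
"""

print(shows_bug(SAMPLE))
```

In practice the text would come from capturing the real command output (for example with subprocess.run) shortly after dvc queue start -j 2, rather than from a hard-coded sample.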

Environment information

Output of dvc doctor:

DVC version: 3.50.1 (pip)
-------------------------
Platform: Python 3.11.0rc1 on Linux-5.15.0-105-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.15.1
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.2
Supports:
        azure (adlfs = 2024.4.1, knack = 0.11.0, azure-identity = 1.16.0),
        http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3)
Config:
        Global: /home/raw/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: ext4 on /dev/sdb1
Caches: local
Remotes: azure
Workspace directory: ext4 on /dev/nvme0n1p3
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/1c67604caf3156e9fea7df37dda80d5f

Additional Information (if any):

nablabits commented 2 weeks ago

Hi @shcheklein & @dberenbaum, this issue looks like a great learning opportunity. Are you happy for me to pick it up (assuming @RaW-Git is not interested in doing so themselves)?

shcheklein commented 2 weeks ago

@nablabits sure! please give it a try.

nablabits commented 2 weeks ago

Hi @shcheklein, I've tried this and it appears to me that we fixed it in 3.50.2, but it's not clear to me how (diff).

I tested this with the examples repository. This is what I did:

Am I missing something?

shcheklein commented 2 weeks ago

@nablabits how do you install DVC? Are you using a virtualenv, and are you completely destroying it between versions? Also, the repo state: are you running it on the same state or on a clean one? Just to make sure.

nablabits commented 2 weeks ago

@shcheklein well, I just cloned the examples repository and installed the requirements in a virtual environment, as explained in the README. Then I ran through this section of the documentation to get familiar with the process.

After running it for the first time, I realised that I could try to reproduce the issue with dvc exp run --queue --set-param "train.batch_size=16,24", and found that it was behaving fine for me.

So, after that, I downgraded dvc to 3.50.1 with just pip install dvc==3.50.1 and confirmed that the issue was there. I then upgraded again to the latest version, 3.51.2, to double check, and the issue was not there.

Looking at the diff between both tags didn't reveal anything obvious to me, so I set out to find the tag that solved the issue, which happily was the very next one, 3.50.2. But maybe what you say about repo cleanliness had something to do with it :thinking:

Let me know what you think. In the meantime, I will run a check with a fully clean repo pointing to 3.50.2 to rule out that scenario. :slightly_smiling_face:
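One way to rule out leftovers from a previous install is to build a brand-new virtual environment for each DVC version under test. The sketch below assumes that approach. The helper names (venv_bin, install_commands, fresh_env) are hypothetical, and the only DVC-specific pieces are the pinned pip install and dvc doctor commands already used in this thread.

```python
import subprocess
import sys
import venv
from pathlib import Path


def venv_bin(env_dir, name):
    """Path to an executable inside a virtual environment."""
    subdir = "Scripts" if sys.platform == "win32" else "bin"
    return Path(env_dir) / subdir / name


def install_commands(env_dir, version):
    """Commands that install one pinned DVC version and print its diagnostics."""
    python = str(venv_bin(env_dir, "python"))
    dvc = str(venv_bin(env_dir, "dvc"))
    return [
        [python, "-m", "pip", "install", f"dvc=={version}"],
        [dvc, "doctor"],
    ]


def fresh_env(version, root):
    """Create an empty environment under root and install the given DVC version."""
    env_dir = Path(root) / f"dvc-{version}"
    # clear=True wipes any existing directory, so nothing carries over.
    venv.create(env_dir, with_pip=True, clear=True)
    for command in install_commands(env_dir, version):
        subprocess.run(command, check=True)
    return env_dir
```

Calling fresh_env("3.50.1", root) and fresh_env("3.50.2", root) with some scratch directory as root would give two isolated installs to compare. A fresh clone of the examples repository per run would cover the repo-state half of the question.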

nablabits commented 1 week ago

Just a quick update on this: I have run a fair number of experiments on the same version (3.50.1) and found that the issue appears intermittently, sometimes showing up and sometimes not. I'll keep investigating until I can reproduce the error consistently.
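Because the issue is intermittent, a single clean run on a newer version says little about whether it is fixed. As a rough guide, if the bug shows up with probability p on each independent run, the chance of n clean runs in a row is (1 - p)^n. The sketch below turns that into a run count. The function names and the 30% rate are illustrative assumptions, not measurements from this thread.

```python
import math


def observed_rate(trial, runs):
    """Fraction of runs in which trial() reported the bug (returned True)."""
    failures = sum(1 for _ in range(runs) if trial())
    return failures / runs


def clean_runs_needed(p_fail, confidence=0.95):
    """Smallest n such that n clean runs in a row would happen by chance
    with probability at most 1 - confidence, if the bug were still present."""
    if not 0 < p_fail < 1:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_fail))


# Assumed rate for illustration: the bug appears in about 30% of runs.
print(clean_runs_needed(0.3))  # 9
```

Here trial would be any callable that performs one reproduction attempt and returns True when the workers are missing from the status output. Measuring observed_rate on 3.50.1 first gives a value of p to plug into clean_runs_needed before testing 3.50.2.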