Open RaW-Git opened 1 month ago
Hi @shcheklein & @dberenbaum, this issue looks like a great learning opportunity. Are you happy for me to pick it up (assuming @RaW-Git is not interested in doing it themselves)?
@nablabits sure! please give it a try.
Hi @shcheklein, I've tried this and it appears to me that we fixed it in 3.50.2, but it's not clear to me how (diff).
I tested this with the examples repository; this is what I did:
dvc exp run --queue --set-param "train.batch_size=16,24"
dvc queue start -j 2
Am I missing something?
@nablabits how do you install it? I mean DVC. Are you using a virtualenv, are you completely destroying it? Also, the repo state - are you running it on the same / clean state? Just to make sure.
@shcheklein well, I just cloned the examples repository and installed the requirements in a virtual environment, as explained in the readme. Then I ran through this section in the documentation to get familiar with the process.
After running it for the first time, I realised that I could try to reproduce the issue with dvc exp run --queue --set-param "train.batch_size=16,24" and found that it was behaving OK for me.
So, after that, I downgraded dvc to 3.50.1, just with pip install dvc==3.50.1, and checked that the issue was there. I then upgraded again to the latest version, 3.51.2, just to double check, and the issue was not there.
Looking at the diff between both tags didn't reveal anything obvious to me, so I set out to find the tag that solved the issue, which happily was the very next one, 3.50.2. But maybe what you say about repo cleanliness had something to do with it :thinking:
Let me know what you think. In the meantime I will run a check with a fully clean repo pointing to 3.50.2 to rule out that scenario. :slightly_smiling_face:
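For anyone wanting to repeat the clean-state check described above, here is a minimal sketch of how it could be scripted. It is not taken from the thread: the repository URL is a placeholder argument, the requirements.txt file name and the 10-second wait are assumptions, and the run and repro_clean helpers are hypothetical names. By default it only prints the commands it would execute.

```shell
#!/usr/bin/env bash
# Sketch only: the repo URL, requirements.txt and the sleep duration are
# assumptions for illustration, not details confirmed in the thread.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = actually run them

# Print a command, and execute it unless we are in dry-run mode.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Reproduce the scenario in a brand-new clone and virtualenv for one DVC version.
# The body runs in a subshell so `cd` and PATH changes do not leak to the caller.
repro_clean() (
    set -eu
    version="$1"
    repo_url="$2"
    workdir="repro-dvc-$version"

    run git clone "$repo_url" "$workdir"
    run cd "$workdir"
    run python3 -m venv .venv
    PATH="$PWD/.venv/bin:$PATH"          # prefer the fresh virtualenv's tools

    run pip install -r requirements.txt  # assumed file name
    run pip install "dvc==$version"      # pin the version under test

    run dvc exp run --queue --set-param "train.batch_size=16,24"
    run dvc queue start -j 2
    run sleep 10                         # assumed delay so the workers can start
    run dvc queue status
)
```

Calling repro_clean 3.50.1 followed by repro_clean 3.50.2 with the same repository URL (and DRY_RUN=0) would give two runs that differ only in the installed DVC version, which is the comparison being attempted here.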
Just a quick update on this: I have run a fair number of experiments on the same version (3.50.1) and found that the issue appears intermittently: sometimes it shows up and sometimes it doesn't. I'll keep investigating until I can reproduce the error consistently.
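Since the behaviour seems intermittent, one way to estimate how often it happens is to repeat the same steps several times and keep the dvc queue status output of each attempt for later comparison. The sketch below is only an illustration: the run and repeat_repro helpers, the log directory name and the 10-second wait are hypothetical, and it assumes that dvc queue stop --kill followed by dvc queue remove --all is enough to reset the queue between attempts. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, the log directory and the sleep duration are
# assumptions; the reset steps between attempts may need adjusting.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = actually run them

# Print a command, and execute it unless we are in dry-run mode.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Queue the experiments, start two workers and snapshot the queue status,
# once per attempt, writing each snapshot to its own file.
repeat_repro() {
    local attempts="$1"
    local logdir="${2:-queue-status-logs}"
    local i=1
    while [ "$i" -le "$attempts" ]; do
        echo "== attempt $i =="
        run dvc exp run --queue --set-param "train.batch_size=16,24"
        run dvc queue start -j 2
        run sleep 10   # assumed delay so the workers can start
        if [ "$DRY_RUN" = "0" ]; then
            mkdir -p "$logdir"
            dvc queue status > "$logdir/attempt-$i.txt" 2>&1
        else
            echo "+ dvc queue status > $logdir/attempt-$i.txt"
        fi
        # Assumed to be enough to reset the queue before the next attempt.
        run dvc queue stop --kill
        run dvc queue remove --all
        i=$((i + 1))
    done
}
```

Running DRY_RUN=0 repeat_repro 10 inside the repository would leave ten status snapshots that can be compared by hand to see in how many of them the workers are missing.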
Bug Report
Description
I have two experiments queued up in my dvc queue:
Now I do dvc queue start -j 2. The 2 experiments are running (I can see that via the CPU and GPU usage). Also, dvc queue status shows them as running. But the workers reported by dvc queue status don't show up:
Reproduce
Expected
See the active running workers.
Environment information
Output of dvc doctor:
Additional Information (if any):