Closed lefos99 closed 3 weeks ago
Is there a small repo to reproduce this? How long does it take to run a single experiment? Does dvc exp show
show them running by chance (can it be a reporting problem like this https://github.com/iterative/dvc/issues/10427 ?)
I've tried the following:
dvc.yaml
:
stages:
run:
cmd: sleep 1; echo ${i} > metric.json
outs:
- metric.json
params.yaml
:
i: 1
Running like:
dvc exp run --queue -S "i=range(0,10,1)"
then:
dvc queue start
And after a while, I'm getting all the experiments done.
Please, if you can reproduce it - try to come up with a simple repo that we could use - otherwise it's quite hard to understand why it is happening. Thanks!
Thanks for the reply, @shcheklein ! :smile_cat:
How long does it take to run a single experiment?
In order to run a single experiment it takes around 46 mins.
Does dvc exp show show them running by chance?
I didn't check dvc exp show
, but dvc queue
shows them in the queue, like the experiments are in the queue but the worker is not running. I also checked and the experiments indeed didn't run. (the expected output files were not there.)
dvc queue status
Task Name Created Status
9294987 tumor_dysp_weight_hsv_rgb_min_mean Sep 19, 2024 Queued
22c83f3 tumor_dysp_weight_hsv_rgb_mean_mean Sep 19, 2024 Queued
6f3c8e9 tumor_dysp_weight_hsv_rgb_min_min Sep 19, 2024 Queued
f1ce1d2 tumor_dysp_weight_hard_min_mean Sep 19, 2024 Queued
3775500 tumor_dysp_weight_hard_mean_mean Sep 19, 2024 Success
10f8332 tumor_dysp_weight_hard_min_min Sep 19, 2024 Success
Worker status: 0 active, 0 idle
Please, if you can reproduce it - try to come up with a simple repo that we could use - otherwise it's quite hard to understand why it is happening. Thanks!
I totally understand, but the problem is that sometimes it happens and sometimes it is not. Nevertheless, I will try to reproduce it with a small repo and report back.
Actually, reproducing it was easier than I thought.
mkdir dvc_playground && cd dvc_playground
git init && dvc init && git add -A && git commit -m "First commit"
mkdir out data
echo lala >> data/dummy.txt && dvc add data/dummy.txt
echo "from time import perf_counter
from time import sleep
start = perf_counter()
print('Hello everyone')
sleep(2)
end = perf_counter()
print('I am done')
with open('out/how_much_time.txt', 'w') as f:
f.write(str(f'Time has passed is {end-start}'))
" > run_train.py
echo "stages:
train:
cmd: python run_train.py
deps:
- data/dummy.txt
outs:
- out/how_much_time.txt
"> dvc.yaml
git add -A && git commit -m 'Add DVC stuff'
Then one can run in order to remove old experiments in the queue and start 30 new experiments with 6 workers:
dvc queue remove --all && repeat 30 {dvc exp run -f --queue} && dvc queue start -j 6 && watch dvc queue status
What I observed is that the unexpected behavior not always happens, but it happens pretty often. I ran the above command 10 times and it failed 5/10 (50%). So I ended up with
Task Name Created Status
46b886b muddy-fare 12:47 PM Queued
0741a83 sandy-bunk 12:47 PM Queued
234edb7 nowed-pupa 12:47 PM Queued
d419c3f dorty-wain 12:47 PM Queued
3ad8d1b cynic-toff 12:47 PM Queued
9bca7a1 beery-gyms 12:47 PM Queued
5e41e1c massy-ewes 12:47 PM Queued
bfdf673 smoky-jeer 12:47 PM Queued
81d3057 salic-sena 12:47 PM Queued
c359d35 quack-runs 12:47 PM Queued
2f715a1 trade-libs 12:47 PM Queued
facc3ac so-so-wage 12:47 PM Queued
4f6950e silty-hits 12:47 PM Queued
39e29b0 gawky-merk 12:47 PM Queued
771ca8c fifty-dabs 12:47 PM Queued
d86614a weird-lure 12:47 PM Queued
035dcd8 coaly-prob 12:47 PM Queued
e6dc2b5 dolce-kaka 12:47 PM Queued
f9439ae gusty-daks 12:47 PM Queued
df2fdf4 goosy-iron 12:47 PM Queued
a4bf7ec tangy-dees 12:47 PM Queued
82f5770 natal-sand 12:47 PM Queued
5f1401b cedar-flip 12:47 PM Queued
e6d98b4 mousy-sinh 12:47 PM Queued
56116a0 dosed-knap 12:46 PM Success
bf66773 pricy-buhl 12:46 PM Success
0293e5f sarky-snip 12:46 PM Success
5833eb6 swish-heir 12:46 PM Success
275e77a hexed-cats 12:46 PM Success
9c1bf71 hyoid-soul 12:47 PM Success
Worker status: 0 active, 0 idle
I also managed to reproduce this behavior on 3 different machines.
Yep, I think it is really the same as https://github.com/iterative/dvc/issues/10427 as @nablabits has done an excellent research there. I see after the first run a bunch of messages left in the broker:
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279054_db04a69e-e33c-4d26-9632-66031a0bac9f.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279054_4cb7a50e-739a-44bb-a110-15d371487f21.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279054_b9843cfa-d030-43f6-a257-4fabbb72030e.dvc-exp-394d7a-1@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279054_a73b40a0-0f35-4654-a1e1-d72bdad04fe4.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279054_ea17010c-08c9-4eb9-8199-0ee00cf50f59.dvc-exp-394d7a-2@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279054_c8b92dad-5f42-4587-a5bb-0cacae0c6980.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279054_b8dd4dc2-fbc9-4567-ba8c-5db8adcab2c4.dvc-exp-394d7a-3@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279054_d02287e4-d922-405e-b862-efa33f2c122a.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279054_faa62e23-582d-4c0f-ad6c-c665f935585e.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279055_1f0f890e-1df7-46aa-8640-1295b055c12d.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279082_c8db1eda-a0ad-4076-8174-d8ff08b42679.dvc-exp-394d7a-1@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 463B Sep 23 17:08 1079279082_8e112e0d-a26d-4da1-8d7d-5051bd02aecb.dvc-exp-394d7a-1@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279082_532603b3-99b3-48cd-ab39-a715cd5482a6.dvc-exp-394d7a-2@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 463B Sep 23 17:08 1079279082_51c237de-69bf-424a-ac1a-04654830fede.dvc-exp-394d7a-2@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279082_710efc24-7180-4166-95de-cd68131720e3.dvc-exp-394d7a-3@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 463B Sep 23 17:08 1079279082_d1ad85ec-1cb3-4c6c-9c45-511535fa542a.dvc-exp-394d7a-3@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279082_1f04ba84-227d-49ae-81b6-b1148ba704e6.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 463B Sep 23 17:08 1079279082_715cefcc-1949-42a2-baf9-d8d8008f67e9.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279083_b2ede287-0e7f-4de4-8a1d-19703c75faf3.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 463B Sep 23 17:08 1079279083_db51942d-1413-47c2-8283-8be41cac4537.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 462B Sep 23 17:08 1079279083_07a9a385-9d51-4c60-8872-99681d5c99e4.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 463B Sep 23 17:08 1079279083_2fdec533-2a87-45ef-987f-04a0f1818ce6.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 677B Sep 23 17:09 1079320217_f910a3d0-2f10-49de-8513-7c6488ed4219.dvc-exp-394d7a-1@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 677B Sep 23 17:09 1079320217_bcccba75-5245-4dfe-a80c-121f3dc4ba46.dvc-exp-394d7a-2@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 677B Sep 23 17:09 1079320217_4a878906-2e3d-4203-a2b4-95cb1570625d.dvc-exp-394d7a-3@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 677B Sep 23 17:09 1079320217_03ced2d5-3e34-4ef2-88cb-56ca4aaa97b0.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 677B Sep 23 17:09 1079320218_070cb665-5c5e-4e61-9ebf-5c98e6bfd267.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 677B Sep 23 17:09 1079320218_1e9d9b78-c16d-4ae5-b57e-019d89a64294.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 678B Sep 23 17:09 1079322839_1c42780f-af91-4847-9272-e24a6a2c6263.dvc-exp-394d7a-1@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 678B Sep 23 17:09 1079322839_4652a208-95b5-45c7-9689-de7c4b7aabd1.dvc-exp-394d7a-2@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 678B Sep 23 17:09 1079322840_46d804a4-943a-4b52-a98a-221cd279e50c.dvc-exp-394d7a-3@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 678B Sep 23 17:09 1079322840_b2bf062b-d473-4872-be13-90730e620b15.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r-- 1 ivan staff 678B Sep 23 17:09 1079322840_72714df1-101f-4996-b6ab-ddddc150aa00.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
In the /Users/ivan/Projects/dvc_playground/.dvc/tmp/exps/celery/broker/in
each messages:
{"method": "shutdown", "arguments": {}, "destination": null, "pattern": null, "matcher": null}
so, next time it starts some workers can go down probably because of these messages.
Ah, nice reference and awesome work by @nablabits indeed.
By following the thread, using dvc exp clean
seems to me like a nice workaround for now! :+1:
I also confirmed it that running this command before, so basically:
dvc exp clean && dvc queue remove --all && repeat 30 {dvc exp run -f --queue} && dvc queue start -j 6 && watch dvc queue status
resolves the issue, which is quite nice already! :smile_cat:
Closing in favor of https://github.com/iterative/dvc/issues/10427 then. Thanks @lefos99 for confirming.
Bug Report
Description
When using dvc queue to queue multiple experiments, the worker picks up one or two experiments, runs them successfully, and then stops without processing the remaining experiments in the queue. I have to manually run
dvc queue start
again to resume processing the remaining queued experiments.Example:
If I run
dvc queue start
again it picks them up e.g.Reproduce
dvc exp run --queue
.dvc queue start
to process the experiments.dvc queue start
again to continue processing the remaining experiments.Expected
The worker should automatically continue to process all queued experiments without requiring manual intervention after the first batch.
Environment information
Output of
dvc doctor
:Additional Information (if any):