iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

queue start: worker doesn't process all experiments automatically #10562

Closed lefos99 closed 3 weeks ago

lefos99 commented 4 weeks ago

Bug Report

Description

When using dvc queue to queue multiple experiments, the worker picks up one or two experiments, runs them successfully, and then stops without processing the remaining experiments in the queue. I have to manually run dvc queue start again to resume processing the remaining queued experiments.

Example:

dvc queue status

Task     Name                                 Created       Status
9294987  tumor_dysp_weight_hsv_rgb_min_mean   Sep 19, 2024  Queued
22c83f3  tumor_dysp_weight_hsv_rgb_mean_mean  Sep 19, 2024  Queued
6f3c8e9  tumor_dysp_weight_hsv_rgb_min_min    Sep 19, 2024  Queued
f1ce1d2  tumor_dysp_weight_hard_min_mean      Sep 19, 2024  Queued
3775500  tumor_dysp_weight_hard_mean_mean     Sep 19, 2024  Success
10f8332  tumor_dysp_weight_hard_min_min       Sep 19, 2024  Success

Worker status: 0 active, 0 idle

If I run dvc queue start again, it picks them up, e.g.:

dvc queue status

Task     Name                                 Created       Status
9294987  tumor_dysp_weight_hsv_rgb_min_mean   Sep 19, 2024  Running
22c83f3  tumor_dysp_weight_hsv_rgb_mean_mean  Sep 19, 2024  Queued
6f3c8e9  tumor_dysp_weight_hsv_rgb_min_min    Sep 19, 2024  Queued
f1ce1d2  tumor_dysp_weight_hard_min_mean      Sep 19, 2024  Success
3775500  tumor_dysp_weight_hard_mean_mean     Sep 19, 2024  Success
10f8332  tumor_dysp_weight_hard_min_min       Sep 19, 2024  Success

Worker status: 1 active, 0 idle

Reproduce

  1. Queue multiple experiments (e.g., 6 experiments) with dvc exp run --queue.
  2. Run dvc queue start to process the experiments.
  3. Observe that only a couple of experiments are processed, while others remain queued.
  4. Manually run dvc queue start again to continue processing the remaining experiments.

Expected

The worker should automatically continue to process all queued experiments without requiring manual intervention after the first batch.

Environment information

Output of dvc doctor:

DVC version: 3.55.2 (pip)
-------------------------
Platform: Python 3.10.14 on Linux-5.15.0-122-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.16.5
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.7
Supports:
        http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.9.0, boto3 = 1.35.16),
        ssh (sshfs = 2024.6.0)
Config:
        Global: /fastdata/users/lefos/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/fastdatagroup-fastdatavolume
Caches: local
Remotes: local, s3
Workspace directory: ext4 on /dev/mapper/fastdatagroup-fastdatavolume
Repo: dvc (subdir), git
Repo.site_cache_dir: /var/tmp/dvc/repo/6a4a3cd6bb6194a5a62079d63dd05e18

Additional Information (if any):

shcheklein commented 3 weeks ago

Is there a small repo to reproduce this? How long does it take to run a single experiment? Does dvc exp show show them as running, by chance? (Could it be a reporting problem like https://github.com/iterative/dvc/issues/10427?)

I've tried the following:

dvc.yaml:

stages:
  run:
    cmd: sleep 1; echo ${i} > metric.json
    outs:
      - metric.json

params.yaml:

i: 1

Running like:

dvc exp run --queue -S "i=range(0,10,1)"

then:

dvc queue start

After a while, all the experiments are done.

If you can reproduce it, please try to come up with a simple repo that we could use; otherwise it's quite hard to understand why it is happening. Thanks!

lefos99 commented 3 weeks ago

Thanks for the reply, @shcheklein ! :smile_cat:

How long does it take to run a single experiment?

A single experiment takes around 46 minutes to run.

Does dvc exp show show them running by chance?

I didn't check dvc exp show, but dvc queue status shows the experiments as still in the queue while the worker is not running. I also verified that the experiments indeed didn't run (the expected output files were not there).

dvc queue status

Task     Name                                 Created       Status
9294987  tumor_dysp_weight_hsv_rgb_min_mean   Sep 19, 2024  Queued
22c83f3  tumor_dysp_weight_hsv_rgb_mean_mean  Sep 19, 2024  Queued
6f3c8e9  tumor_dysp_weight_hsv_rgb_min_min    Sep 19, 2024  Queued
f1ce1d2  tumor_dysp_weight_hard_min_mean      Sep 19, 2024  Queued
3775500  tumor_dysp_weight_hard_mean_mean     Sep 19, 2024  Success
10f8332  tumor_dysp_weight_hard_min_min       Sep 19, 2024  Success

Worker status: 0 active, 0 idle

If you can reproduce it, please try to come up with a simple repo that we could use; otherwise it's quite hard to understand why it is happening. Thanks!

I totally understand, but the problem is that sometimes it happens and sometimes it does not. Nevertheless, I will try to reproduce it with a small repo and report back.

lefos99 commented 3 weeks ago

Actually, reproducing it was easier than I thought.

mkdir dvc_playground && cd dvc_playground
git init && dvc init && git add -A && git commit -m "First commit"
mkdir out data
echo lala >> data/dummy.txt && dvc add data/dummy.txt

echo "from time import perf_counter
from time import sleep
start = perf_counter()
print('Hello everyone')
sleep(2)
end = perf_counter()
print('I am done')
with open('out/how_much_time.txt', 'w') as f:
    f.write(f'Time elapsed: {end-start}')
" > run_train.py

echo "stages:
  train:
      cmd: python run_train.py
      deps:
        - data/dummy.txt
      outs: 
        - out/how_much_time.txt
"> dvc.yaml

git add -A && git commit -m 'Add DVC stuff'

Then one can run the following to remove old experiments from the queue and start 30 new experiments with 6 workers (note that repeat N { ... } is zsh syntax; in bash, an equivalent is for i in $(seq 30); do dvc exp run -f --queue; done):

dvc queue remove --all && repeat 30 {dvc exp run -f --queue} && dvc queue start -j 6 && watch dvc queue status

What I observed is that the unexpected behavior does not always happen, but it happens pretty often: I ran the above command 10 times and it failed 5 out of 10 (50%). So I ended up with:

Task     Name        Created    Status
46b886b  muddy-fare  12:47 PM   Queued
0741a83  sandy-bunk  12:47 PM   Queued
234edb7  nowed-pupa  12:47 PM   Queued
d419c3f  dorty-wain  12:47 PM   Queued
3ad8d1b  cynic-toff  12:47 PM   Queued
9bca7a1  beery-gyms  12:47 PM   Queued
5e41e1c  massy-ewes  12:47 PM   Queued
bfdf673  smoky-jeer  12:47 PM   Queued
81d3057  salic-sena  12:47 PM   Queued
c359d35  quack-runs  12:47 PM   Queued
2f715a1  trade-libs  12:47 PM   Queued
facc3ac  so-so-wage  12:47 PM   Queued
4f6950e  silty-hits  12:47 PM   Queued
39e29b0  gawky-merk  12:47 PM   Queued
771ca8c  fifty-dabs  12:47 PM   Queued
d86614a  weird-lure  12:47 PM   Queued
035dcd8  coaly-prob  12:47 PM   Queued
e6dc2b5  dolce-kaka  12:47 PM   Queued
f9439ae  gusty-daks  12:47 PM   Queued
df2fdf4  goosy-iron  12:47 PM   Queued
a4bf7ec  tangy-dees  12:47 PM   Queued
82f5770  natal-sand  12:47 PM   Queued
5f1401b  cedar-flip  12:47 PM   Queued
e6d98b4  mousy-sinh  12:47 PM   Queued
56116a0  dosed-knap  12:46 PM   Success
bf66773  pricy-buhl  12:46 PM   Success
0293e5f  sarky-snip  12:46 PM   Success
5833eb6  swish-heir  12:46 PM   Success
275e77a  hexed-cats  12:46 PM   Success
9c1bf71  hyoid-soul  12:47 PM   Success

Worker status: 0 active, 0 idle

I also managed to reproduce this behavior on 3 different machines.

shcheklein commented 3 weeks ago

Yep, I think it is really the same as https://github.com/iterative/dvc/issues/10427, as @nablabits did some excellent research there. After the first run, I see a bunch of messages left in the broker:

-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_db04a69e-e33c-4d26-9632-66031a0bac9f.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_4cb7a50e-739a-44bb-a110-15d371487f21.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_b9843cfa-d030-43f6-a257-4fabbb72030e.dvc-exp-394d7a-1@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_a73b40a0-0f35-4654-a1e1-d72bdad04fe4.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_ea17010c-08c9-4eb9-8199-0ee00cf50f59.dvc-exp-394d7a-2@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_c8b92dad-5f42-4587-a5bb-0cacae0c6980.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_b8dd4dc2-fbc9-4567-ba8c-5db8adcab2c4.dvc-exp-394d7a-3@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_d02287e4-d922-405e-b862-efa33f2c122a.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279054_faa62e23-582d-4c0f-ad6c-c665f935585e.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279055_1f0f890e-1df7-46aa-8640-1295b055c12d.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279082_c8db1eda-a0ad-4076-8174-d8ff08b42679.dvc-exp-394d7a-1@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   463B Sep 23 17:08 1079279082_8e112e0d-a26d-4da1-8d7d-5051bd02aecb.dvc-exp-394d7a-1@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279082_532603b3-99b3-48cd-ab39-a715cd5482a6.dvc-exp-394d7a-2@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   463B Sep 23 17:08 1079279082_51c237de-69bf-424a-ac1a-04654830fede.dvc-exp-394d7a-2@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279082_710efc24-7180-4166-95de-cd68131720e3.dvc-exp-394d7a-3@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   463B Sep 23 17:08 1079279082_d1ad85ec-1cb3-4c6c-9c45-511535fa542a.dvc-exp-394d7a-3@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279082_1f04ba84-227d-49ae-81b6-b1148ba704e6.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   463B Sep 23 17:08 1079279082_715cefcc-1949-42a2-baf9-d8d8008f67e9.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279083_b2ede287-0e7f-4de4-8a1d-19703c75faf3.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   463B Sep 23 17:08 1079279083_db51942d-1413-47c2-8283-8be41cac4537.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   462B Sep 23 17:08 1079279083_07a9a385-9d51-4c60-8872-99681d5c99e4.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   463B Sep 23 17:08 1079279083_2fdec533-2a87-45ef-987f-04a0f1818ce6.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   677B Sep 23 17:09 1079320217_f910a3d0-2f10-49de-8513-7c6488ed4219.dvc-exp-394d7a-1@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   677B Sep 23 17:09 1079320217_bcccba75-5245-4dfe-a80c-121f3dc4ba46.dvc-exp-394d7a-2@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   677B Sep 23 17:09 1079320217_4a878906-2e3d-4203-a2b4-95cb1570625d.dvc-exp-394d7a-3@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   677B Sep 23 17:09 1079320217_03ced2d5-3e34-4ef2-88cb-56ca4aaa97b0.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   677B Sep 23 17:09 1079320218_070cb665-5c5e-4e61-9ebf-5c98e6bfd267.dvc-exp-394d7a-5@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   677B Sep 23 17:09 1079320218_1e9d9b78-c16d-4ae5-b57e-019d89a64294.dvc-exp-394d7a-4@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   678B Sep 23 17:09 1079322839_1c42780f-af91-4847-9272-e24a6a2c6263.dvc-exp-394d7a-1@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   678B Sep 23 17:09 1079322839_4652a208-95b5-45c7-9689-de7c4b7aabd1.dvc-exp-394d7a-2@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   678B Sep 23 17:09 1079322840_46d804a4-943a-4b52-a98a-221cd279e50c.dvc-exp-394d7a-3@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   678B Sep 23 17:09 1079322840_b2bf062b-d473-4872-be13-90730e620b15.dvc-exp-394d7a-6@localhost.celery.pidbox.msg
-rw-r--r--   1 ivan  staff   678B Sep 23 17:09 1079322840_72714df1-101f-4996-b6ab-ddddc150aa00.dvc-exp-394d7a-5@localhost.celery.pidbox.msg

These are left in /Users/ivan/Projects/dvc_playground/.dvc/tmp/exps/celery/broker/in, and each message contains:

{"method": "shutdown", "arguments": {}, "destination": null, "pattern": null, "matcher": null}

So, the next time the queue starts, some of the workers probably go down because of these leftover shutdown messages.
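To check for this state in a repo, one can scan the broker directory for pidbox messages carrying a shutdown command. A minimal sketch, assuming each .msg file can be read as the plain JSON shown in the listing above (the broker path is taken from that listing and may differ per setup):

```python
import json
from pathlib import Path

def leftover_shutdowns(broker_in: Path) -> list[Path]:
    """List pidbox message files whose body is a 'shutdown' control command."""
    stale = []
    if not broker_in.is_dir():
        return stale
    for msg_file in sorted(broker_in.glob("*.pidbox.msg")):
        try:
            payload = json.loads(msg_file.read_text())
        except (OSError, json.JSONDecodeError):
            continue  # unreadable, or not stored as plain JSON; skip it
        if isinstance(payload, dict) and payload.get("method") == "shutdown":
            stale.append(msg_file)
    return stale

if __name__ == "__main__":
    # Repo-relative broker inbox path, as shown in the listing above.
    broker_in = Path(".dvc/tmp/exps/celery/broker/in")
    for path in leftover_shutdowns(broker_in):
        print(path.name)
```

If this reports stale shutdown messages right before a dvc queue start, that would support the theory that the workers consume them and exit early.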

lefos99 commented 3 weeks ago

Ah, nice reference and awesome work by @nablabits indeed.

Following the thread, using dvc exp clean seems like a nice workaround for now! :+1: I also confirmed that running this command first, i.e.:

dvc exp clean && dvc queue remove --all && repeat 30 {dvc exp run -f --queue} && dvc queue start -j 6 && watch dvc queue status

resolves the issue, which is quite nice already! :smile_cat:

shcheklein commented 3 weeks ago

Closing in favor of https://github.com/iterative/dvc/issues/10427 then. Thanks @lefos99 for confirming.