When executing dvc exp run --run-all, the worker hangs at some point (after finishing a small number of experiments, right before starting a new one). On one occasion this happened after two experiments; most recently after one.
Reproduce
Add multiple experiments to the queue with dvc exp run --queue
dvc exp run --run-all
Expected
All experiments are executed.
Environment information
I'm running this through GitHub Actions on a self-hosted runner (Ubuntu 22.04).
cat .dvc/tmp/exps/celery/dvc-exp-worker-1.out gives me this:
/app/venv/lib/python3.11/site-packages/celery/platforms.py:829: SecurityWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!
Please specify a different user using the --uid option.
User information: uid=0 euid=0 gid=0 egid=0
warnings.warn(SecurityWarning(ROOT_DISCOURAGED.format(
[2024-05-15 16:42:53,662: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
 -------------- dvc-exp-0b0771-1@localhost v5.4.0 (opalescent)
---------------- Linux-5.15.0-107-generic-x86_64-with-glibc2.35 2024-05-15 16:42:53

[tasks]
  . dvc.repo.experiments.queue.tasks.cleanup_exp
  . dvc.repo.experiments.queue.tasks.collect_exp
  . dvc.repo.experiments.queue.tasks.run_exp
  . dvc.repo.experiments.queue.tasks.setup_exp
  . dvc_task.proc.tasks.run
[2024-05-15 16:42:53,671: WARNING/MainProcess] /app/venv/lib/python3.11/site-packages/celery/worker/consumer/consumer.py:508: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine
whether broker connection retries are made during startup in Celery 6.0 and above.
If you wish to retain the existing behavior for retrying connections on startup,
you should set broker_connection_retry_on_startup to True.
warnings.warn(
[2024-05-15 16:42:53,671: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2024-05-15 16:42:53,671: INFO/MainProcess] Connected to filesystem://localhost//
[2024-05-15 16:42:53,673: INFO/MainProcess] dvc-exp-0b0771-1@localhost ready.
[2024-05-15 16:42:53,674: INFO/MainProcess] Task dvc.repo.experiments.queue.tasks.run_exp[093a3dbc-da8f-4222-a839-e015a20dd6c2] received
[2024-05-15 20:05:07,673: WARNING/MainProcess] No hostname was supplied. Reverting to default 'localhost'
[2024-05-15 20:26:58,967: CRITICAL/MainProcess] Unrecoverable error: JSONDecodeError('Expecting value: line 1 column 1 (char 0)')
Traceback (most recent call last):
File "/app/venv/lib/python3.11/site-packages/celery/worker/worker.py", line 202, in start
self.blueprint.start(self)
File "/app/venv/lib/python3.11/site-packages/celery/bootsteps.py", line 116, in start
step.start(parent)
File "/app/venv/lib/python3.11/site-packages/celery/bootsteps.py", line 365, in start
return self.obj.start()
^^^^^^^^^^^^^^^^
File "/app/venv/lib/python3.11/site-packages/celery/worker/consumer/consumer.py", line 340, in start
blueprint.start(self)
File "/app/venv/lib/python3.11/site-packages/celery/bootsteps.py", line 116, in start
step.start(parent)
File "/app/venv/lib/python3.11/site-packages/celery/worker/consumer/consumer.py", line 746, in start
c.loop(*c.loop_args())
File "/app/venv/lib/python3.11/site-packages/celery/worker/loops.py", line 130, in synloop
connection.drain_events(timeout=2.0)
File "/app/venv/lib/python3.11/site-packages/kombu/connection.py", line 341, in drain_events
return self.transport.drain_events(self.connection, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/venv/lib/python3.11/site-packages/kombu/transport/virtual/base.py", line 997, in drain_events
get(self._deliver, timeout=timeout)
File "/app/venv/lib/python3.11/site-packages/kombu/utils/scheduling.py", line 55, in get
return self.fun(resource, callback, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/venv/lib/python3.11/site-packages/kombu/transport/virtual/base.py", line 1035, in _drain_channel
return channel.drain_events(callback=callback, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/venv/lib/python3.11/site-packages/kombu/transport/virtual/base.py", line 754, in drain_events
return self._poll(self.cycle, callback, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/venv/lib/python3.11/site-packages/kombu/transport/virtual/base.py", line 414, in _poll
return cycle.get(callback)
^^^^^^^^^^^^^^^^^^^
File "/app/venv/lib/python3.11/site-packages/kombu/utils/scheduling.py", line 55, in get
return self.fun(resource, callback, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/venv/lib/python3.11/site-packages/kombu/transport/virtual/base.py", line 417, in _get_and_deliver
message = self._get(queue)
^^^^^^^^^^^^^^^^
File "/app/venv/lib/python3.11/site-packages/kombu/transport/filesystem.py", line 261, in _get
return loads(bytes_to_str(payload))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/venv/lib/python3.11/site-packages/kombu/utils/json.py", line 93, in loads
return _loads(s, object_hook=object_hook)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/json/__init__.py", line 359, in loads
return cls(**kw).decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
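The traceback bottoms out in kombu's filesystem transport: `_get()` reads a message file from the broker directory and feeds it to `json.loads`. A zero-byte (or partially flushed) message file reproduces the same error in isolation; the following is just a sketch of that failure mode, not dvc code:

```python
import json

# kombu's filesystem transport does roughly this in Channel._get():
# read the oldest *.msg file from the broker directory and JSON-decode it.
# An empty or half-written file yields exactly the error from the log above.
payload = b""  # simulates an empty/partially written message file

try:
    json.loads(payload.decode())
except json.JSONDecodeError as exc:
    print(exc)  # Expecting value: line 1 column 1 (char 0)
```

So the "unrecoverable error" looks like the broker picking up a message file that was not (yet) fully written, which then kills the worker loop instead of being skipped or retried.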
Output of dvc doctor:

DVC version: 3.50.2 (pip)
Platform: Python 3.11.9 on Linux-5.15.0-107-generic-x86_64-with-glibc2.35
Subprojects:
	dvc_data = 3.15.1
	dvc_objects = 5.1.0
	dvc_render = 1.0.2
	dvc_task = 0.4.0
	scmrepo = 3.3.3
Supports:
	http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2024.3.1, boto3 = 1.34.69)
Config:
	Global: /github/home/.config/dvc
	System: /etc/xdg/dvc
Cache types: https://error.dvc.org/no-dvc-cache
Caches: local
Remotes: s3
This looks very similar to https://github.com/iterative/dvc/issues/10398, though as far as I can see no solution was proposed there.
The experiments that did run before the hang completed successfully.
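As a diagnostic (not a fix), one can scan the celery broker directory for message files that would crash the worker this way. The directory location below is an assumption inferred from where the worker log lives in this report (`.dvc/tmp/exps/celery/`), and `find_bad_messages` is a hypothetical helper:

```python
import json
from pathlib import Path

# Hypothetical location, inferred from the worker-log path in this report.
broker_dir = Path(".dvc/tmp/exps/celery")

def find_bad_messages(root: Path) -> list[Path]:
    """Return message files whose contents are not valid JSON,
    i.e. files that would raise JSONDecodeError inside kombu."""
    bad = []
    for msg in root.rglob("*.msg"):
        try:
            json.loads(msg.read_text() or "")
        except json.JSONDecodeError:
            bad.append(msg)
    return bad

if broker_dir.exists():
    for p in find_bad_messages(broker_dir):
        print("corrupt message file:", p)
```

If such a file shows up right after a hang, that would support the theory that a partially written broker message is what takes the worker down.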