Open shawn9995 opened 2 weeks ago
For some additional context, I have managed to produce a different error in an existing venv that occurs farther along in execution, but I haven't been able to replicate it in a fresh venv. Notably, it occurs after the dreamer_model is created:
(DreamerV3 pid=16827) Install gputil for GPU system monitoring.
(DreamerV3 pid=16827) Model: "dreamer_model"
(DreamerV3 pid=16827) _________________________________________________________________
(DreamerV3 pid=16827) Layer (type) Output Shape Param #
(DreamerV3 pid=16827) =================================================================
(DreamerV3 pid=16827) world_model (WorldModel) multiple 0 (unused)
(DreamerV3 pid=16827) |¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|
(DreamerV3 pid=16827) | vector_encoder (MLP) multiple 1536 |
(DreamerV3 pid=16827) ||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯||
(DreamerV3 pid=16827) || dense (Dense) multiple 1024 ||
(DreamerV3 pid=16827) || ||
...
...
(DreamerV3 pid=16827) | reward_layer_255buckets ( multiple 65535 |
(DreamerV3 pid=16827) | RewardPredictorLayer) |
(DreamerV3 pid=16827) ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
(DreamerV3 pid=16827) =================================================================
(DreamerV3 pid=16827) Total params: 3551238 (13.55 MB)
(DreamerV3 pid=16827) Trainable params: 3157509 (12.04 MB)
(DreamerV3 pid=16827) Non-trainable params: 393729 (1.50 MB)
(DreamerV3 pid=16827) _________________________________________________________________
Trial status: 1 RUNNING
Current time: 2024-11-06 14:09:41. Total running time: 30s
Logical resource usage: 1.0/12 CPUs, 0/0 GPUs
╭──────────────────────────────────────────────╮
│ Trial name status │
├──────────────────────────────────────────────┤
│ DreamerV3_CartPole-v1_5e43c_00000 RUNNING │
╰──────────────────────────────────────────────╯
Trial status: 1 RUNNING
Current time: 2024-11-06 14:10:11. Total running time: 1min 0s
Logical resource usage: 1.0/12 CPUs, 0/0 GPUs
╭──────────────────────────────────────────────╮
│ Trial name status │
├──────────────────────────────────────────────┤
│ DreamerV3_CartPole-v1_5e43c_00000 RUNNING │
╰──────────────────────────────────────────────╯
2024-11-06 14:10:30,548 ERROR tune_controller.py:1331 -- Trial task failed for trial DreamerV3_CartPole-v1_5e43c_00000
Traceback (most recent call last):
File "/Users/markstephenson/avslab/.venv/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/Users/markstephenson/avslab/.venv/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/Users/markstephenson/avslab/.venv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/Users/markstephenson/avslab/.venv/lib/python3.10/site-packages/ray/_private/worker.py", line 2661, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/Users/markstephenson/avslab/.venv/lib/python3.10/site-packages/ray/_private/worker.py", line 871, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(KeyError): ray::DreamerV3.train() (pid=16827, ip=127.0.0.1, actor_id=725f37650ad104326a90233e01000000, repr=DreamerV3)
File "/Users/markstephenson/avslab/.venv/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/Users/markstephenson/avslab/.venv/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 328, in train
result = self.step()
File "/Users/markstephenson/avslab/.venv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 969, in step
self.env_runner_group.sync_env_runner_states(
File "/Users/markstephenson/avslab/.venv/lib/python3.10/site-packages/ray/rllib/env/env_runner_group.py", line 395, in sync_env_runner_states
self.local_env_runner.set_state(
File "/Users/markstephenson/avslab/.venv/lib/python3.10/site-packages/ray/rllib/algorithms/dreamerv3/utils/env_runner.py", line 555, in set_state
self.module.set_state(state[COMPONENT_RL_MODULE][DEFAULT_MODULE_ID])
KeyError: 'rl_module'
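For anyone reading the traceback: the last frame does a nested lookup, `state[COMPONENT_RL_MODULE][DEFAULT_MODULE_ID]`, and the `KeyError: 'rl_module'` means the state dict handed to `set_state` had no top-level `'rl_module'` entry. Below is a minimal sketch of that failure mode using plain dicts only, not RLlib itself. The value of `COMPONENT_RL_MODULE` is taken from the error message; the value of `DEFAULT_MODULE_ID`, the helper `first_missing_key`, and the contents of `incomplete_state` are assumptions made up for illustration.

```python
# Plain-dict illustration of the nested lookup that fails in the traceback.
COMPONENT_RL_MODULE = "rl_module"     # matches the key named in the KeyError
DEFAULT_MODULE_ID = "default_policy"  # assumed value, for illustration only


def first_missing_key(state):
    """Hypothetical helper: return the first absent key on the lookup path, or None."""
    if COMPONENT_RL_MODULE not in state:
        return COMPONENT_RL_MODULE
    if DEFAULT_MODULE_ID not in state[COMPONENT_RL_MODULE]:
        return DEFAULT_MODULE_ID
    return None


# Invented example of a state dict that lacks the 'rl_module' entry.
incomplete_state = {"some_other_component": {}}

try:
    incomplete_state[COMPONENT_RL_MODULE][DEFAULT_MODULE_ID]
except KeyError as exc:
    # Same exception type as in the traceback above.
    print("KeyError:", exc)

print("first missing key:", first_missing_key(incomplete_state))
```

This only shows which key is absent. It does not explain why the state passed to `set_state` lacks it in this run.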
Associated package versions are here:
keras 2.15.0
ray 2.35.0
tensorflow 2.15.0
tensorflow-estimator 2.15.0
tensorflow-io-gcs-filesystem 0.37.1
tensorflow-macos 2.15.0
tensorflow-probability 0.23.0
The latter issue I encountered seems to be this unresolved one: https://github.com/ray-project/ray/issues/47527
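Since the two venvs differ in several relevant packages (the existing venv above has ray 2.35.0 with tensorflow 2.15.0 and keras 2.15.0, while the fresh venv has ray 2.38.0 with tensorflow 2.18.0 and keras 3.6.0), it may help to dump the same short list of versions from each environment and compare them side by side. A small sketch using only the standard library's `importlib.metadata` follows. The `PACKAGES` list and the helper name `installed_versions` are my own choices for illustration, not anything from Ray.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages that seem most relevant to this issue; adjust as needed.
PACKAGES = [
    "ray",
    "tensorflow",
    "tensorflow-macos",
    "tensorflow-probability",
    "keras",
    "tf_keras",
    "gymnasium",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name} {ver or 'not installed'}")
```

Run it with each venv's interpreter and diff the two outputs.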
What happened + What you expected to happen
I am trying to run a regression test on the cartpole example and am running into the issue below.
Versions / Dependencies
absl-py 2.1.0
aiosignal 1.3.1
astunparse 1.6.3
attrs 24.2.0
certifi 2024.8.30
charset-normalizer 3.4.0
click 8.1.7
cloudpickle 3.1.0
decorator 5.1.1
dm-tree 0.1.8
Farama-Notifications 0.0.4
filelock 3.16.1
flatbuffers 24.3.25
frozenlist 1.5.0
fsspec 2024.10.0
gast 0.6.0
google-pasta 0.2.0
grpcio 1.67.1
gymnasium 0.28.1
h5py 3.12.1
idna 3.10
imageio 2.36.0
jax-jumpy 1.0.0
jsonschema 4.23.0
jsonschema-specifications 2024.10.1
keras 3.6.0
lazy_loader 0.4
libclang 18.1.1
lz4 4.3.3
Markdown 3.7
markdown-it-py 3.0.0
MarkupSafe 3.0.2
mdurl 0.1.2
ml-dtypes 0.4.1
msgpack 1.1.0
namex 0.0.8
networkx 3.4.2
numpy 2.0.2
opt_einsum 3.4.0
optree 0.13.0
packaging 24.1
pandas 2.2.3
pillow 11.0.0
pip 24.1.2
protobuf 5.28.3
pyarrow 18.0.0
Pygments 2.18.0
python-dateutil 2.9.0.post0
pytz 2024.2
PyYAML 6.0.2
ray 2.38.0
referencing 0.35.1
requests 2.32.3
rich 13.9.4
rpds-py 0.21.0
scikit-image 0.24.0
scipy 1.14.1
setuptools 70.3.0
shellingham 1.5.4
six 1.16.0
tensorboard 2.18.0
tensorboard-data-server 0.7.2
tensorboardX 2.6.2.2
tensorflow 2.18.0
tensorflow-io-gcs-filesystem 0.37.1
tensorflow-probability 0.24.0
termcolor 2.5.0
tf_keras 2.18.0
tifffile 2024.9.20
typer 0.12.5
typing_extensions 4.12.2
tzdata 2024.2
urllib3 2.2.3
Werkzeug 3.1.2
wheel 0.44.0
wrapt 1.16.0
Reproduction script
Issue Severity
None