Open bjg2 opened 5 years ago
Hey,
many thanks for reporting this. We ran this with a V100 and I believe your GPU has less memory so it cannot allocate the prioritized replay variables.
Can you change:
"memory_spec": { "type": "prioritized_replay", "capacity": 10000 },
to "capacity": 1000 and try again?
This memory is not actually used because we update from external batches, but unless specified otherwise, the variables are allocated on the GPU by default. We should probably change that default (if this is indeed the root cause); we just did not notice it because we were using V100s.
I had to change this as well; that did the trick and it was working:
"observe_spec": {
"buffer_size": 1000
},
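For completeness, the two reductions together look roughly like this — only the modified fields are shown, with values taken from this thread; the rest of the config stays unchanged:

```python
# The two config reductions from this thread, side by side.
# Only the modified fields are shown; all other fields stay as in the
# original apex_pong example config.
agent_config_patch = {
    "memory_spec": {
        "type": "prioritized_replay",
        "capacity": 1000,       # down from 10000
    },
    "observe_spec": {
        "buffer_size": 1000,
    },
}
print(agent_config_patch)
```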
But again, after a few minutes it died with:
Traceback (most recent call last):
File "apex_pong.py", line 118, in <module>
main(sys.argv)
File "apex_pong.py", line 91, in main
report_interval_min_seconds=30))
File "/media/bjg/storage/code/rlgraph/venv2/lib/python3.6/site-packages/rlgraph/execution/ray/ray_executor.py", line 177, in execute_workload
worker_steps_executed, update_steps, discarded, queue_inserts = self._execute_step()
File "/media/bjg/storage/code/rlgraph/venv2/lib/python3.6/site-packages/rlgraph/execution/ray/apex/apex_executor.py", line 223, in _execute_step
sampled_batch = ray.get(object_ids=replay_remote_task)
File "/media/bjg/storage/code/rlgraph/venv2/lib/python3.6/site-packages/ray/worker.py", line 2211, in get
raise value
ray.worker.RayTaskError: Invalid return value: likely worker died or was killed while executing the task; check previous logs or dmesg for errors.
Another question: during the run, the only progress output was something like:
19-02-04 17:42:15:INFO:Executed 52224 Ray worker steps, 1536 update steps, (52224 of 2000000 (2.6112 %), discarded = 0, inserts = 272)
Is there some other info available, like TensorBoard events?
I believe the crash may be a Ray error related to your cluster setup, e.g.:
https://github.com/ray-project/ray/issues/3628 https://github.com/ray-project/ray/issues/3702
This is a stability problem in Ray itself which can relate to your Redis settings, number of workers, or memory usage (see the related issues). Maybe you are running out of RAM? Try resizing
"apex_replay_spec": {
"memory_spec": {
"capacity": 2000000,
}
}
to something significantly smaller. You can also reduce the number of memory shards ("num_replay_workers": 4), e.g. to 1; for a single node, one replay worker is definitely enough.
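To put rough numbers on the RAM question: 2,000,000 transitions is a lot. A back-of-the-envelope sketch, assuming 84x84 observations with a 4-frame stack and both state and next-state stored per transition — the dtype and layout are assumptions, since I have not checked how rlgraph actually stores them:

```python
# Back-of-the-envelope host-RAM estimate for the Ape-X replay memory.
# ASSUMPTIONS: 84x84 observations, 4-frame stack, and both state and
# next-state stored per transition; the storage dtype is unknown to
# me, so both uint8 and float32 are shown.
OBS_ELEMS = 84 * 84 * 4   # elements in one stacked observation
STATE_COPIES = 2          # state + next_state

def replay_ram_gib(capacity, bytes_per_elem):
    """Approximate replay state storage in GiB."""
    return capacity * OBS_ELEMS * STATE_COPIES * bytes_per_elem / 2**30

for dtype, nbytes in (("uint8", 1), ("float32", 4)):
    print(f"{dtype}: ~{replay_ram_gib(2_000_000, nbytes):.0f} GiB")
```

Even in the most compact case this exceeds 32 GB by a wide margin, so running out of RAM looks plausible; a much smaller capacity and a single replay shard are the first things to try.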
From experience, the new Ray backend does not seem entirely stable for high-throughput experiments and occasionally fails at random; it seems they have not clearly identified the root cause yet.
In terms of output: to stdout, we only log this progress line, for performance reasons. After a workload is finished, you can request detailed stats for each worker from the executor (essentially per-episode performance on each worker). We could also pass loss values out of the executor and log them.
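As a stopgap, the stdout progress lines can at least be parsed into plain numbers for plotting. A minimal sketch; the regex is fitted to the sample line quoted earlier in this thread and is an assumption, not a stable, documented format:

```python
import re

# Parse the stdout progress lines into numbers for plotting.
# NOTE: the pattern is fitted to the sample line in this thread and
# is an assumption, not a stable, documented format.
PROGRESS_RE = re.compile(
    r"Executed (?P<worker_steps>\d+) Ray worker steps, "
    r"(?P<update_steps>\d+) update steps, "
    r"\(\d+ of (?P<total_steps>\d+) .*"
    r"discarded = (?P<discarded>\d+), inserts = (?P<inserts>\d+)\)"
)

def parse_progress(line):
    """Return the progress counters as a dict of ints, or None."""
    m = PROGRESS_RE.search(line)
    return {k: int(v) for k, v in m.groupdict().items()} if m else None

sample = ("19-02-04 17:42:15:INFO:Executed 52224 Ray worker steps, "
          "1536 update steps, (52224 of 2000000 (2.6112 %), "
          "discarded = 0, inserts = 272)")
print(parse_progress(sample))
```

Feeding these dicts into a plotting library (or a TensorBoard summary writer) gives a crude throughput curve without touching the executor.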
Aside from that, we could of course look at summaries on the learner. What do you want to look at?
Hi, I read the RLgraph paper and wanted to give it a try, so I set up and ran a few examples, but the first one breaks for me. Repro:
After ~1 minute it breaks with:
Am I doing anything wrong, or is the default example not working on Linux?
Info about my machine: OS: Ubuntu 18.04.1 LTS; CPU: AMD Threadripper; GPU: GeForce GTX 1080 Ti; RAM: 32 GB; VRAM: 11 GB
Some screenshots: