rlgraph / rlgraph

RLgraph: Modular computation graphs for deep reinforcement learning
Apache License 2.0

README example not working in Linux #49

Open bjg2 opened 5 years ago

bjg2 commented 5 years ago

Hi, I read the RLgraph paper and wanted to give it a try, so I set out to install it and run a few examples, but the first one breaks for me. Repro:

virtualenv -p python3 venv
source venv/bin/activate
pip install rlgraph
pip install rlgraph[ray]
pip install gym[atari]
pip install tensorflow-gpu
pip install psutil
pip install setproctitle

# Start ray on the head machine
ray start --head --redis-port 6379
# Optionally join to this cluster from other machines with ray start --redis-address=...

# Run script
python apex_pong.py

After ~1 minute it breaks with:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10000,84,84,4] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[node prioritized-replay/memorynext_states/Assign (defined at /media/bjg/storage/code/rlgraph/venv2/lib/python3.6/site-packages/rlgraph/spaces/box_space.py:192)  = Assign[T=DT_FLOAT, _grappler_relax_allocator_constraints=true, use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](prioritized-replay/memorynext_states, prioritized-replay/memorynext_states/Initializer/Const)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Am I doing anything wrong, or is the default example not working on Linux?

Info about my machine:
OS: Ubuntu 18.04.1 LTS
CPU: AMD Threadripper
GPU: GeForce GTX 1080 Ti (11 GB VRAM)
RAM: 32 GB

Some screenshots are attached.

michaelschaarschmidt commented 5 years ago

Hey,

Many thanks for reporting this. We ran this with a V100, and I believe your GPU has less memory, so it cannot allocate the prioritized replay variables.

Can you change:

"memory_spec": { "type": "prioritized_replay", "capacity": 10000 },

To "capacity": 1000 and try again?

This memory is not actually used, because we update from external batches, but unless specified otherwise the variables are allocated on the GPU by default. If this is indeed the root cause, we should probably change that configuration; we just did not notice it because we were using V100s.
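For concreteness, here is a minimal sketch of the suggested change, written as a Python dict for illustration (the actual value lives in the example's JSON agent config):

# Sketch only: the real setting is in the JSON config used by apex_pong.py.
# Reducing "capacity" shrinks the preallocated replay tensors (e.g. the
# [capacity, 84, 84, 4] next_states buffer from the OOM message above).
memory_spec = {
    "type": "prioritized_replay",
    "capacity": 1000  # reduced from 10000
}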

bjg2 commented 5 years ago

I also had to change this, but that did the trick and it started working:

  "observe_spec": {
    "buffer_size": 1000
  },

But again, after a few minutes it died with:

Traceback (most recent call last):
  File "apex_pong.py", line 118, in <module>
    main(sys.argv)
  File "apex_pong.py", line 91, in main
    report_interval_min_seconds=30))
  File "/media/bjg/storage/code/rlgraph/venv2/lib/python3.6/site-packages/rlgraph/execution/ray/ray_executor.py", line 177, in execute_workload
    worker_steps_executed, update_steps, discarded, queue_inserts = self._execute_step()
  File "/media/bjg/storage/code/rlgraph/venv2/lib/python3.6/site-packages/rlgraph/execution/ray/apex/apex_executor.py", line 223, in _execute_step
    sampled_batch = ray.get(object_ids=replay_remote_task)
  File "/media/bjg/storage/code/rlgraph/venv2/lib/python3.6/site-packages/ray/worker.py", line 2211, in get
    raise value
ray.worker.RayTaskError: Invalid return value: likely worker died or was killed while executing the task; check previous logs or dmesg for errors.

Another question: during the run, the only progress output was something like 19-02-04 17:42:15:INFO:Executed 52224 Ray worker steps, 1536 update steps, (52224 of 2000000 (2.6112 %), discarded = 0, inserts = 272). Is there some other info available, like TensorBoard events?

michaelschaarschmidt commented 5 years ago

I believe the crash may be a Ray error related to your cluster setup, e.g.:

https://github.com/ray-project/ray/issues/3628
https://github.com/ray-project/ray/issues/3702

This is a stability problem in Ray itself that can be related to your Redis settings, number of workers, or memory usage (see the related issues). Maybe you are running out of RAM? Try resizing

 "apex_replay_spec": {
        "memory_spec": {
          "capacity": 2000000,
      }
}

to something significantly smaller. You can also reduce the number of memory shards ("num_replay_workers": 4), e.g. to 1; for a single node, I think one replay worker is definitely enough. A sketch of both changes is below.
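As a rough illustration, in Python-dict form (the real values live in the example's JSON config, the smaller capacity is just an arbitrary example, and the exact nesting of "num_replay_workers" should be taken from that file):

# Sketch only: adjust the example's JSON config accordingly.
apex_replay_spec = {
    "memory_spec": {
        "capacity": 200000    # arbitrary smaller example value; original is 2000000
    },
    "num_replay_workers": 1   # down from 4; one shard should suffice on a single node
}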

From experience, the new Ray backend does not seem to be entirely stable for high-throughput experiments and occasionally fails at random. It seems they have not yet clearly identified the root cause.

In terms of output: we only log this progress line to stdout, for performance reasons. After a workload is finished, you can request detailed stats for each worker from the executor (essentially the performance of each episode on each worker); see the sketch below. We could also pass loss values out of the executor and log them.
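Roughly like this (a sketch only; the accessor name result_by_worker is an assumption from memory, so please check rlgraph/execution/ray/ray_executor.py for the exact method):

# Sketch: per-worker stats after a finished workload.
# NOTE: result_by_worker() is an assumed method name, not a verified API.
result = executor.execute_workload(workload)  # as in apex_pong.py
print(result)                                 # aggregate throughput/timing info
worker_stats = executor.result_by_worker()    # assumed accessor for per-worker episode stats
print(worker_stats)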

Aside from that, we could of course look at summaries on the learner. What do you want to look at?