google-deepmind / launchpad

Run out of GPU memory when using Launchpad #8

Closed Idate96 closed 3 years ago

Idate96 commented 3 years ago

Hi, I am running a distributed agent in Acme (/acme/examples/control/lp_local_d4pg.py) and I get the following error (out of GPU memory):

I0522 20:03:38.349611 140210148575040 node.py:61] Reverb client connecting to: localhost:18200
Traceback (most recent call last):
  File "/home/lorenzo/acme/lib/python3.8/site-packages/launchpad/nodes/python/process_entry.py", line 80, in <module>
    app.run(main)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/lorenzo/acme/lib/python3.8/site-packages/launchpad/nodes/python/process_entry.py", line 75, in main
    functions[task_id]()
  File "/home/lorenzo/acme/lib/python3.8/site-packages/launchpad/nodes/python/node.py", line 71, in _construct_function
    return functools.partial(self._function, *args, **kwargs)()
  File "/home/lorenzo/acme/lib/python3.8/site-packages/launchpad/nodes/courier/node.py", line 106, in run
    instance = self._construct_instance()  # pytype:disable=wrong-arg-types
  File "/home/lorenzo/acme/lib/python3.8/site-packages/launchpad/nodes/python/node.py", line 164, in _construct_instance
    return self._constructor(*args, **kwargs)
  File "/home/lorenzo/git/acme/acme/agents/tf/d4pg/agent_distributed.py", line 149, in actor
    networks = self._network_factory(self._environment_spec.actions)
  File "/home/lorenzo/git/acme/acme/agents/tf/d4pg/agent_distributed.py", line 67, in wrapped_network_factory
    networks_dict = network_factory(action_spec)
  File "/home/lorenzo/git/planning_2d/test_speed.py", line 118, in make_networks
    networks.DiscreteValuedHead(vmin, vmax, num_atoms),
  File "/home/lorenzo/acme/lib/python3.8/site-packages/sonnet/src/base.py", line 126, in __call__
    module.__init__(*args, **kwargs)
  File "/home/lorenzo/git/acme/acme/tf/networks/distributional.py", line 63, in __init__
    vmin = tf.convert_to_tensor(vmin)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1430, in convert_to_tensor_v2_with_dispatch
    return convert_to_tensor_v2(
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1436, in convert_to_tensor_v2
    return convert_to_tensor(
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
    return func(*args, **kwargs)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 264, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 276, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 301, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 97, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/eager/context.py", line 554, in ensure_initialized
    context_handle = pywrap_tfe.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory

I tried different hardware (GTX 1080 and GTX 1650) and I still get the same issue. If I disable the GPU and run on CPU only, any number of programs can be spawned successfully. Obviously it's much slower though.

Any idea what causes the error?

I am posting it here instead of in Acme because the issue seems related to Launchpad.

Idate96 commented 3 years ago

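The issue has been solved; see https://github.com/deepmind/acme/issues/121. The problem is caused by the way TF manages GPU memory: by default, a TensorFlow process reserves nearly all of the GPU's memory when it first initializes the device, so the other processes spawned by Launchpad fail with an out-of-memory error.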
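
For anyone landing here later, a minimal sketch of one common workaround: ask TensorFlow to allocate GPU memory on demand instead of reserving it all up front. This is not necessarily the exact fix from the linked Acme issue. The helper name `configure_gpu_memory_growth` is made up for illustration, and I am assuming it gets called at the very start of every process (learner, actors, evaluator) before any tensor is created, since memory growth cannot be changed once the GPU has been initialized.

```python
import os


def configure_gpu_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand (hypothetical helper).

    Must run before TensorFlow initializes the GPU in this process.
    Returns the number of GPUs that were configured.
    """
    # Environment-variable form of the same setting; child processes
    # started from this one inherit it.
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed, so there is nothing to configure.
        return 0

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Raises RuntimeError if the GPU has already been initialized.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


if __name__ == "__main__":
    print("GPUs configured:", configure_gpu_memory_growth())
```

If only the learner needs the GPU, another option is to hide it from the actor processes entirely, for example by calling `tf.config.set_visible_devices([], "GPU")` at the start of each actor. Whether that is acceptable depends on how expensive actor inference is in your setup.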