Hi, I am running a distributed agent in Acme (/acme/examples/control/lp_local_d4pg.py) and I get the following error (it runs out of GPU memory):
I0522 20:03:38.349611 140210148575040 node.py:61] Reverb client connecting to: localhost:18200
Traceback (most recent call last):
  File "/home/lorenzo/acme/lib/python3.8/site-packages/launchpad/nodes/python/process_entry.py", line 80, in <module>
    app.run(main)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/lorenzo/acme/lib/python3.8/site-packages/launchpad/nodes/python/process_entry.py", line 75, in main
    functions[task_id]()
  File "/home/lorenzo/acme/lib/python3.8/site-packages/launchpad/nodes/python/node.py", line 71, in _construct_function
    return functools.partial(self._function, *args, **kwargs)()
  File "/home/lorenzo/acme/lib/python3.8/site-packages/launchpad/nodes/courier/node.py", line 106, in run
    instance = self._construct_instance() # pytype:disable=wrong-arg-types
  File "/home/lorenzo/acme/lib/python3.8/site-packages/launchpad/nodes/python/node.py", line 164, in _construct_instance
    return self._constructor(*args, **kwargs)
  File "/home/lorenzo/git/acme/acme/agents/tf/d4pg/agent_distributed.py", line 149, in actor
    networks = self._network_factory(self._environment_spec.actions)
  File "/home/lorenzo/git/acme/acme/agents/tf/d4pg/agent_distributed.py", line 67, in wrapped_network_factory
    networks_dict = network_factory(action_spec)
  File "/home/lorenzo/git/planning_2d/test_speed.py", line 118, in make_networks
    networks.DiscreteValuedHead(vmin, vmax, num_atoms),
  File "/home/lorenzo/acme/lib/python3.8/site-packages/sonnet/src/base.py", line 126, in __call__
    module.__init__(*args, **kwargs)
  File "/home/lorenzo/git/acme/acme/tf/networks/distributional.py", line 63, in __init__
    vmin = tf.convert_to_tensor(vmin)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1430, in convert_to_tensor_v2_with_dispatch
    return convert_to_tensor_v2(
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1436, in convert_to_tensor_v2
    return convert_to_tensor(
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
    return func(*args, **kwargs)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1566, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function
    return constant_op.constant(value, dtype, name=name)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 264, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 276, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 301, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 97, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/home/lorenzo/acme/lib/python3.8/site-packages/tensorflow/python/eager/context.py", line 554, in ensure_initialized
    context_handle = pywrap_tfe.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory
I tried on different hardware (rtx 1080 and rtx 1650) and I still get the same issue.
If I disable the GPU and run on CPU only, any number of programs can be spawned successfully. Obviously it's much slower, though.
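In case it helps to reproduce the CPU-only run, here is a minimal sketch of one common way to disable the GPU. It assumes the GPU is hidden through the `CUDA_VISIBLE_DEVICES` environment variable, set before TensorFlow initializes CUDA; the helper name `hide_gpus` is made up for illustration and is not part of Acme or Launchpad.

```python
import os


def hide_gpus():
    """Hide all CUDA devices from this process (hypothetical helper).

    This has to run before TensorFlow initializes its CUDA context,
    so it is safest to call it before importing TensorFlow at all.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


hide_gpus()

try:
    import tensorflow as tf

    # With the devices hidden, TensorFlow should report no GPUs.
    print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
except ImportError:
    # TensorFlow is not installed; the environment variable is still set.
    print("TensorFlow not available")
```

Processes started from this one normally inherit the environment variable, so I would expect the spawned Launchpad nodes to run on CPU as well.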
Any idea what causes the error?
I am posting it here instead of in the Acme repository because the issue seems related to Launchpad.