google-deepmind / acme

A library of reinforcement learning components and agents
Apache License 2.0

Unable to load Facebook NetHack NLE #90

Open MauriceManning opened 3 years ago

MauriceManning commented 3 years ago

I would like to load the Facebook NetHack NLE into Acme but am getting an error.

I used the fairnle/nle:stable NLE Docker container. Inside that container I installed Acme using these instructions: https://github.com/deepmind/acme#installation

Then I copied the examples/gym/run_d4pg.py example into the container, added the gym and nle imports, and used the NetHack-v0 environment in place of MountainCarContinuous-v0. I received this type mismatch, which I initially thought was an NLE error, but when I reported it to that team their response was:

"This sounds like NLE produces a uint8 tensor where Acme expects an int32 tensor. I'm fairly certain our gym environment sets the right kind of tensor attributes. Perhaps you want to take this issue up with the Acme team?"

root@4deea15bc74f:/opt/acme# python ./run_d4pg.py
2020-11-23 01:59:06.463040: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-23 01:59:06.463107: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
I1123 01:59:09.143711 140349876913984 base.py:165] Created savedir: /opt/acme/nle_data/20201123-015909_kl1qo32y
2020-11-23 01:59:09.172180: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-23 01:59:09.173271: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2020-11-23 01:59:09.174008: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (4deea15bc74f): /proc/driver/nvidia/version does not exist
2020-11-23 01:59:09.175436: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-23 01:59:09.188455: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2904000000 Hz
2020-11-23 01:59:09.189102: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x564ed899deb0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-23 01:59:09.189157: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[reverb/cc/platform/tfrecordcheckpointer.cc:143] Initializing TFRecordCheckpointer in /tmp/tmpy2tr1no
[reverb/cc/platform/tfrecordcheckpointer.cc:322] Loading latest checkpoint from /tmp/tmpy2tr1no
[reverb/cc/platform/default/server.cc:55] Started replay server on port 24693
WARNING:tensorflow:Entity <function _yield_value at 0x7fa5b78ee820> appears to be a generator function. It will not be converted by AutoGraph.
W1123 01:59:12.369593 140349876913984 ag_logging.py:146] Entity <function _yield_value at 0x7fa5b78ee820> appears to be a generator function. It will not be converted by AutoGraph.
Traceback (most recent call last):
  File "./run_d4pg.py", line 140, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "./run_d4pg.py", line 112, in main
    agent = d4pg.D4PG(
  File "/opt/conda/lib/python3.8/site-packages/acme/agents/tf/d4pg/agent.py", line 131, in __init__
    emb_spec = tf2_utils.create_variables(observation_network, [obs_spec])
  File "/opt/conda/lib/python3.8/site-packages/acme/tf/utils.py", line 103, in create_variables
    dummy_output = network(add_batch_dim(dummy_input))
  File "/opt/conda/lib/python3.8/site-packages/sonnet/src/utils.py", line 89, in _decorate_unbound_method
    return decorator_fn(bound_method, self, args, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/sonnet/src/base.py", line 272, in wrap_with_name_scope
    return method(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/acme/tf/utils.py", line 144, in __call__
    return self._transformation(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/acme/tf/utils.py", line 54, in batch_concat
    return tf.concat(tree.flatten(flat_leaves), axis=-1)
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/ops/array_ops.py", line 1654, in concat
    return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1207, in concat_v2
    _ops.raise_from_not_ok_status(e, name)
  File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 6843, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: cannot compute ConcatV2 as input #1(zero-based) was expected to be a int32 tensor but is a uint8 tensor [Op:ConcatV2] name: concat

edran commented 3 years ago

Hey there, NLE maintainer here.

FYI I'm trying to help on this issue on our side (tracked here: https://github.com/facebookresearch/nle/issues/49). I haven't properly looked at the d4pg example yet, but currently I'm assuming this is mostly a problem with adapting the example rather than with any of the packages.

However if you have any pointers in the meantime, please feel free to get in touch :)

MauriceManning commented 3 years ago

Has anyone on the Acme team (or just using Acme) had a chance to attempt to duplicate this issue? NLE looks like a great environment to experiment with the RL algos in Acme. thanks!

fastturtle commented 2 years ago

@MauriceManning it looks like one of the observations (or maybe an action) is a uint8 but is expected to be an int32. Have you done any further debugging on this?

JiaojiaoYe1994 commented 2 years ago

Hi everyone, I have run into a similar datatype problem: the model doesn't support uint8 data, but the observation is uint8. It would be nice if someone could tell me how to solve this.

ethanluoyc commented 2 years ago

I think the issue here is that the default D4PG networks are unable to handle the observations returned by the NLE environment.

The observation space for the NLE environment (obtained from colab) is:

import gym
import nle  # importing nle registers the NetHack environments with gym

env = gym.make("NetHack-v0")
env.observation_space
# > Dict(blstats:Box(-2147483648, 2147483647, (26,), int64), chars:Box(0, 255, (21, 79), uint8), colors:Box(0, 15, (21, 79), uint8), glyphs:Box(0, 5976, (21, 79), int16), inv_glyphs:Box(0, 5976, (55,), int16), inv_letters:Box(0, 127, (55,), uint8), inv_oclasses:Box(0, 18, (55,), uint8), inv_strs:Box(0, 255, (55, 80), uint8), message:Box(0, 255, (256,), uint8), screen_descriptions:Box(0, 127, (21, 79, 80), uint8), specials:Box(0, 255, (21, 79), uint8), tty_chars:Box(0, 255, (24, 80), uint8), tty_colors:Box(0, 31, (24, 80), int8), tty_cursor:Box(0, 255, (2,), uint8))

The observations mix several dtypes (int64, int16, uint8 and int8). The issue is quite old, but I believe the D4PG network tries to concatenate the observations in the dictionary, and since tf.concat requires all of its inputs to share one dtype, it raises the InvalidArgumentError from the original report.
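One way around the dtype mismatch is to cast every observation entry to a common dtype before concatenating. The following is only a minimal sketch of that idea in NumPy, not how Acme does it: `OBS_SPEC` is a hand-copied subset of the space printed above, `flatten_and_concat` is a hypothetical helper, and the observations are unbatched. In a TensorFlow network the same idea would apply `tf.cast` to each leaf before `tf.concat`.

```python
import numpy as np

# Shapes and dtypes copied from four entries of the observation space above.
OBS_SPEC = {
    "blstats": ((26,), np.int64),
    "chars": ((21, 79), np.uint8),
    "glyphs": ((21, 79), np.int16),
    "tty_colors": ((24, 80), np.int8),
}


def flatten_and_concat(observation, dtype=np.float32):
    """Hypothetical helper: cast each entry to `dtype`, flatten it, concatenate."""
    leaves = [
        np.asarray(observation[key]).astype(dtype).reshape(-1)
        for key in sorted(observation)  # fixed key order keeps the layout stable
    ]
    return np.concatenate(leaves, axis=-1)


# Dummy all-zero observation with the same shapes and dtypes as the spec.
dummy_obs = {key: np.zeros(shape, dtype=dt) for key, (shape, dt) in OBS_SPEC.items()}
flat = flatten_and_concat(dummy_obs)
print(flat.dtype, flat.shape)  # float32 (5264,)
```

This only removes the dtype error. Treating glyph ids and characters as one flat float vector is unlikely to be a useful input representation on its own.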

However, it's actually not very clear to me how the D4PG agent can be used for NetHack. IIUC, the action space of NLE environments is discrete, but D4PG should only be used for tasks with continuous actions, so it's likely incompatible with the NLE tasks. In principle, something like IMPALA or DQN should work with NLE, but you would probably still need a network architecture different from the default one.
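The discrete-versus-continuous check can be sketched as a quick test on the action dtype before picking an agent. This is an illustration only: `ActionSpec` is a hypothetical stand-in for whatever spec object the environment exposes, `suggested_agents` is not an Acme function, and the assumption is that integer-typed actions are discrete and float-typed actions are continuous.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ActionSpec:
    """Hypothetical stand-in for an environment's action spec."""
    shape: tuple
    dtype: type


def is_discrete(action_spec):
    # Assumption: integer-typed actions are discrete, float-typed are continuous.
    return bool(np.issubdtype(action_spec.dtype, np.integer))


def suggested_agents(action_spec):
    """Return the agent families from this thread that match the action type."""
    if is_discrete(action_spec):
        return ["DQN", "IMPALA"]
    return ["D4PG"]


nethack_like = ActionSpec(shape=(), dtype=np.int64)         # discrete action index
mountain_car_like = ActionSpec(shape=(1,), dtype=np.float32)  # continuous action

print(suggested_agents(nethack_like))
print(suggested_agents(mountain_car_like))
```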