Issues with running code

trevormcinroe commented 2 years ago

Hi,

The provided code errors out when run with the following command: python run.py --method tia --configs dmc --task dmc_cartpole_swingup_none --logdir ./

I believe the problem below occurs in the call to self.train(next(self._dataset)) during the initialization of the SeparationDreamer class. The call is on line 497 here: https://github.com/kyonofx/tia/blob/main/Dreamer/dreamers.py#L497

Below is the full error:

2022-08-18 18:37:54.850774: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-08-18 18:37:54.851375: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2022-08-18 18:37:54.865525: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-18 18:37:54.865668: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce GTX 1650 Ti computeCapability: 7.5
coreClock: 1.485GHz coreCount: 16 deviceMemorySize: 3.82GiB deviceMemoryBandwidth: 178.84GiB/s
2022-08-18 18:37:54.865690: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-08-18 18:37:54.867835: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2022-08-18 18:37:54.867877: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2022-08-18 18:37:54.868565: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2022-08-18 18:37:54.868742: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2022-08-18 18:37:54.868858: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/cv2/../../lib64::/home/lukas/.mujoco/mujoco200/bin:/home/lukas/.mujoco/mujoco210/bin:/usr/lib/nvidia
2022-08-18 18:37:54.869412: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-08-18 18:37:54.869517: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-08-18 18:37:54.869527: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2022-08-18 18:37:54.870744: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-08-18 18:37:54.870772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-08-18 18:37:54.870778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      
2022-08-18 18:37:56.005765: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:656] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Did not find a shardable source, walked to a node which is not a dataset: name: "FlatMapDataset/_2"
op: "FlatMapDataset"
input: "TensorDataset/_1"
attr {
  key: "Targuments"
  value {
    list {
    }
  }
}
attr {
  key: "f"
  value {
    func {
      name: "__inference_Dataset_flat_map_flat_map_fn_65"
    }
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: -1
        }
        dim {
          size: 1
        }
      }
      shape {
        dim {
          size: -1
        }
      }
      shape {
        dim {
          size: -1
        }
        dim {
          size: 64
        }
        dim {
          size: 64
        }
        dim {
          size: 3
        }
      }
      shape {
        dim {
          size: -1
        }
      }
    }
  }
}
attr {
  key: "output_types"
  value {
    list {
      type: DT_HALF
      type: DT_HALF
      type: DT_UINT8
      type: DT_HALF
    }
  }
}
. Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.
2022-08-18 18:37:56.015864: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-08-18 18:37:56.016179: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2599990000 Hz
2022-08-18 18:38:15.379381: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py:22: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/keras_preprocessing/image/utils.py:23: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
  'nearest': pil_image.NEAREST,
/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/keras_preprocessing/image/utils.py:24: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
  'bilinear': pil_image.BILINEAR,
/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/keras_preprocessing/image/utils.py:25: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  'bicubic': pil_image.BICUBIC,
/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/keras_preprocessing/image/utils.py:28: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
  if hasattr(pil_image, 'HAMMING'):
/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/keras_preprocessing/image/utils.py:30: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
  if hasattr(pil_image, 'BOX'):
/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/keras_preprocessing/image/utils.py:33: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
  if hasattr(pil_image, 'LANCZOS'):
/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/__init__.py:61: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if (distutils.version.LooseVersion(tf.__version__) <
/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/__init__.py:61: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if (distutils.version.LooseVersion(tf.__version__) <
/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/gym/envs/registration.py:441: UserWarning: WARN: The `registry.env_specs` property along with `EnvSpecTree` is deprecated. Please use `registry` directly as a dictionary instead.
  "The `registry.env_specs` property along with `EnvSpecTree` is deprecated. Please use `registry` directly as a dictionary instead."
/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/gym/spaces/box.py:128: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/gym/core.py:330: DeprecationWarning: WARN: Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.
  "Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future."
/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/gym/wrappers/step_api_compatibility.py:40: DeprecationWarning: WARN: Initializing environment in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.
  "Initializing environment in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future."
/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/gym/envs/registration.py:441: UserWarning: WARN: The `registry.env_specs` property along with `EnvSpecTree` is deprecated. Please use `registry` directly as a dictionary instead.
  "The `registry.env_specs` property along with `EnvSpecTree` is deprecated. Please use `registry` directly as a dictionary instead."
Logdir dmc_cartpole_swingup_none/tia/0
Prefill dataset with 0 steps.
Simulating agent for 995000 steps.
Found 253201 disen_reward parameters.
Traceback (most recent call last):
  File "run.py", line 121, in <module>
    main(args.method, parser.parse_args(remaining))
  File "run.py", line 84, in main
    agent = DreamerModel(config, datadir, actspace, writer)
  File "/home/lukas/Documents/University/PhD/Research/ksl/experiments/tia/Dreamer/dreamers.py", line 305, in __init__
    super().__init__(config, datadir, actspace, writer)   
  File "/home/lukas/Documents/University/PhD/Research/ksl/experiments/tia/Dreamer/dreamers.py", line 52, in __init__
    self._build_model()
  File "/home/lukas/Documents/University/PhD/Research/ksl/experiments/tia/Dreamer/dreamers.py", line 498, in _build_model
    self.train(next(self._dataset))
  File "/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 871, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 726, in _initialize
    *args, **kwds))
  File "/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2969, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3361, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3206, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 990, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 634, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3887, in bound_method_wrapper
    return wrapped_fn(*args, **kwargs)
  File "/home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 977, in wrapper
    raise e.ag_error_metadata.to_exception(e)
AttributeError: in user code:

    /home/lukas/Documents/University/PhD/Research/ksl/experiments/tia/Dreamer/dreamers.py:101 train  *
        self._train(data, log_images)
    /home/lukas/Documents/University/PhD/Research/ksl/experiments/tia/Dreamer/dreamers.py:392 _train  *
        imag_feat = self._imagine_ahead(post)
    /home/lukas/Documents/University/PhD/Research/ksl/experiments/tia/Dreamer/dreamers.py:231 policy  *
        tf.stop_gradient(self._dynamics.get_feat(state))).sample()
    /home/lukas/Documents/University/PhD/Research/ksl/experiments/tia/Dreamer/tools.py:411 static_scan  *
        last = fn(last, inp)
    /tmp/tmpwxupo2rk.py:53 <lambda>  **
        states = ag__.converted_call(ag__.ld(tools).static_scan, (ag__.autograph_artifact((lambda prev, _: ag__.converted_call(ag__.ld(self)._dynamics.img_step, (ag__.ld(prev), ag__.converted_call(ag__.ld(policy), (ag__.ld(prev),), None, fscope)), None, fscope))), ag__.converted_call(ag__.ld(tf).range, (ag__.ld(self)._c.horizon,), None, fscope), ag__.ld(start)), None, fscope)
    /tmp/tmpwxupo2rk.py:48 policy  **
        retval__2 = ag__.converted_call(ag__.converted_call(ag__.ld(self)._actor, (ag__.converted_call(ag__.ld(tf).stop_gradient, (ag__.converted_call(ag__.ld(self)._dynamics.get_feat, (ag__.ld(state),), None, fscope_2),), None, fscope_2),), None, fscope_2).sample, (), None, fscope_2)
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/distributions/distribution.py:1002 sample  **
        return self._call_sample_n(sample_shape, seed, name, **kwargs)
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/distributions/distribution.py:980 _call_sample_n
        n, seed=seed() if callable(seed) else seed, **kwargs)
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/distributions/independent.py:249 _sample_n
        return self.distribution.sample(sample_shape=n, seed=seed, **kwargs)
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/distributions/distribution.py:1002 sample
        return self._call_sample_n(sample_shape, seed, name, **kwargs)
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/distributions/transformed_distribution.py:345 _call_sample_n
        y = self.bijector.forward(x, **bijector_kwargs)
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/bijectors/bijector.py:939 forward
        return self._call_forward(x, name, **kwargs)
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/bijectors/bijector.py:921 _call_forward
        return self._cache.forward(x, **kwargs)
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/internal/cache_util.py:338 forward
        return self._lookup(x, self._forward_name, self._inverse_name, **kwargs)
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/internal/cache_util.py:489 _lookup
        input, forward_name, kwargs)
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/internal/cache_util.py:524 _get_or_create_edge
        callback=self.storage.maybe_del)
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/internal/cache_util.py:141 __init__
        self._hash = hash(hashable_structure((self._struct, self._subkey)))
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/internal/cache_util.py:54 hashable_structure
        for k, v in nest.flatten_with_tuple_paths(struct))
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/internal/cache_util.py:54 <genexpr>
        for k, v in nest.flatten_with_tuple_paths(struct))
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/internal/cache_util.py:48 make_hashable
        hash(obj)
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/bijectors/bijector.py:656 __hash__
        type(self), self._get_parameterization())))
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/bijectors/bijector.py:692 _get_parameterization
        return self.parameters
    /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/tensorflow_probability/python/bijectors/bijector.py:651 parameters
        return {k: v for k, v in self._parameters.items()

    AttributeError: 'NoneType' object has no attribute 'items'

2022-08-18 18:38:19.964977: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
     [[{{node PyFunc}}]]

We ran the above command on two different systems, and both gave the same error. For reference, one system is running CUDA 11.6 and the other CUDA 11.4.
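
The last frames of the traceback show a `parameters` property iterating over `self._parameters.items()` while `self._parameters` is `None`, triggered from `__hash__`. A minimal, library-free sketch of that failure pattern follows. `BaseBijector` and `CustomBijector` are hypothetical stand-ins, not the tensorflow_probability classes or the repository's code, and whether a custom bijector in the codebase is the actual trigger is an assumption rather than something established in this thread.

```python
class BaseBijector:
    """Hypothetical stand-in for a base class that hashes its constructor parameters."""

    def __init__(self, parameters=None):
        # Mirrors the pattern in the traceback: parameters may be left as None.
        self._parameters = parameters

    @property
    def parameters(self):
        # Raises AttributeError when self._parameters is None.
        return {k: v for k, v in self._parameters.items()}

    def __hash__(self):
        # Hashing depends on the parameters, so it fails when they are missing.
        return hash((type(self), tuple(sorted(self.parameters.items()))))


class CustomBijector(BaseBijector):
    """Hypothetical subclass that optionally forwards its parameters."""

    def __init__(self, scale=1.0, forward_parameters=False):
        params = {"scale": scale} if forward_parameters else None
        super().__init__(parameters=params)


def try_hash(obj):
    """Return the hash of obj, or the error message if hashing fails."""
    try:
        return hash(obj)
    except AttributeError as err:
        return str(err)


broken = try_hash(CustomBijector())
working = try_hash(CustomBijector(forward_parameters=True))
print(broken)  # 'NoneType' object has no attribute 'items'
```

In this toy version the error disappears once the subclass forwards a parameter dictionary to the base class; whether the same holds for the real library at the installed version would need to be checked against its documentation.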

Also, below is the package list we are using:

Package                      Version
---------------------------- ---------
absl-py                      0.15.0
advance-touch                1.0.2
appdirs                      1.4.4
astunparse                   1.6.3
brotab                       1.3.0
cached-property              1.5.2
cachetools                   5.2.0
certifi                      2022.6.15
charset-normalizer           2.1.0
clang                        5.0
cloudpickle                  2.1.0
cycler                       0.11.0
decorator                    5.1.1
dm-control                   1.0.5
dm-env                       1.5
dm-tree                      0.1.7
docopt                       0.6.2
flatbuffers                  1.12
fonttools                    4.36.0
gast                         0.3.3
glfw                         2.5.4
google-auth                  2.10.0
google-auth-oauthlib         0.4.6
google-pasta                 0.2.0
grpcio                       1.32.0
gym                          0.25.1
gym-notices                  0.0.8
h5py                         2.10.0
httplib2                     0.18.1
idna                         3.3
imageio                      2.21.1
importlib-metadata           4.12.0
keras                        2.6.0
Keras-Preprocessing          1.1.2
kiwisolver                   1.4.4
labmaze                      1.0.5
libclang                     14.0.6
lxml                         4.9.1
Markdown                     3.4.1
MarkupSafe                   2.1.1
matplotlib                   3.5.3
mujoco                       2.2.1
networkx                     2.6.3
numpy                        1.19.5
oauth2client                 4.1.3
oauthlib                     3.2.0
opencv-python                4.6.0.66
opt-einsum                   3.3.0
packaging                    21.3
pandas                       1.3.5
Pillow                       9.2.0
pip                          22.2.2
protobuf                     3.19.4
pyasn1                       0.4.8
pyasn1-modules               0.2.8
PyOpenGL                     3.1.6
pyparsing                    2.4.7
python-dateutil              2.8.2
python-telegram-bot          12.8
python-xlib                  0.27
pytz                         2022.2.1
PyWavelets                   1.3.0
PyYAML                       6.0
requests                     2.28.1
requests-oauthlib            1.3.1
rsa                          4.9
ruamel.yaml                  0.17.21
ruamel.yaml.clib             0.2.6
s-tui                        1.0.2
scikit-image                 0.19.3
scikit-video                 1.1.11
scipy                        1.7.3
setuptools                   65.0.2
six                          1.15.0
telegram-send                0.25
tensorboard                  2.9.1
tensorboard-data-server      0.6.1
tensorboard-plugin-wit       1.8.1
tensorflow                   2.4.0
tensorflow-estimator         2.4.0
tensorflow-io-gcs-filesystem 0.26.0
tensorflow-probability       0.12.0
termcolor                    1.1.0
tifffile                     2021.11.2
tqdm                         4.64.0
typing-extensions            3.7.4.3
ueberzug                     18.1.6
urllib3                      1.26.11
urwid                        2.1.1
Werkzeug                     2.2.2
wheel                        0.37.1
wrapt                        1.12.1
zipp                         3.8.1

kyonofx commented 2 years ago

Hi, several observations from your log:

Some CUDA libraries were not loaded successfully:

2022-08-18 18:37:54.868858: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/lukas/anaconda3/envs/tia/lib/python3.7/site-packages/cv2/../../lib64::/home/lukas/.mujoco/mujoco200/bin:/home/lukas/.mujoco/mujoco210/bin:/usr/lib/nvidia
2022-08-18 18:37:54.869412: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2022-08-18 18:37:54.869517: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2022-08-18 18:37:54.869527: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

I did not have this issue.

"Prefill dataset with 0 steps." is strange. Did you already run the code before and collect 5000 steps of data in the target log directory?
Only "Found 253201 disen_reward parameters." is printed. There should be several more lines of output reporting the parameter counts of the other model components.
You are using different versions of tensorflow and tensorflow-probability. I also don't see tensorflow-gpu.
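
The library-loading observation can be checked mechanically. As a rough sketch, the stdlib-only helper below scans TensorFlow log text for the dso_loader messages quoted in this thread and separates libraries that loaded from those that did not. The name `summarize_dso_loader` is made up for illustration, and the regular expressions assume the exact message wording seen in the log above.

```python
import re

# Patterns assume the message wording that appears in the log in this thread.
LOADED = re.compile(r"Successfully opened dynamic library (\S+)")
FAILED = re.compile(r"Could not load dynamic library '([^']+)'")


def summarize_dso_loader(log_text):
    """Return (loaded, failed) lists of library names found in the log text."""
    loaded, failed = [], []
    for line in log_text.splitlines():
        match = FAILED.search(line)
        if match:
            failed.append(match.group(1))
            continue
        match = LOADED.search(line)
        if match:
            loaded.append(match.group(1))
    return loaded, failed


# Three lines from the log above, shortened (timestamps and LD_LIBRARY_PATH removed).
sample_log = """\
I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
"""

loaded, failed = summarize_dso_loader(sample_log)
print("loaded:", loaded)
print("failed:", failed)
```

Any entry in `failed` matters because, as the log itself states, TensorFlow then skips registering GPU devices.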

trevormcinroe commented 2 years ago

Hi kyonofx,

Thank you for getting back to us so quickly.

One quick question. What version of CUDA are you running on your system?

The codebase's README suggests using tensorflow-gpu==2.3.1. According to this table, that version of tensorflow-gpu requires CUDA < 11. Unfortunately, we cannot downgrade the CUDA version on our GPUs, so we might have versioning issues with tensorflow-gpu and tensorflow_probability.
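
To make the suspected version mismatch concrete, here is a small sketch that parses `pip list` style output and reports packages that are missing or differ from a set of expected pins. The only pin taken from this thread is tensorflow-gpu==2.3.1 (from the README); the helper names `parse_pip_list` and `find_mismatches` are hypothetical.

```python
def parse_pip_list(text):
    """Parse the two-column `pip list` format into {lowercased name: version}."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header row, and the dashed separator row.
        if len(parts) != 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        packages[parts[0].lower()] = parts[1]
    return packages


def find_mismatches(installed, expected):
    """Return {name: (expected, installed or None)} for pins that are not met."""
    return {
        name: (want, installed.get(name.lower()))
        for name, want in expected.items()
        if installed.get(name.lower()) != want
    }


# Excerpt of the package list posted earlier in this thread.
pip_output = """\
Package                      Version
---------------------------- ---------
tensorflow                   2.4.0
tensorflow-estimator         2.4.0
tensorflow-probability       0.12.0
"""

installed = parse_pip_list(pip_output)
mismatches = find_mismatches(installed, {"tensorflow-gpu": "2.3.1"})
print(mismatches)  # {'tensorflow-gpu': ('2.3.1', None)}
```

A `None` in the second position means the package is not installed at all, which matches the earlier remark that tensorflow-gpu does not appear in the list.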

kyonofx commented 2 years ago

I used CUDA 10.1. It is possible to have multiple CUDA versions on the same machine, so you can install CUDA 10.1 alongside your current version as long as you have sudo permission. Maybe the CUDA version is the main cause?
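
When several CUDA versions are installed side by side, it helps to see which directories actually provide a given shared library. The sketch below searches the entries of an LD_LIBRARY_PATH-style string for a file name such as libcusolver.so.10, the library reported missing in the log. `find_library_dirs` is a hypothetical helper; it only inspects the listed directories, so it does not replicate everything the dynamic loader does (for example, it ignores the ldconfig cache).

```python
import os
import tempfile


def find_library_dirs(lib_name, search_path=None):
    """Return the directories in search_path that contain a file named lib_name."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in search_path.split(os.pathsep):
        # Empty entries (as in "a::b") are skipped rather than treated as the cwd.
        if directory and os.path.isfile(os.path.join(directory, lib_name)):
            hits.append(directory)
    return hits


# Demonstration with two temporary directories standing in for CUDA installs.
with tempfile.TemporaryDirectory() as cuda_a, tempfile.TemporaryDirectory() as cuda_b:
    open(os.path.join(cuda_b, "libcusolver.so.10"), "w").close()
    demo_path = os.pathsep.join([cuda_a, "", cuda_b])
    found = find_library_dirs("libcusolver.so.10", demo_path)
    found_matches = found == [cuda_b]
    missing = find_library_dirs("libcusolver.so.11", demo_path)

print(found_matches, missing)  # True []
```

On the affected machine, calling `find_library_dirs("libcusolver.so.10")` without a second argument reads the real LD_LIBRARY_PATH; an empty result would be consistent with the loader warning in the log.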

kyonofx / tia

Issues with running code #2