Module 5 - Errors when Training the Model - Githubissues

huggingface / deep-rl-class

This repo contains the syllabus of the Hugging Face Deep Reinforcement Learning Course.

Apache License 2.0

3.87k stars, 594 forks

Module 5 - Errors when Training the Model #269

Closed BoschAI closed 1 year ago

BoschAI commented 1 year ago

I created the SnowballTarget.yaml file and proceeded with the tutorial. When I get to the "Train the Agent" section and run the code, I get a long list of errors. I am not sure how to correct this and complete the exercise.

Version information:
  ml-agents: 0.31.0.dev0,
  ml-agents-envs: 0.31.0.dev0,
  Communicator API: 1.5.0,
  PyTorch: 1.11.0+cu102
[INFO] Connected to Unity environment with package version 2.1.0-exp.1 and communication version 1.5.0
[INFO] Connected new brain: SnowballTarget?team=0
2023-04-03 20:10:50.913408: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py", line 42, in tf
    from tensorboard.compat import notf  # noqa: F401
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py)

During handling of the above exception, another exception occurred:

RuntimeError: module compiled against API version 0xf but this version of numpy is 0xe
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py", line 42, in tf
    from tensorboard.compat import notf  # noqa: F401
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py)

During handling of the above exception, another exception occurred:

RuntimeError: module compiled against API version 0xf but this version of numpy is 0xe
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py", line 42, in tf
    from tensorboard.compat import notf  # noqa: F401
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py)

During handling of the above exception, another exception occurred:

ImportError: numpy.core._multiarray_umath failed to import
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py", line 42, in tf
    from tensorboard.compat import notf  # noqa: F401
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py)

During handling of the above exception, another exception occurred:

ImportError: numpy.core.umath failed to import
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py", line 42, in tf
    from tensorboard.compat import notf  # noqa: F401
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py)

During handling of the above exception, another exception occurred:

RuntimeError: module compiled against API version 0xf but this version of numpy is 0xe
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py", line 42, in tf
    from tensorboard.compat import notf  # noqa: F401
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py)

During handling of the above exception, another exception occurred:

ImportError: numpy.core._multiarray_umath failed to import
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py", line 42, in tf
    from tensorboard.compat import notf  # noqa: F401
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py)

During handling of the above exception, another exception occurred:

ImportError: numpy.core.umath failed to import
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py", line 42, in tf
    from tensorboard.compat import notf  # noqa: F401
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py)

During handling of the above exception, another exception occurred:

RuntimeError: module compiled against API version 0xf but this version of numpy is 0xe
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py", line 42, in tf
    from tensorboard.compat import notf  # noqa: F401
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py)

During handling of the above exception, another exception occurred:

ImportError: numpy.core._multiarray_umath failed to import
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py", line 42, in tf
    from tensorboard.compat import notf  # noqa: F401
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py)

During handling of the above exception, another exception occurred:

ImportError: numpy.core.umath failed to import
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py", line 42, in tf
    from tensorboard.compat import notf  # noqa: F401
ImportError: cannot import name 'notf' from 'tensorboard.compat' (/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/mlagents-learn", line 33, in <module>
    sys.exit(load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')())
  File "/content/ml-agents/ml-agents/mlagents/trainers/learn.py", line 264, in main
    run_cli(parse_command_line())
  File "/content/ml-agents/ml-agents/mlagents/trainers/learn.py", line 260, in run_cli
    run_training(run_seed, options, num_areas)
  File "/content/ml-agents/ml-agents/mlagents/trainers/learn.py", line 136, in run_training
    tc.start_learning(env_manager)
  File "/content/ml-agents/ml-agents-envs/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/content/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 172, in start_learning
    self._reset_env(env_manager)
  File "/content/ml-agents/ml-agents-envs/mlagents_envs/timers.py", line 305, in wrapped
    return func(*args, **kwargs)
  File "/content/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 107, in _reset_env
    self._register_new_behaviors(env_manager, env_manager.first_step_infos)
  File "/content/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 267, in _register_new_behaviors
    self._create_trainers_and_managers(env_manager, new_behavior_ids)
  File "/content/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 165, in _create_trainers_and_managers
    self._create_trainer_and_manager(env_manager, behavior_id)
  File "/content/ml-agents/ml-agents/mlagents/trainers/trainer_controller.py", line 125, in _create_trainer_and_manager
    trainer = self.trainer_factory.generate(brain_name)
  File "/content/ml-agents/ml-agents/mlagents/trainers/trainer/trainer_factory.py", line 58, in generate
    return TrainerFactory._initialize_trainer(
  File "/content/ml-agents/ml-agents/mlagents/trainers/trainer/trainer_factory.py", line 105, in _initialize_trainer
    trainer = trainer_type(
  File "/content/ml-agents/ml-agents/mlagents/trainers/ppo/trainer.py", line 52, in __init__
    super().__init__(
  File "/content/ml-agents/ml-agents/mlagents/trainers/trainer/on_policy_trainer.py", line 44, in __init__
    super().__init__(
  File "/content/ml-agents/ml-agents/mlagents/trainers/trainer/rl_trainer.py", line 50, in __init__
    self._stats_reporter.add_property(
  File "/content/ml-agents/ml-agents/mlagents/trainers/stats.py", line 322, in add_property
    writer.add_property(self.category, property_type, value)
  File "/content/ml-agents/ml-agents/mlagents/trainers/stats.py", line 283, in add_property
    self._maybe_create_summary_writer(category)
  File "/content/ml-agents/ml-agents/mlagents/trainers/stats.py", line 259, in _maybe_create_summary_writer
    self.summary_writers[category] = SummaryWriter(filewriter_dir)
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/tensorboard/writer.py", line 220, in __init__
    self._get_file_writer()
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/tensorboard/writer.py", line 250, in _get_file_writer
    self.file_writer = FileWriter(self.log_dir, self.max_queue,
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/tensorboard/writer.py", line 60, in __init__
    self.event_writer = EventFileWriter(
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/summary/writer/event_file_writer.py", line 72, in __init__
    tf.io.gfile.makedirs(logdir)
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/lazy.py", line 65, in __getattr__
    return getattr(load_once(self), attr_name)
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/lazy.py", line 97, in wrapper
    cache[arg] = f(arg)
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/lazy.py", line 50, in load_once
    module = load_fn()
  File "/usr/local/lib/python3.9/dist-packages/tensorboard/compat/__init__.py", line 45, in tf
    import tensorflow
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/__init__.py", line 37, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/__init__.py", line 42, in <module>
    from tensorflow.python import data
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/data/__init__.py", line 21, in <module>
    from tensorflow.python.data import experimental
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/data/experimental/__init__.py", line 97, in <module>
    from tensorflow.python.data.experimental import service
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/data/experimental/service/__init__.py", line 419, in <module>
    from tensorflow.python.data.experimental.ops.data_service_ops import distribute
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/data/experimental/ops/data_service_ops.py", line 22, in <module>
    from tensorflow.python.data.experimental.ops import compression_ops
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/data/experimental/ops/compression_ops.py", line 16, in <module>
    from tensorflow.python.data.util import structure
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/data/util/structure.py", line 22, in <module>
    from tensorflow.python.data.util import nest
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/data/util/nest.py", line 34, in <module>
    from tensorflow.python.framework import sparse_tensor as _sparse_tensor
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/framework/sparse_tensor.py", line 25, in <module>
    from tensorflow.python.framework import constant_op
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/framework/constant_op.py", line 25, in <module>
    from tensorflow.python.eager import execute
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/execute.py", line 21, in <module>
    from tensorflow.python.framework import dtypes
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/framework/dtypes.py", line 37, in <module>
    _np_bfloat16 = _pywrap_bfloat16.TF_bfloat16_type()
TypeError: Unable to convert function return value to a Python type! The signature was () -> handle

mattsthilaire commented 1 year ago

I'm getting this really nasty error myself. I tried the suggested fix seen below, but no luck:

https://github.com/NVlabs/stylegan3/issues/181

mattsthilaire commented 1 year ago

Digging into this a bit further, it looks like the issue has something to do with Tensorboard logging...something?
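One reading of the traceback above is that TensorBoard's compat layer tries `import tensorflow` and only falls back to its own stub when that import raises `ImportError`. The failure here is a `TypeError` raised partway through importing TensorFlow, so no fallback would happen. The sketch below illustrates that general pattern only. It is not TensorBoard's actual code, and `load_backend`, `missing_backend`, `broken_backend`, and `stub_backend` are hypothetical names made up for the illustration.

```python
def load_backend(import_real, import_stub):
    """Return the real backend if it imports, otherwise the stub.

    Only ImportError triggers the fallback; any other exception propagates.
    """
    try:
        return import_real()
    except ImportError:
        return import_stub()


def missing_backend():
    # Simulates the optional package not being installed at all.
    raise ImportError("No module named 'tensorflow'")


def broken_backend():
    # Simulates a package that is installed but crashes while importing.
    raise TypeError("Unable to convert function return value to a Python type!")


def stub_backend():
    return "stub"


# Package absent: the fallback kicks in.
print(load_backend(missing_backend, stub_backend))

# Package present but broken: the error escapes the fallback.
try:
    load_backend(broken_backend, stub_backend)
except TypeError as exc:
    print("not caught by the fallback:", exc)
```

If this reading is right, it would explain why a broken TensorFlow install is worse than having no TensorFlow at all.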

If you comment out these lines in the ml-agents/ml-agents/trainers/trainer/learn.py file:

Colab seems to run and train, just without the Tensorboard logging. I haven't waited long enough to see the end results, but it is running. Not sure if this affects loading the model to the hub or not.

This might be more of an issue with dependencies in the ml-agents library though.

Pinging @simoninithomas. Seems to be a hard stop blocker on Unit 5.

EDIT: So it does work and train, and you are able to push it to the hub with those lines commented out. @BoschAI Maybe a messed up way to do it, but at least it gets you closer to finishing the course for now :)

simoninithomas commented 1 year ago

Hey there 👋 a simpler solution was to uninstall tensorflow 2 (people on discord gave me the solution 😄 ). We updated the notebooks (Huggy and Unit 5). It should work fine now 🤗
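For anyone checking their own runtime before applying that fix, here is a minimal sketch that reports which of the relevant packages are installed and builds (without running) an uninstall command. The exact command used in the updated notebooks is not quoted in this thread, so the distribution name `tensorflow` and the helper names `installed_versions` and `uninstall_command` are assumptions for illustration.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def uninstall_command(package):
    """Build (but do not run) a pip uninstall command for this interpreter."""
    return [sys.executable, "-m", "pip", "uninstall", "-y", package]


# Assumption: the distribution is named "tensorflow"; some environments
# ship it under a different name.
versions = installed_versions(["tensorflow", "tensorboard", "numpy"])
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")

if versions["tensorflow"] is not None:
    print("To remove it:", " ".join(uninstall_command("tensorflow")))
```

Reading versions through `importlib.metadata` avoids importing the packages themselves, which matters here because importing TensorFlow is the step that fails.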