uzh-rpg / agile_autonomy

Repository Containing the Code associated with the Paper: "Learning High-Speed Flight in the Wild"
GNU General Public License v3.0
603 stars 164 forks

TensorFlow OOM when trying to launch test_trajectories.py #17

Open den250400 opened 2 years ago

den250400 commented 2 years ago

After 1 month of trial and error, I was finally able to build agile_autonomy. The only ROS package I was unable to build is mpl_test_node, though I don't think it's a crucial part of the system.

roslaunch agile_autonomy simulation.launch launched fine - I was even able to make the copter hover via the GUI. However, when I tried to fly using the network's predictions (python test_trajectories.py --settings_file=config/test_settings.yaml), I encountered the following error:

2021-11-24 18:58:17.380102: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-11-24 18:58:19.828541: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-11-24 18:58:19.829417: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-11-24 18:58:19.855816: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-24 18:58:19.856260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 Ti computeCapability: 7.5
coreClock: 1.485GHz coreCount: 16 deviceMemorySize: 3.82GiB deviceMemoryBandwidth: 178.84GiB/s
2021-11-24 18:58:19.856316: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-11-24 18:58:19.858212: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-11-24 18:58:19.858527: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-11-24 18:58:19.860495: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-11-24 18:58:19.861046: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-11-24 18:58:19.863336: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-11-24 18:58:19.864947: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-11-24 18:58:19.869907: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-11-24 18:58:19.870182: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-24 18:58:19.870822: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-24 18:58:19.871087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-11-24 18:58:19.872465: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-24 18:58:19.873089: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-24 18:58:19.873410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 Ti computeCapability: 7.5
coreClock: 1.485GHz coreCount: 16 deviceMemorySize: 3.82GiB deviceMemoryBandwidth: 178.84GiB/s
2021-11-24 18:58:19.873488: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-11-24 18:58:19.873558: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-11-24 18:58:19.873608: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.10
2021-11-24 18:58:19.873641: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-11-24 18:58:19.873671: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-11-24 18:58:19.873701: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-11-24 18:58:19.873731: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-11-24 18:58:19.873761: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-11-24 18:58:19.873862: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-24 18:58:19.874198: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-24 18:58:19.874564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-11-24 18:58:19.874651: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2021-11-24 18:58:20.494457: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-11-24 18:58:20.494495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-11-24 18:58:20.494503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2021-11-24 18:58:20.494743: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-24 18:58:20.495361: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-24 18:58:20.495849: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-24 18:58:20.496326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2922 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1650 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2021-11-24 18:58:20.496799: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
------------------------------------------
Restored from models/ckpt-50
------------------------------------------
2021-11-24 18:58:23.119284: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-11-24 18:58:23.183628: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2499950000 Hz
2021-11-24 18:58:23.684142: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.7
2021-11-24 18:58:24.938844: W tensorflow/stream_executor/gpu/asm_compiler.cc:63] Running ptxas --version returned 256
2021-11-24 18:58:24.990377: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 256, output: 
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
2021-11-24 18:58:25.396965: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-11-24 18:58:31.038641: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2021-11-24 18:58:31.619634: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2021-11-24 18:58:31.685442: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 2.73G (2931228672 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Net initialized
Traceback (most recent call last):
  File "test_trajectories.py", line 19, in <module>
    main()
  File "test_trajectories.py", line 15, in main
    trainer.perform_testing()
  File "/home/denis/agile_autonomy_ws/catkin_aa/src/agile_autonomy/planner_learning/dagger_training.py", line 125, in perform_testing
    removable_rollout_folders = os.listdir(self.settings.expert_folder)
FileNotFoundError: [Errno 2] No such file or directory: '../data_generation/data/'
--- Logging error ---
Traceback (most recent call last):
  File "/home/denis/anaconda3/envs/tf_24/lib/python3.7/logging/handlers.py", line 69, in emit
    if self.shouldRollover(record):
  File "/home/denis/anaconda3/envs/tf_24/lib/python3.7/logging/handlers.py", line 183, in shouldRollover
    self.stream = self._open()
  File "/home/denis/anaconda3/envs/tf_24/lib/python3.7/logging/__init__.py", line 1116, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
NameError: name 'open' is not defined
Call stack:
  File "/home/denis/anaconda3/envs/tf_24/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 161, in __del__
    .format(pretty_printer.node_names[node_id]))
  File "/home/denis/anaconda3/envs/tf_24/lib/python3.7/site-packages/tensorflow/python/platform/tf_logging.py", line 178, in warning
    get_logger().warning(msg, *args, **kwargs)
Message: 'Unresolved object in checkpoint: (root).optimizer.iter'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
  File "/home/denis/anaconda3/envs/tf_24/lib/python3.7/logging/handlers.py", line 69, in emit
    if self.shouldRollover(record):
  File "/home/denis/anaconda3/envs/tf_24/lib/python3.7/logging/handlers.py", line 183, in shouldRollover
    self.stream = self._open()
  File "/home/denis/anaconda3/envs/tf_24/lib/python3.7/logging/__init__.py", line 1116, in _open
    return open(self.baseFilename, self.mode, encoding=self.encoding)
NameError: name 'open' is not defined
Call stack:
  File "/home/denis/anaconda3/envs/tf_24/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 161, in __del__
    .format(pretty_printer.node_names[node_id]))
  File "/home/denis/anaconda3/envs/tf_24/lib/python3.7/site-packages/tensorflow/python/platform/tf_logging.py", line 178, in warning
    get_logger().warning(msg, *args, **kwargs)
Message: 'Unresolved object in checkpoint: (root).optimizer.beta_1'
Arguments: ()

I don't know exactly what the source of the problem is, but I have two guesses:

1. My GPU really doesn't have enough memory (it is a GeForce GTX 1650 Ti with 4 GB of memory); a possible mitigation is sketched below.

2. The checkpoint files were saved with an older TensorFlow version, and the newer one fails to read them. @antonilo @kelia Can you please provide your TensorFlow version?
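If the OOM comes from TensorFlow pre-allocating nearly the whole 4 GB card up front, one common mitigation is to enable memory growth before the model is built. This is a minimal sketch, not part of the repository's code; it assumes it runs before any GPU tensors are allocated (e.g. near the top of test_trajectories.py):

import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of reserving
# (almost) all of it at start-up; must run before the GPUs are initialized.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

This does not add memory, so a 4 GB card may still be too small for the full pipeline, but it avoids failures caused purely by the up-front allocation.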

antonilo commented 2 years ago

Create this directory and you're ready to go: mkdir ../data_generation/data/
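For reference, the same fix can be done from Python; this hypothetical snippet assumes it is run from the planner_learning directory, like test_trajectories.py, so that the relative path from the traceback resolves:

import os

# Create the expert rollout folder the script expects; exist_ok avoids
# an error if the directory already exists.
os.makedirs('../data_generation/data/', exist_ok=True)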

swxgithub commented 2 years ago

How did you solve the problem with the mpl_test_node package? When I run roslaunch agile_autonomy simulation.launch, I get the error: Resource not found: mpl_test_node.

den250400 commented 2 years ago

How did you solve the problem with the mpl_test_node package? When I run roslaunch agile_autonomy simulation.launch, I get the error: Resource not found: mpl_test_node.

I also get this error message, but the launch nevertheless continues and the simulation window opens.

den250400 commented 2 years ago

Create this directory and you're ready to go: mkdir ../data_generation/data/

Thanks a lot! After creating the directory, test_trajectories.py launched, the copter spawned in the forest and started avoiding trees.

However, the "RGB" camera window in rviz appears to have only 1-2 fps (and so for "rpg_flightmare" window). Is there a way to accelerate this?

antonilo commented 2 years ago

There is, unfortunately, nothing you can do about that except get a faster computer. That is just related to the computational budget of your machine.

calmelo commented 2 years ago

After 1 month of trial and error, I was finally able to build agile_autonomy. [...] @antonilo @kelia Can you please provide your TensorFlow version?

I saw the paper says the network can be run on the CPU. Do you know how to run the network on the CPU?
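One way to try this, sketched here as an assumption rather than a documented option of this repository, is to hide the GPU from TensorFlow before it initializes, so inference falls back to the CPU:

import os

# Hide all CUDA devices from TensorFlow; must be set before tensorflow is imported.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))  # should print an empty list

Expect network inference to be noticeably slower on the CPU than on the GPU.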

calmelo commented 2 years ago

After 1 month of trial and error, I was finally able to build agile_autonomy. [...] @antonilo @kelia Can you please provide your TensorFlow version?

The network initializes very slowly for me. Did you run into this problem?

den250400 commented 2 years ago

The network initializes very slowly for me. Did you run into this problem?

On my machine, the neural net also took around 40 seconds to initialize, which is quite slow but still workable for experiments.

stochasticritic commented 2 years ago

After 1 month of trial and error, I was finally able to build agile_autonomy. [...] @antonilo @kelia Can you please provide your TensorFlow version?

Same here. Still, the simulation stops working in my case.