Issue using the GPU with LabGym

Cerilam commented 1 year ago

Hi, first of all, I really like LabGym and I am heading toward really nice results! For info, we spoke by email earlier this year about some issues (I won't give my name here, but my initials are B.S.).

I am posting here because I believe this issue may interest or help others. I cannot get LabGym to use my GPU. Training a Categorizer uses 100% of my CPU and a lot of my memory, but my GPU activity is almost 0. CUDA is installed, but I am not sure it is set up properly for LabGym.

GPU : NVIDIA RTX A2000 12GB

In CMD, typing nvcc --version outputs the following:

 nvcc: NVIDIA (R) Cuda compiler driver
 Copyright (c) 2005-2023 NVIDIA Corporation
 Built on Tue_Jun_13_19:42:34_Pacific_Daylight_Time_2023
 Cuda compilation tools, release 12.2, V12.2.91
 Build cuda_12.2.r12.2/compiler.32965470_0

Also, when the training starts, CMD shows this message:

I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

Unfortunately, I am not familiar enough with TensorFlow to know how to handle this issue.

Thanks for the help

B. S.

yujiahu415 commented 1 year ago

Hi,

Thank you so much! And yes I remember you!

It's normal to see this message 'I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: SSE SSE2 SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.'

Are there any additional messages after that? Something like 'cudartxxx.dll is missing'? If not, then the issue is probably that the CUDA toolkit version you installed is too new to be compatible with TensorFlow.

So I suggest you downgrade the CUDA toolkit to v11.7. I have tested this version of CUDA and it worked well for GPU usage with both TensorFlow and PyTorch (for the 'Detectors').

Let me know how it goes. Thanks again!

Cerilam commented 1 year ago

Hi,

Unfortunately, I still have the same problem after installing 11.7.1 and removing the newer version. I get no error when running the Detectors and the Categorizers, but I have 0 activity on my GPU for both. As for "cudartxxx.dll is missing": no error of this kind.

New nvcc output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:59:34_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

Thanks again

yujiahu415 commented 1 year ago

Hi,

This is strange, but I think it's addressable. You have a good GPU with a large VRAM that could accelerate both training and inference, especially when you use Detectors. It would be a pity if you could not use it.

So first make sure the GPU driver is up-to-date. I just did a search and the latest version of your GPU driver should be R535 U4 (536.96): https://www.nvidia.com/Download/driverResults.aspx/210450/en-us/. You can check the installed driver version of your GPU by typing nvidia-smi in your cmd / terminal.
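For anyone who prefers to script this check, here is a minimal Python sketch that asks nvidia-smi for the GPU name and driver version. The function name query_nvidia_driver is made up for illustration, and the sketch assumes nvidia-smi is on the PATH; it returns None when it is not.

```python
import shutil
import subprocess


def query_nvidia_driver():
    """Return a list of 'name, driver_version' strings, or None if nvidia-smi is unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # nvidia-smi is not on the PATH (driver missing or PATH not set up).
        return None
    try:
        result = subprocess.run(
            [exe, "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # One line per GPU, for example "<GPU name>, <driver version>".
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    print(query_nvidia_driver())
```

A None result only says that the tool could not be run; it does not by itself prove that the driver is missing.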

After that, make sure to add the path of the CUDA toolkit to the system environment variables. See https://docs.nvidia.com/gameworks/content/developertools/desktop/environment_variables.htm for where to find the 'environment variables' in Windows. You can also check this blog post to see how to edit them: https://medium.com/@sunn-e/how-to-install-cuda-10-and-cudnn-for-tensorflow-gpu-on-windows-10-414c10eabc96.
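A quick way to see what Python itself finds in the environment is sketched below. It assumes the CUDA installer set a CUDA_PATH variable (it usually does on Windows, but treat that as an assumption) and simply lists the PATH entries that mention CUDA. The helper name find_cuda_path_entries is hypothetical.

```python
import os


def find_cuda_path_entries(environ=None):
    """Return (CUDA_PATH value or None, list of PATH entries mentioning CUDA)."""
    env = os.environ if environ is None else environ
    # CUDA_PATH is assumed to be set by the CUDA installer; it may be absent.
    cuda_path = env.get("CUDA_PATH")
    path_entries = [
        entry
        for entry in env.get("PATH", "").split(os.pathsep)
        if "cuda" in entry.lower()
    ]
    return cuda_path, path_entries


if __name__ == "__main__":
    cuda_path, entries = find_cuda_path_entries()
    print("CUDA_PATH:", cuda_path)
    for entry in entries:
        print("PATH entry:", entry)
```

If the list is empty after editing the variables, the terminal usually has to be reopened before the new values are visible.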

If completing the above two steps still does not solve the problem, then I suggest you try specific versions of TensorFlow and PyTorch. I have been using LabGym with tensorflow==2.7.0 together with torch==2.0.1 + CUDA 11.7 on a virtual machine and it runs well on the GPU. You can install these specific versions by typing python3 -m pip install tensorflow==2.7.0 and python3 -m pip install torch==2.0.1. After that, start python3 and check the versions of the two, for example with import tensorflow as tf and print(tf.__version__).
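The version check can also be done without importing the heavy packages. This is a small sketch using the standard library's importlib.metadata; the function name installed_versions and the default package list are illustrative choices, not part of LabGym.

```python
from importlib import metadata


def installed_versions(packages=("tensorflow", "torch", "protobuf")):
    """Map each package name to its installed version string, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version}")
```

Run it with the same interpreter (and the same virtual environment) that launches LabGym, otherwise it reports the versions of a different installation.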

Let me know if these suggestions solve this issue. Thanks!

Cerilam commented 1 year ago

Hi

I installed Python 3.9.7 and created a new environment named LabGym with it: py -3.9 -m venv LabGym

Then I installed LabGym as usual: pip install LabGym

I changed the tensorflow version to 2.7.0: pip install --upgrade tensorflow==2.7.0 (the torch version was already torch==2.0.1)

I also needed to downgrade protobuf: pip install protobuf==3.20.1

Running import tensorflow as tf followed by tf.test.is_built_with_cuda() now returns True. Beforehand it returned False.
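Note that tf.test.is_built_with_cuda() only says whether the installed TensorFlow binary was compiled with CUDA support, while tf.config.list_physical_devices("GPU") says whether a GPU is actually visible at run time. A minimal sketch that reports both follows; the function name tensorflow_gpu_report is hypothetical, and it returns None if TensorFlow cannot be imported.

```python
def tensorflow_gpu_report():
    """Return TensorFlow's version, CUDA build flag and visible GPUs, or None if not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return {
        "version": tf.__version__,
        # True only means the binary was compiled against CUDA.
        "built_with_cuda": tf.test.is_built_with_cuda(),
        # An empty list means TensorFlow cannot see any GPU at run time.
        "visible_gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    print(tensorflow_gpu_report())
```

If built_with_cuda is True but visible_gpus is empty, the console output printed during the import is usually the place to look for missing-library messages.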

My GPU usage during training is around 5-10%. Is this a reasonable amount?

Also, I got an error at the end of the epoch (I never had it before):

Epoch 00001: val_loss improved from inf to 0.56480, saving model to C:\Users\bs\LabGym\Lib\site-packages\LabGym\models\MARCHEFDP
2023-08-16 13:33:35.087938: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
WARNING:absl:Found untraced functions such as lstm_cell_layer_call_fn, lstm_cell_layer_call_and_return_conditional_losses, lstm_cell_layer_call_fn, lstm_cell_layer_call_and_return_conditional_losses, lstm_cell_layer_call_and_return_conditional_losses while saving (showing 5 of 5). These functions will not be directly callable after loading.
Traceback (most recent call last):
  File "C:\Users\bs\LabGym\lib\site-packages\LabGym\gui_categorizers.py", line 997, in train_categorizer
    CA.train_combnet(self.data_path,self.path_to_categorizer,self.out_path,dim_tconv=self.dim_tconv,dim_conv=self.dim_conv,channel=self.channel,time_step=self.length,level_tconv=self.level_tconv,level_conv=self.level_conv,aug_methods=self.aug_methods,augvalid=self.augvalid,include_bodyparts=self.include_bodyparts,std=self.std,background_free=self.background_free,behavior_mode=self.behavior_mode,social_distance=self.social_distance)
  File "C:\Users\bs\LabGym\lib\site-packages\LabGym\categorizers.py", line 1053, in train_combnet
    H=model.fit([train_animations,train_pattern_images],trainY,batch_size=batch_size,validation_data=([test_animations_tensor,test_pattern_images_tensor],testY_tensor),epochs=1000000,callbacks=[cp,es,rl])
  File "C:\Users\bs\LabGym\lib\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\bs\LabGym\lib\site-packages\tensorflow\python\saved_model\save.py", line 1061, in _write_object_proto
    registered_name = registration.get_registered_name(obj)
AttributeError: module 'tensorflow.python.saved_model.registration' has no attribute 'get_registered_name'

Thanks again BS.

yujiahu415 commented 1 year ago

This error occurs because tensorflow 2.7.0 is too old for your system. See this post: https://stackoverflow.com/questions/76413746/saving-a-model-i-get-module-tensorflow-python-saved-model-registration-has-no. I think the solution is to find a tensorflow version that is between 'too old' and 'too new'. In that post they tried 2.12.0; maybe you can try that version too. The goal is simply to find a version that works with your GPU and also avoids the errors caused by outdated functions.

The low GPU usage can be normal because it is task-specific, for example if you were training a simple Categorizer on a small dataset. But was the training much faster than before now that the GPU is used?

Also, can PyTorch use your GPU? You can run import torch and torch.cuda.is_available() to find out. This is more critical for processing speed, especially for inference. During analysis (inference), Detectors (which use PyTorch) benefit much more from a GPU than Categorizers (which use TensorFlow).
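A slightly fuller version of that PyTorch check is sketched below. It also prints torch.version.cuda, which is None for CPU-only builds and can therefore help tell a CPU-only install apart from a driver or toolkit problem. The function name torch_gpu_report is hypothetical, and the sketch returns None if PyTorch cannot be imported.

```python
def torch_gpu_report():
    """Return PyTorch's version and CUDA status, or None if PyTorch is not importable."""
    try:
        import torch
    except ImportError:
        return None
    available = torch.cuda.is_available()
    return {
        "version": torch.__version__,
        # CUDA version the wheel was built against; None for CPU-only builds.
        "cuda_build": torch.version.cuda,
        "cuda_available": available,
        "device_name": torch.cuda.get_device_name(0) if available else None,
    }


if __name__ == "__main__":
    print(torch_gpu_report())
```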

Cerilam commented 1 year ago

Hey

Quick update, though I still need to try things out; I have not had much time recently.

torch.cuda.is_available() was returning False.

I uninstalled pytorch, torchvision and torchaudio and reinstalled them this way: conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

Now, training Detectors is super fast and uses the GPU at around 20%.

I am currently trying different tensorflow versions, and I'll be sure to keep you informed of my progress. (In the current state, training Categorizers uses neither the CPU nor the GPU according to Task Manager, but it is pretty fast. That seems odd to me, so I'd like to do more testing.)
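When Task Manager is ambiguous, one rough way to see where TensorFlow places work is to run a small operation and inspect the device of the result. This is only a sketch under the assumption of TensorFlow 2.x with eager execution; the function name tensorflow_op_device is made up, and a small matrix multiplication is not representative of a full Categorizer training run.

```python
def tensorflow_op_device():
    """Return the device string of a small matmul result, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    a = tf.random.uniform((256, 256))
    b = tf.random.uniform((256, 256))
    # The device string ends in GPU:0 when the op ran on the first GPU, CPU:0 otherwise.
    return tf.matmul(a, b).device


if __name__ == "__main__":
    print(tensorflow_op_device())
```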

Thanks again BS.

yujiahu415 commented 1 year ago

Thank you for the update and the details on how to solve the issue! They are very helpful! Now at least Detectors can use GPU and your analysis speed will be much faster than before. And yes, let me know how the tensorflow issues go. Hopefully the Categorizers can also use your GPU and the training and analysis speed will be even faster.

Cerilam commented 1 year ago

Final update: everything is now working!

The version of tensorflow I installed is tensorflow==2.10.0. I had to manually add a .dll file to my cuDNN folder, but tensorflow told me which file was missing when I ran tf.test.is_built_with_cuda(), so it should not be a problem if this happens to someone else.
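To see which CUDA and cuDNN versions an installed TensorFlow build expects, tf.sysconfig.get_build_info() can be queried, as in the sketch below. The function name tensorflow_build_info is hypothetical, and the keys are read with .get() because CPU-only builds may not report CUDA or cuDNN versions. As far as the TensorFlow install guide states, 2.10 is also the last release with GPU support on native Windows, which may be why it worked here.

```python
def tensorflow_build_info():
    """Return the CUDA/cuDNN versions TensorFlow was built against, or None if not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    info = tf.sysconfig.get_build_info()
    return {
        "is_cuda_build": info.get("is_cuda_build"),
        # These two may be None on CPU-only builds.
        "cuda_version": info.get("cuda_version"),
        "cudnn_version": info.get("cudnn_version"),
    }


if __name__ == "__main__":
    print(tensorflow_build_info())
```

Comparing these values with the output of nvcc --version and with the installed cuDNN files can narrow down which library is mismatched or missing.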

If someone has a similar problem, everything should be in this thread.

umyelab / LabGym

Issue using the GPU with LabGym #59