mlfpm / deepof

DeepLabCut based data analysis package including pose estimation and representation learning mediated behavior recognition
MIT License

Deepof can't use GPU #37

Closed Xiangshougudu closed 5 months ago

Xiangshougudu commented 7 months ago

Hello, thank you so much for providing such a powerful tool. Following your tips, I created DeepOF's virtual environment using conda on Windows 11.

Unfortunately, whenever I install the TensorFlow-GPU version, the installation reports that some packages conflict with each other, which prevents TensorFlow from using the GPU. I don't know how to solve this. Can you help me? Alternatively, could you provide the full requirements.txt or environment.yml so that I can replicate your environment exactly?

On another note, you write in the paper that DeepOF minimizes the ELBO, but I can't find the corresponding loss function in your code. Could you point it out?

Best wishes

lucasmiranda42 commented 7 months ago

Dear @Xiangshougudu,

Thank you for your interest in DeepOF!

We indeed detected an issue with GPU usage in Windows that we're currently troubleshooting. I'll get back to you as soon as possible within this week with a potential solution (we're about to release a new patched version).

Regarding the implementation of the ELBO minimization in our take on VaDE, I agree the code may be a bit confusing. Let's split the loss between reconstruction and KL divergence between prior and posterior:

  1. The reconstruction loss should be the most straightforward to find, located in line 1736 of deepof/models.py, as part of the train_step() method in the VaDE class:

         # Compute reconstruction loss
         reconstruction_loss = -tf.reduce_mean(reconstructions.log_prob(seq_inputs))
         total_loss += reconstruction_loss

  2. Now, the KL divergence term is indeed a bit trickier. If you look at line 1720, you'll see that the total_loss object comes from retrieving a set of losses that are computed within the model itself:

         total_loss = sum(self.vade.losses)

In particular, the KL divergence between the multimodal prior and the posterior is computed within the GaussianMixtureLatent class, using tf.keras.layers.Layer.add_loss(). You can find the exact lines here (1251-1286).

Bear in mind that the model does not use the standard implementation of ELBO, but rather a multi-modal version based on VaDE. Please do not hesitate to ask if you have any questions!
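To make the split concrete, here is a minimal plain-Python sketch of the pattern described above: a latent layer registers its own KL term through an add_loss-style method, and the training step starts from the sum of those registered losses before adding the reconstruction term. It is not DeepOF's code. ToyLatentLayer, gaussian_nll, and train_step_loss are hypothetical names, and the prior is simplified to a single standard normal rather than the Gaussian mixture that VaDE uses.

```python
import math


class ToyLatentLayer:
    """Hypothetical stand-in for a layer that registers its own KL term."""

    def __init__(self):
        self.losses = []

    def add_loss(self, value):
        # Mimics the idea of Keras' Layer.add_loss(): store a loss for later retrieval.
        self.losses.append(value)

    def __call__(self, mean, log_var):
        # Closed-form KL( N(mean, exp(log_var)) || N(0, 1) ), summed over dimensions.
        # Assumption: a single standard normal prior, not a Gaussian mixture.
        kl = 0.5 * sum(
            math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var)
        )
        self.add_loss(kl)
        return mean  # use the posterior mean as the latent code


def gaussian_nll(x, x_hat, sigma=1.0):
    """Negative log-likelihood of x under N(x_hat, sigma^2), summed over dimensions."""
    return sum(
        0.5 * math.log(2.0 * math.pi * sigma**2) + (xi - xh) ** 2 / (2.0 * sigma**2)
        for xi, xh in zip(x, x_hat)
    )


def train_step_loss(layer, x, x_hat, mean, log_var):
    """Negative ELBO: losses registered by the layer plus the reconstruction term."""
    layer.losses.clear()
    layer(mean, log_var)                   # the forward pass registers the KL term
    total_loss = sum(layer.losses)         # same shape as: total_loss = sum(self.vade.losses)
    total_loss += gaussian_nll(x, x_hat)   # same shape as: total_loss += reconstruction_loss
    return total_loss


layer = ToyLatentLayer()
print(train_step_loss(layer, [0.5, -1.0], [0.4, -0.9], [0.3, 0.0], [0.0, -0.2]))
```

The point of the sketch is only where each term enters: the KL term never appears explicitly in the training step, because it is collected from the layer's registered losses.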

Best wishes, and we'll keep you posted with the GPU fix in Windows, Lucas

Xiangshougudu commented 7 months ago

Hello,

Thank you for the update.

I tried to recreate the conda environment from your updated content using the following steps:

    conda create -n deepof python=3.9
    pip install -r requirements.txt

When I run the code below, I find that I still can't call the GPU:

    import tensorflow as tf
    print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Num GPUs Available: 0
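A slightly more detailed check can help narrow down why no GPU is listed. The sketch below is not part of DeepOF, and gpu_report is a hypothetical helper name. It reports the TensorFlow version, whether the installed binary was built with CUDA, and which GPU devices are visible. This matters on Windows because, according to the TensorFlow installation guide, 2.10 was the last release with GPU support on native Windows, and later releases need WSL2 for GPU access. Which case applies here depends on the installed version, which the thread does not state.

```python
import platform


def gpu_report():
    """Collect basic facts about the TensorFlow install and visible GPUs."""
    report = {"os": platform.system(), "tensorflow_installed": False}
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is missing or broken in this environment; nothing more to report.
        return report

    report["tensorflow_installed"] = True
    report["tf_version"] = tf.__version__
    # False means a CPU-only binary: no driver or CUDA setup will make a GPU appear.
    report["built_with_cuda"] = tf.test.is_built_with_cuda()
    report["gpus"] = [device.name for device in tf.config.list_physical_devices("GPU")]
    return report


for key, value in gpu_report().items():
    print(f"{key}: {value}")
```

If built_with_cuda is False, the problem lies in which TensorFlow build was installed rather than in the driver or CUDA setup.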

When I tried to execute the deepof_unsupervised_tutorial demo, I received the following error:

    2024-04-18 23:40:36.730028: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2024-04-18 23:40:36.732187: I tensorflow/core/profiler/lib/profiler_session.cc:101] Profiler session initializing.
    2024-04-18 23:40:36.732272: I tensorflow/core/profiler/lib/profiler_session.cc:116] Profiler session started.
    2024-04-18 23:40:36.732390: I tensorflow/core/profiler/lib/profiler_session.cc:128] Profiler session tear down.
    The initializer GlorotUniform is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initalizer instance more than once.
    2024-04-18 23:40:51.328344: W tensorflow/core/framework/dataset.cc:769] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
    2024-04-18 23:41:19.342734: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at summary_kernels.cc:65 : NOT_FOUND: Failed to create a NewWriteableFile: E:\code\behaviors\deepof-0.6.1\test_single_topview\deepof_tutorial_project_test\Trained_models\fit\deepof_unsupervised_VaDE_recurrent_encodings_input_type=coords_kmeans_loss=0.0_encoding=4_k=10_20240418-234036\train/events.out.tfevents.1713454879.DESKTOP-D42M5T7.3336.0.v2 : 系统找不到指定的路径。 [The system cannot find the path specified.] ; No such process
    Creating writable file E:\code\behaviors\deepof-0.6.1\test_single_topview\deepof_tutorial_project_test\Trained_models\fit\deepof_unsupervised_VaDE_recurrent_encodings_input_type=coords_kmeans_loss=0.0_encoding=4_k=10_20240418-234036\train/events.out.tfevents.1713454879.DESKTOP-D42M5T7.3336.0.v2
    Could not initialize events writer.
    Traceback (most recent call last):
      File "E:\code\behaviors\deepof-0.6.1\deepof_unsupervised.py", line 60, in <module>
        trained_model = my_deepof_project.deep_unsupervised_embedding(
      File "E:\code\behaviors\deepof-0.6.1\deepof\data.py", line 1785, in deep_unsupervised_embedding
        trained_models = deepof.model_utils.embedding_model_fitting(
      File "E:\code\behaviors\deepof-0.6.1\deepof\model_utils.py", line 1364, in embedding_model_fitting
        ae_full_model.fit(
      File "D:\ProgramData\Anaconda3\envs\deepof3D\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
        raise e.with_traceback(filtered_tb) from None
      File "D:\ProgramData\Anaconda3\envs\deepof3D\lib\site-packages\tensorflow\python\ops\gen_summary_ops.py", line 140, in create_summary_file_writer
        _result = pywrap_tfe.TFE_Py_FastPathExecute(
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 424: invalid continuation byte
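The final UnicodeDecodeError looks like a side effect rather than the root cause. A plausible reading is that Windows returned its "path not found" message in GBK (code page 936) and that the bytes were then decoded as UTF-8. The sketch below checks this under the assumption that the message is the standard Chinese-locale text 系统找不到指定的路径。 ("The system cannot find the path specified."). It uses only Python's built-in codecs and does not involve TensorFlow.

```python
# Assumption: this is the Windows message, encoded by the OS in GBK (code page 936).
message = "系统找不到指定的路径。"
raw = message.encode("gbk")

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The failure reported when GBK bytes are read as UTF-8.
print(error)
# What a lenient UTF-8 decode shows instead: mostly replacement characters.
print(raw.decode("utf-8", errors="replace"))
# Decoding with the matching codec recovers the original message.
print(raw.decode("gbk"))
```

Under this assumption, the decode fails on the same byte value (0xd5) and with the same reason as in the traceback, which suggests that the underlying problem is the NOT_FOUND error for the log directory and not the text encoding itself.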

lucasmiranda42 commented 7 months ago

Dear @Xiangshougudu,

Thank you for the follow-up! We released a new version (0.6.1) to PyPI a few days ago, but we still could not test it with a Windows GPU machine (that's why I hadn't got back to you yet).

It would indeed be wonderful if you could try it out! To install it, though, you should avoid using the requirements.txt file. Instead, you can follow the instructions in our documentation, by either:

  1. Creating the conda environment and installing the package via pip:

         conda create -n deepof python=3.9
         conda activate deepof
         pip install deepof

  2. Cloning the latest version of our repository, and installing via poetry:

         conda create -n deepof python=3.9
         conda activate deepof
         conda install poetry
         git clone https://github.com/mlfpm/deepof.git
         cd deepof
         poetry install

  3. Pulling our latest Docker image (note that this assumes that you have Docker installed):

         # download the latest available image
         docker pull lucasmiranda42/deepof:latest
         # run the image in interactive mode, enabling you to open python and import deepof
         docker run -it lucasmiranda42/deepof
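Whichever route is used, it can be worth confirming which package versions actually ended up in the active environment before testing the GPU. The snippet below is a small sketch using only the Python standard library. installed_version is a hypothetical helper name, and the package names are the ones mentioned in this thread.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages discussed in this thread; None means "not installed in this environment".
for name in ("deepof", "tensorflow", "tensorflow-gpu"):
    print(f"{name}: {installed_version(name)}")
```

Run it with the interpreter of the environment created above (after conda activate deepof), since each conda environment has its own set of installed packages.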

Please let us know if you succeed with any of them! And we'll update the thread as soon as we manage to test on a Windows GPU.

Best wishes, and thank you very much once again for your interest, Lucas

Xiangshougudu commented 7 months ago

Thanks for your reply, I will try it again.

lucasmiranda42 commented 5 months ago

Dear @Xiangshougudu,

The patch indeed seems to have fixed the issue on Windows. I will close the thread for now, but of course feel free to reopen it if you still run into trouble!

Best, Lucas

Xiangshougudu commented 5 months ago

Dear @lucasmiranda42,

Thank you very much for your reply. Were you able to use the GPU on Windows, and was that within a conda environment? On my Windows machine, DeepOF still does not use the GPU. When I set up the DeepOF 0.6.1 environment earlier, I found that I needed to install tensorflow-gpu manually. The TensorFlow website states that GPU support is already integrated into the tensorflow package, but I tried to install tensorflow-gpu anyway. It turns out that its dependencies conflict with those of TensorFlow, so I am waiting for tensorflow-gpu to be updated to a version without this conflict. I'll try to configure the environment again later; please let me know if you have a new way to set it up. Thank you again!

Best wishes