NaN training issue - Githubissues

kristinbranson / APT

Animal Part Tracker

GNU General Public License v3.0

NaN training issue #391

Closed happyqiu closed 2 years ago

happyqiu commented 2 years ago

Hello, I'm using Windows and MATLAB 2022a to run APT with DLC, but training always fails (N. iterations: NaN/10000). The log file shows this:

Job 1:

C:\Users\yuechen\Documents.apt\tp754f9c10_94f3_440a_925f_8b8a4ca620b7\YQ2_20220517_side\20220531T120619view0_20220531T120636_new.log

C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Using TensorFlow backend.
Traceback (most recent call last):
  File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\APT_interface.py", line 35, in <module>
    from deeplabcut.pose_estimation_tensorflow.train import train as deepcut_train
  File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\deeplabcut\__init__.py", line 51, in <module>
    from deeplabcut.pose_estimation_tensorflow import train_network, return_train_network_path
  File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\deeplabcut\pose_estimation_tensorflow\__init__.py", line 18, in <module>
    from deeplabcut.pose_estimation_tensorflow.evaluate import *
  File "C:\Users\yuechen\Documents\MATLAB\APT-develop\deepnet\deeplabcut\pose_estimation_tensorflow\evaluate.py", line 18, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'

After I installed this package, the error moved on to other modules that could not be found. Is there anything else I need to install besides the packages mentioned in the APT documentation? Thanks a lot!

allenleetc commented 2 years ago

Hi happyqiu,

Thanks for your report! Yes I am encountering similar issues and our Conda environment probably needs an update.

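One question -- in setting up your Conda environment, did you follow the wiki (https://github.com/kristinbranson/APT/wiki/Windows-&-Conda-Setup) and use the environment file located at <APT>\condaenv\env.yml? Or did you follow the instructions in the doc (http://kristinbranson.github.io/APT/LocalBackEnd.html)?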

I ask just for context as the error you encountered differs slightly from mine. There's no need to try out the second option as we probably need to do a quick update. Will be back thanks again!

happyqiu commented 2 years ago

Hi Allen,

Thanks for your quick response! I followed the doc to set up the environment.

kristinbranson commented 2 years ago

I updated the instructions here: http://kristinbranson.github.io/APT/LocalBackEnd.html for setting up Conda and Windows. I've only tested so far with the multianimal branch, which is the branch everyone in our lab is working on.

happyqiu commented 2 years ago

Thanks for your update! But another error now pops up:

......
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 36, in <module>
    from tensorflow.python.profiler import trace
ImportError: cannot import name 'trace' from 'tensorflow.python.profiler'

When I simply typed the command 'from tensorflow.python.profiler import trace' in Spyder, it seemed to work fine. Do you have any idea what is going on?

Thanks!
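
A note on the observation above: an import that fails during training but succeeds in Spyder may mean the two are running different Python interpreters, although nothing in the thread confirms this. A minimal sketch for checking, using only the standard library (the helper name `module_origin` is illustrative and not part of APT):

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file a module would be loaded from, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised when a parent package is missing or fails to import.
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Run this both in Spyder and in the activated APT conda environment,
    # then compare the interpreter path and the package locations.
    print("interpreter:", sys.executable)
    for name in ("tensorflow", "tensorflow_estimator"):
        print(name, "->", module_origin(name))
```

If the interpreter paths differ between the two runs, the Spyder result says nothing about the environment that APT launches.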

kristinbranson commented 2 years ago

Do you know which git branch of APT you are using? I only tested on the multianimal branch. Did you run the gpu/backend check?

For less delayed replies, please email @.*** I check my GMail account much more frequently than my HHMI mail.

happyqiu commented 2 years ago

I downloaded the APT-develop one. (https://github.com/kristinbranson/APT) I also checked the backend configuration. Both APT activation and GPU tests passed.

kristinbranson commented 2 years ago

You can switch to the multianimal branch with the command git checkout multianimal

happyqiu commented 2 years ago

Thanks for your suggestion, but I still got the same error message with multianimal one.

Traceback (most recent call last):
  File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\APT_interface.py", line 44, in <module>
    import PoseUNet_dataset as PoseUNet
  File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\PoseUNet_dataset.py", line 1, in <module>
    from PoseCommon_dataset import PoseCommon, PoseCommonMulti, PoseCommonRNN, PoseCommonTime, conv_relu3, conv_shortcut
  File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\PoseCommon_dataset.py", line 26, in <module>
    from tensorflow.contrib.layers import batch_norm
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\__init__.py", line 50, in __getattr__
    module = self._load()
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow\__init__.py", line 44, in _load
    module = _importlib.import_module(self.__name__)
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\contrib\__init__.py", line 39, in <module>
    from tensorflow.contrib import compiler
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\contrib\compiler\__init__.py", line 21, in <module>
    from tensorflow.contrib.compiler import jit
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\contrib\compiler\__init__.py", line 22, in <module>
    from tensorflow.contrib.compiler import xla
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\contrib\compiler\xla.py", line 22, in <module>
    from tensorflow.python.estimator import model_fn as model_fn_lib
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\python\estimator\model_fn.py", line 26, in <module>
    from tensorflow_estimator.python.estimator import model_fn
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\__init__.py", line 10, in <module>
    from tensorflow_estimator._api.v1 import estimator
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\_api\v1\estimator\__init__.py", line 10, in <module>
    from tensorflow_estimator._api.v1.estimator import experimental
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\_api\v1\estimator\experimental\__init__.py", line 10, in <module>
    from tensorflow_estimator.python.estimator.canned.dnn import dnn_logit_fn_builder
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\python\estimator\canned\dnn.py", line 27, in <module>
    from tensorflow_estimator.python.estimator import estimator
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 36, in <module>
    from tensorflow.python.profiler import trace
ImportError: cannot import name 'trace' from 'tensorflow.python.profiler' (C:\Users\yuechen\anaconda3\envs\APT\lib\site-packages\tensorflow_core\python\profiler\__init__.py)

happyqiu commented 2 years ago

Hi, I'm thinking it may be a compatibility issue. I'm using CUDA 11.7. Which version of tensorflow-gpu do you think I should use?

allenleetc commented 2 years ago

@happyqiu I'm using CUDA 11.6 and I believe the conda environment in the multianimal branch is working on my machine. I seem to be hitting a different bug, maybe about not having Git installed on my Windows machine; one step at a time though.

If you activate your APT conda environment, and do a conda list, what versions are shown for tensorflow and tensorflow-estimator?

To confirm, after changing to the multianimal branch, did you re-create your conda environment using the updated environment file at <APTRoot>\install\apt_conda_environment.yml? Eg did you run this:

conda env create -f c:\path\to\APT\install\apt_conda_environment.yml --force
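
The version question above can also be answered from Python. The sketch below assumes, without confirmation from the thread, that a mismatch between `tensorflow` and `tensorflow-estimator` is one plausible cause of the ImportError. The helper names `get_version` and `same_major_minor` are illustrative, and `conda list` remains the check that was actually requested.

```python
def get_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        import pkg_resources  # fallback for older interpreters

        try:
            return pkg_resources.get_distribution(dist_name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def same_major_minor(a, b):
    """True when two version strings agree on their first two components."""
    return a.split(".")[:2] == b.split(".")[:2]


if __name__ == "__main__":
    # The GPU build may be installed under a different distribution name.
    tf_ver = get_version("tensorflow") or get_version("tensorflow-gpu")
    est_ver = get_version("tensorflow-estimator")
    print("tensorflow:", tf_ver)
    print("tensorflow-estimator:", est_ver)
    if tf_ver and est_ver and not same_major_minor(tf_ver, est_ver):
        print("Versions differ; the environment may need to be re-created.")
```

Run it with the APT conda environment activated so that it inspects the same packages the training job uses.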

allenleetc commented 2 years ago

@happyqiu just FYI I pushed a minor fix and DLC Training is running successfully on my Windows machine. After we sort out your environment, it may be helpful to pull the latest before training.

allenleetc commented 2 years ago

@mkabra Training finished but the number of iterations on disk (7000) far exceeds my setting in the Tracking Parameters (1000). Tracking still proceeds but looks odd etc. Flagging maybe some kind of edge case for this net?

happyqiu commented 2 years ago

Hi Allen, the version for both tensorflow and tensorflow-estimator is 1.14.0. I reinstalled CUDA 10.0 and recreated the environment, and a different bug came up. The file does exist.

Traceback (most recent call last):
  File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\APT_interface.py", line 4805, in <module>
    main(sys.argv[1:])
  File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\APT_interface.py", line 4772, in main
    repo_info = PoseTools.get_git_commit()
  File "C:\Users\yuechen\Documents\MATLAB\APT-multianimal\deepnet\PoseTools.py", line 1670, in get_git_commit
    label = subprocess.check_output(cmd).strip()
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\subprocess.py", line 488, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "C:\Users\yuechen\anaconda3\envs\APT\lib\subprocess.py", line 1207, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
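
The traceback ends inside `subprocess.check_output`, which raises `FileNotFoundError` when the executable it is asked to run cannot be found, so the "file" in the message is most likely the program itself rather than a data file. The sketch below shows one defensive pattern, assuming the command is `git` (consistent with the function name `get_git_commit`, although the actual `cmd` is not shown in the thread). `get_git_commit_safe` is a hypothetical helper, not APT's code and not the fix that was pushed.

```python
import shutil
import subprocess


def get_git_commit_safe(repo_dir=".", git_exe="git"):
    """Return the current commit hash as a string, or None if it cannot be determined."""
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(git_exe) is None:
        return None
    try:
        out = subprocess.check_output(
            [git_exe, "rev-parse", "HEAD"],
            cwd=repo_dir,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.CalledProcessError, OSError):
        # Not a git repository, or the process could not be started.
        return None
    return out.decode().strip()


if __name__ == "__main__":
    commit = get_git_commit_safe()
    print("commit:", commit if commit else "unknown (git unavailable or not a repo)")
```

Returning `None` keeps a missing Git installation from aborting a training run that only wanted the commit hash for logging.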

allenleetc commented 2 years ago

OK, we may have made progress! I think this was the bug I fixed; did/can you pull the latest code in the multianimal branch? git pull

happyqiu commented 2 years ago

Hi Allen, looks like it starts training now! Thank you so much for your help!!!

allenleetc commented 2 years ago

Great, glad to hear it! Kristin did the hard part: she updated the conda environment!

happyqiu commented 2 years ago

Thank you all! One more question, I found that during the training, only 1% of the GPU is used but 70% of CPU is used. Is that common? I thought the GPU was supposed to be used. Would that be because of the tensorflow version in the environment?

allenleetc commented 2 years ago

I think this is not unusual as the training is often CPU-bound by the data pipeline (data read, augmentation+transformation etc).

If the training is proceeding at any kind of normal pace -- the Training Monitor is updating, etc -- then I think you should be utilizing your GPU.

That said @mkabra knows best, maybe he has thoughts. I also don't know how accurate the instrumentation is (eg task manager).

allenleetc commented 2 years ago

@happyqiu OK I may have spoken too soon, just FYI we are chasing some things down here. Thanks for your patience while we debug.

allenleetc commented 2 years ago

@mkabra We are hitting https://github.com/open-mmlab/mmcv/pull/1543 while building mmcv-full on Win10. Do you need mmcv==1.3.3 or can we bump to mmcv==1.4.0 or higher? Alternatively, pytorch==1.8.0 or higher I guess.

allenleetc commented 2 years ago

@happyqiu Thanks for your patience! We have an interim fix for you while we iron out the conda environment in the multianimal branch.

The interim fix is in the develop branch. In your APT repo can you please go back to develop and pull the latest?

git checkout develop
git pull

You should now have an updated conda environment yml file in <APT>\install. Create the APT conda environment as before:

conda env create -f c:\path\to\APT\install\apt_conda_environment.yml --force

I tested training/tracking with MDN and DLC. Just FYI I am testing with CUDA 11.7.

You were right that our last conda environment was not using the GPU! That was my mistake. During Training, you can confirm GPU usage by selecting "Show log files" in the Training Monitor and pressing "Go". For instance my logs for a recent train contain lines like the following:

2022-06-15 15:46:58.396255: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2022-06-15 15:46:58.396272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2022-06-15 15:46:58.781641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-06-15 15:46:58.781660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2022-06-15 15:46:58.781665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2022-06-15 15:46:58.781761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6703 MB memory) -> physical GPU (device: 0, name: Quadro RTX 4000, pci bus id: 0000:01:00.0, compute capability: 7.5)

Please let us know if you can get going with this update!
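
The log check described above can be automated. This is a minimal sketch that assumes the log wording matches the single example quoted in this comment; other TensorFlow versions may phrase the line differently, and `find_gpu_devices` is an illustrative name rather than an APT function.

```python
import re

# Pattern modelled on the "Created TensorFlow device" line quoted above.
_GPU_LINE = re.compile(
    r"Created TensorFlow device \((?P<device>\S+) with (?P<memory_mb>\d+) MB memory\)"
    r" -> physical GPU \(device: (?P<index>\d+), name: (?P<name>[^,]+),"
)


def find_gpu_devices(log_text):
    """Return one dict per GPU device that the log reports as created."""
    devices = []
    for match in _GPU_LINE.finditer(log_text):
        devices.append(
            {
                "device": match.group("device"),
                "memory_mb": int(match.group("memory_mb")),
                "index": int(match.group("index")),
                "name": match.group("name").strip(),
            }
        )
    return devices


# Sample line copied from the log excerpt in this thread.
SAMPLE = (
    "2022-06-15 15:46:58.781761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] "
    "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6703 MB memory) "
    "-> physical GPU (device: 0, name: Quadro RTX 4000, pci bus id: 0000:01:00.0, "
    "compute capability: 7.5)"
)

if __name__ == "__main__":
    found = find_gpu_devices(SAMPLE)
    print(found if found else "No GPU device lines found; training may be running on the CPU.")
```

An empty result is only a hint, since a log in a different format would also produce no matches.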

happyqiu commented 2 years ago

Hi Allen,

Thanks for your update! I tried the latest APT_develop, and it worked! I got a similar log file as yours, and the GPU was pretty much used.

Really appreciate your help!

happyqiu commented 1 year ago

Hi Allen,

Sorry to trouble you again, but I got some new problems with the latest APT-develop.

When I start training, building the training dataset takes much, much longer than it used to.
During training, the 'dist' plot in the monitor looks very weird (snapshot attached).

Have you ever seen these before?

allenleetc commented 1 year ago

Hey @happyqiu,

How long is it taking for training to start up? Can you share your project (.lbl) file?

For the 'dist' plot, I can't find the attachment, could you please re-attach? (Or if I am missing it please let me know!)

Thanks, Allen