kristinbranson / APT

Animal Part Tracker
GNU General Public License v3.0

Assertion failed + partially-labeled rows #361

Open rl72 opened 3 years ago

rl72 commented 3 years ago

Hi,

I tried to import a DLC project (with 1000+ labeled frames, good tracking) and retrain the network in APT so that I can then use it in JAABA. So far, the project loaded successfully in APT and the backend test passed for local (Conda), but when I start the training, I get the following error message: "Error using DeepTracker/genContainerMountPathBsubDocker (line 1465) Assertion failed." Also, only some of the labeled frames were read, and I got the following warning message: "Warning: Not including 965 partially-labeled rows." What can be done to avoid losing the frames where not all the points are labeled?

Thanks a lot, Luminita

kristinbranson commented 3 years ago

Good questions. Would it be possible for you to share the DLC project with us so we can debug? I suspect the issues are:


RitvikTeegavarapu commented 3 years ago

Hi there, we are experiencing the same error, which says "Error Using DeepTracker/ContainerMountPathBsubDocker (line 1465) Assertion failed." We reproduced this error on two Windows machines with a local GPU conda backend, without importing a DLC project, using frames labeled locally in APT. Hope this adds better context for the error. Best wishes.

rl72 commented 3 years ago

Hey Kristin,

Thanks a lot for your answer! I have emailed you the DLC project; let me know if you need anything else. We can try to set up the Docker backend in the meantime and let you know how it goes. Also, please let me know how to change the flags for the unlabeled points.

Thanks, Luminita

allenleetc commented 3 years ago

@rl72 @RitvikTeegavarapu Hi guys, I just pushed a fix for a regression we had in the Conda backend. Please pull the latest and we can try again.

After the fix, I can successfully train and track with MDN, our in-house network. Just FYI, on my machine, DLC is acting a little strangely at the moment and hanging during training -- not entirely sure what is going on, but I am investigating. This could just be a temporary issue with my Windows machine, as its GPU can be finicky.

Regarding the partially-labeled rows, from your project it looks like the animal landmarks are almost always labeled, but one point (an external object?) is only sparsely labeled. Mayank @mkabra any thoughts?

rl72 commented 3 years ago

Hi, thanks a lot for your reply. I will try again with the latest version and let you know how it goes.

As for the partially labeled rows, it is indeed an external object that is only sparsely present. Maybe one option could be to turn off its detection for training/tracking altogether?

Luminita

RitvikTeegavarapu commented 3 years ago

@allenleetc Hi there, thank you for responding so quickly. We attempted to restart the program and it still produced the same error. If there is any other fix you can think of, we would be very grateful!

allenleetc commented 3 years ago

@RitvikTeegavarapu Whoops, I committed but hadn't pushed! My mistake. This time I pushed it; I can see the commit above this message. Can you please pull and try again?

RitvikTeegavarapu commented 3 years ago

Hi, we are now getting a new error that says "Training stopped after NaN/20000 iterations. Save trained model to file?" It seems to stop the training partway through. Is there possibly a solution to this?

allenleetc commented 3 years ago

OK, it sounds like we are past the first error but there is a new one. In the Training Monitor, there are options in the pulldown menu at the bottom for "Show Log Files" and "Show Error Files". Can you select those options (and press Go after each) and cut+paste the results here? Alternatively, when training fails there will usually be a large error trace printed out in your MATLAB Command Window. This is often identical to the logs, but if not, that could be useful information to cut+paste here as well.

RitvikTeegavarapu commented 3 years ago

The following are the errors we received.

Job 1:

C:\Users\PhillipsM\Documents.apt\tp070e6157_25b8_4f23_aeff_a1bd3734260c\t\20210630T152743view0_20210630T152745_new.log

C:\Users\PhillipsM.conda\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\PhillipsM.conda\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\PhillipsM.conda\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\PhillipsM.conda\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\PhillipsM.conda\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\PhillipsM.conda\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Traceback (most recent call last):
  File "C:\Users\PhillipsM\Documents\MATLAB\APT-develop\deepnet\APT_interface.py", line 24, in <module>
    import PoseUNet_dataset as PoseUNet
  File "C:\Users\PhillipsM\Documents\MATLAB\APT-develop\deepnet\PoseUNet_dataset.py", line 1, in <module>
    from PoseCommon_dataset import PoseCommon, PoseCommonMulti, PoseCommonRNN, PoseCommonTime, conv_relu3, conv_shortcut
  File "C:\Users\PhillipsM\Documents\MATLAB\APT-develop\deepnet\PoseCommon_dataset.py", line 25, in <module>
    import PoseTools
  File "C:\Users\PhillipsM\Documents\MATLAB\APT-develop\deepnet\PoseTools.py", line 8, in <module>
    from past.builtins import cmp
ModuleNotFoundError: No module named 'past'

rl72 commented 3 years ago

Hi again, we encountered the same issue as @RitvikTeegavarapu. When I select "Show Error Messages" in the pulldown menu of the Training Monitor, it says "No error messages."

As for the log file:

Job 1:

C:\Users\realtime\Documents.apt\tp5fba5621_a854_4db6_9801_6747f5690af2\test\20210630T214045view0_20210630T214048_new.log

C:\Users\realtime\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\realtime\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\realtime\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\realtime\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\realtime\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\realtime\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Using TensorFlow backend.
Traceback (most recent call last):
  File "E:\APT-develop\deepnet\APT_interface.py", line 35, in <module>
    from deeplabcut.pose_estimation_tensorflow.train import train as deepcut_train
  File "E:\APT-develop\deepnet\deeplabcut\__init__.py", line 51, in <module>
    from deeplabcut.pose_estimation_tensorflow import train_network, return_train_network_path
  File "E:\APT-develop\deepnet\deeplabcut\pose_estimation_tensorflow\__init__.py", line 18, in <module>
    from deeplabcut.pose_estimation_tensorflow.evaluate import *
  File "E:\APT-develop\deepnet\deeplabcut\pose_estimation_tensorflow\evaluate.py", line 18, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'
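Both logs above are long, but each ends in the same kind of failure: a ModuleNotFoundError naming one package ('past' in the first log, 'pandas' in the second). As a rough sketch for anyone triaging similar logs, the missing module names can be pulled out with a regular expression. The helper name `missing_modules` and the shortened sample text are hypothetical and are not part of APT.

```python
import re

# Matches the final line Python prints when an import cannot be resolved.
_MISSING = re.compile(r"ModuleNotFoundError: No module named '([^']+)'")


def missing_modules(log_text):
    """Return the distinct module names reported missing, in order of appearance."""
    seen = []
    for name in _MISSING.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Shortened stand-in for the two training logs quoted in this thread.
sample = (
    "from past.builtins import cmp "
    "ModuleNotFoundError: No module named 'past'\n"
    "import pandas as pd "
    "ModuleNotFoundError: No module named 'pandas'\n"
)

print(missing_modules(sample))
```

The FutureWarning lines from TensorFlow's dtypes.py are ignored by this pattern, which is intentional: they are warnings, and the import failure at the end is what stops training.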

allenleetc commented 3 years ago

Great, thanks for these details! I think I am reproducing to a large extent and can guess the issue. Will be back in touch with a fix.

allenleetc commented 3 years ago

Hey guys, quick update on this. I pushed an update to our Conda environment/yaml to bring it up-to-date with recent development. This fell through the cracks, thanks for your patience.

I am able to train and track now with MDN, but there may still be an issue with DLC that we are following up on. This may take a little time. @RitvikTeegavarapu since you are running MDN, you could give this a try now. Here were my steps to update my APT Conda environment:

  1. Pull latest changes from GitHub
  2. conda remove --name APT --all
  3. conda env create -f d:\path\to\APT\condaenv\env.yml
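After recreating the environment with the steps above, a quick sanity check along these lines could confirm that the modules named in the earlier tracebacks are now importable. This is only a sketch: it assumes it is run with the Python interpreter of the APT conda environment, and the helper name `check_modules` is hypothetical. The module list (`past`, `pandas`, `tensorflow`) is taken from the logs in this thread and is not a complete list of APT's dependencies.

```python
import importlib.util


def check_modules(names):
    """Map each top-level module name to True if this interpreter can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Modules that appear in the tracebacks and warnings quoted in this thread.
results = check_modules(["past", "pandas", "tensorflow"])
for name, found in results.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

`importlib.util.find_spec` only locates a module without importing it, so the check stays fast even for heavy packages such as TensorFlow. It does not prove that the import itself will succeed.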

@rl72, it looks like you may be trying to train DLC -- you might want to hold off for now as we will be trying some more testing and have some changes pending on the import and partially-labeled rows issue.

RitvikTeegavarapu commented 3 years ago

@allenleetc We are getting the same error after pulling the latest version of APT from GitHub. However, we are going to transition to Linux soon for a number of reasons, and will update you on our progress with that system. Thank you so much for assisting us!

allenleetc commented 3 years ago

OK, sounds good. I'm surprised you got the same error (did you remove and recreate the conda environment using the new yml?) -- but with your move to Linux this sounds moot.

allenleetc commented 3 years ago

@rl72 I pushed updates so that unlabeled landmarks in the DLC project will be considered "fully occluded" in APT and then passed along to the backend network (DLC or MDN). It may be worth giving things a try now -- as a first step, please pull the latest updates and update your conda environment according to the steps above.
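To illustrate the idea in plain terms, here is a small sketch of the difference between dropping partially-labeled rows and keeping them with the unlabeled landmarks flagged as occluded. It assumes unlabeled points are stored as NaN coordinates; the data layout and the function names are hypothetical and do not reflect APT's or DLC's internal formats.

```python
import math


def is_labeled(point):
    """A point counts as labeled when both coordinates are real numbers."""
    x, y = point
    return not (math.isnan(x) or math.isnan(y))


def classify_row(points):
    """Return 'full', 'partial', or 'unlabeled' for one frame's landmarks."""
    n_labeled = sum(is_labeled(p) for p in points)
    if n_labeled == len(points):
        return "full"
    if n_labeled == 0:
        return "unlabeled"
    return "partial"


def occlusion_flags(points):
    """Flag every unlabeled landmark as occluded instead of discarding the row."""
    return [not is_labeled(p) for p in points]


nan = math.nan
rows = [
    [(10.0, 12.0), (30.5, 8.0), (44.0, 20.0)],  # all landmarks labeled
    [(11.0, 13.0), (31.0, 9.0), (nan, nan)],    # sparse external object missing
]

# Earlier behavior described in this thread: only fully labeled rows survive.
kept_before = [r for r in rows if classify_row(r) == "full"]
# Behavior described above: partial rows are kept, with per-landmark flags.
kept_after = [(r, occlusion_flags(r)) for r in rows if classify_row(r) != "unlabeled"]

print(len(kept_before), len(kept_after))
```

In this toy example the second row would have been excluded under the old behavior, which matches the "Not including 965 partially-labeled rows" warning reported at the top of the thread.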

We still have some Windows/conda testing pending. Also flagging @mkabra as it sounds like passing a large number of fully-occluded/missing landmarks to DLC wasn't something we had tested out heavily in the past.

rl72 commented 3 years ago

Hi @allenleetc, we pulled the latest updates and updated the conda environment too, yet I got the same error as before ("Training stopped after NaN/500000 iterations. Save trained model to file?"). The new log entries from the Training Monitor are below.
Also, only the fully labeled frames were included for training, just like before. If there is anything else we can do on our side, please let me know. Thanks a lot for your quick updates and responsiveness!

Job 1:

C:\Users\realtime\Documents.apt\tp96526766_0693_493a_bac2_47d54fb88d53\test\20210702T190724view0_20210702T191448_new.log

2021-07-02 19:14:54.302963: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

allenleetc commented 3 years ago

@rl72 Ah, interesting, OK. For the partially-labeled row issue, I should clarify that you will need to run the createFromDeeplabcut script again to regenerate the APT project from the DLC project. If you didn't do that, it would explain why nothing changed.

The other error is independent and, I am guessing, should be easy to resolve. I will get to it soon, when I come back for the additional testing. You can probably hold off for now, since even if you regenerate the project you will still hit this error.

Thanks for your patience with this startup pain! Lots of active development recently and Windows also doesn't get as much testing.

allenleetc commented 2 years ago

Hey @rl72, I pushed a fix for this new/latest issue. With this fix, I am able to train DLC successfully in the new Conda environment. Let us know if you can get running! As I mentioned, in addition to the new Conda environment, please regenerate the APT project using createFromDeeplabcut, as I have updated that utility to include the partially-labeled rows.