Open rl72 opened 3 years ago
Good questions. Would it be possible for you to share the DLC project with us so we can debug? I suspect the issues are:
On Wed, Jun 30, 2021 at 5:10 AM rl72 @.***> wrote:
Hi,
I tried to import a DLC project (with 1000+ labeled frames, good tracking) and retrain the network in APT so that I can then use it in JAABA. So far, the project loaded successfully in APT and the backend test passed for local (Conda), but when I start the training, I get the following error message: "Error using DeepTracker/genContainerMountPathBsubDocker (line 1465) Assertion failed." Also, only some of the labeled frames were read, and I got the following warning message: "Warning: Not including 965 partially-labeled rows." What can be done to avoid losing the frames where not all the points are labeled?
Thanks a lot, Luminita
View it on GitHub: https://github.com/kristinbranson/APT/issues/361
Hi there, we are experiencing the same error, which says "Error using DeepTracker/genContainerMountPathBsubDocker (line 1465) Assertion failed." We reproduced this error on two Windows machines with a local GPU Conda backend, without importing a DLC project, by labeling frames locally instead. Hope this adds better context for the error. Best wishes.
Hey Kristin,
Thanks a lot for your answer! I have emailed you the DLC project; let me know if you need anything else. We can try to set up the Docker backend in the meantime and let you know how it goes. Also, please let me know how to change the flags for the unlabeled points.
Thanks, Luminita
@rl72 @RitvikTeegavarapu Hi guys, I just pushed a fix for a regression we had in the Conda backend. Please pull the latest and we can try again.
After the fix, I can successfully train and track with MDN, our in-house network. Just FYI, on my machine, DLC is acting a little strangely at the moment and hanging during training. I'm not entirely sure what is going on, but I am investigating. This could just be a temporary issue with my Windows machine, as its GPU can be finicky.
Regarding the partially-labeled rows, from your project it looks like the animal landmarks are almost always labeled, but one point (an external object?) is only sparsely labeled. Mayank @mkabra any thoughts?
Hi, thanks a lot for your reply. I will try again with the latest version and let you know how it goes.
As for the partially labeled rows, it is indeed an external object that is only sparsely present. Maybe one option would be to turn off its detection for training/tracking altogether?
Luminita
@allenleetc Hi there, thank you for responding so quickly. We attempted to restart the program and it still produced the same error. If there is any other fix you can think of, we would be very grateful!
@RitvikTeegavarapu Whoops, I committed but hadn't pushed! My mistake. This time I pushed it; I see the commit above this message. Can you please pull and try again?
Hi, we are now getting a new error that says "Training stopped after NaN/20000 iterations. Save trained model to file?" Training seems to stop partway through. Is there a solution to this?
OK, it sounds like we are past the first error but there is a new one. In the Training Monitor, there are options in the pulldown menu at the bottom for "Show Log Files" and "Show Error Files". Can you select those options (and press Go after each) and cut+paste the results here? Alternatively, when training fails there will usually be a large error trace printed out in your MATLAB Command Window. This is often identical to the logs, but if not, that could be useful information to cut+paste here as well.
The following are the errors we received.
C:\Users\PhillipsM.conda\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\PhillipsM.conda\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\PhillipsM.conda\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\PhillipsM.conda\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\PhillipsM.conda\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\PhillipsM.conda\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Traceback (most recent call last):
File "C:\Users\PhillipsM\Documents\MATLAB\APT-develop\deepnet\APT_interface.py", line 24, in
Hi again, we encountered the same issue as @RitvikTeegavarapu. When I select "Show Error Messages" in the pulldown menu of the Training Monitor, it shows "No error messages."
As for the log file:
C:\Users\realtime\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\realtime\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\realtime\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\realtime\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\realtime\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\realtime\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Using TensorFlow backend.
Traceback (most recent call last):
File "E:\APT-develop\deepnet\APT_interface.py", line 35, in
Great, thanks for these details! I think I am reproducing to a large extent and can guess the issue. Will be back in touch with a fix.
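A note for anyone reading the logs pasted above: the repeated FutureWarning lines come from NumPy flagging a deprecated dtype spelling inside TensorFlow's dtypes.py and are generally harmless. The Traceback at the end is the part that matters. Below is a minimal sketch (Python, standard library only) for trimming a pasted log down to the traceback. The function name `extract_traceback` is hypothetical and not part of APT, and the sample log is an abbreviated version of the one posted here.

```python
def extract_traceback(log_text):
    """Return the log from the first 'Traceback' line onward, or '' if none."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if line.lstrip().startswith("Traceback (most recent call last)"):
            return "\n".join(lines[i:])
    return ""


# Abbreviated sample modeled on the logs in this thread (paths shortened).
sample = "\n".join([
    "dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated",
    '_np_qint8 = np.dtype([("qint8", np.int8, 1)])',
    "Using TensorFlow backend.",
    "Traceback (most recent call last):",
    'File "APT_interface.py", line 35, in',
])

print(extract_traceback(sample))
```

If the function returns an empty string, the log contains no Python traceback and the failure is more likely to show up in the MATLAB Command Window.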
Hey guys, quick update on this. I pushed an update to our Conda environment/yaml to bring it up-to-date with recent development. This fell through the cracks, thanks for your patience.
I am able to train and track now with MDN, but there may still be an issue with DLC that we are following up on. This may take a little time. @RitvikTeegavarapu since you are running MDN, you could give this a try now. Here were my steps to update my APT Conda environment:
conda remove --name APT --all
conda env create -f d:\path\to\APT\condaenv\env.yml
@rl72, it looks like you may be trying to train DLC -- you might want to hold off for now as we will be trying some more testing and have some changes pending on the import and partially-labeled rows issue.
@allenleetc We are getting the same error after pulling the latest version of APT from GitHub. However, we are going to transition to Linux soon for a number of reasons, and will update you on our progress with that system. Thank you so much for assisting us!
OK, sounds good. I'm surprised you got the same error (did you remove and recreate the conda environment using the new yml?) -- but with your move to Linux this sounds moot.
@rl72 I pushed updates so that unlabeled landmarks in the DLC project will be considered "fully occluded" in APT and then passed along to the backend network (DLC or MDN). It may be worth giving things a try now -- as a first step, please pull the latest updates and update your conda environment according to the steps above.
We still have some Windows/conda testing pending. Also flagging @mkabra as it sounds like passing a large number of fully-occluded/missing landmarks to DLC wasn't something we had tested out heavily in the past.
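To make the partially-labeled rows issue concrete, here is a hedged sketch in plain Python of the two policies discussed in this thread: dropping any row with a missing landmark, versus keeping the row and flagging the missing landmarks as fully occluded. The data layout (a list of (x, y) tuples per frame, with None for an unlabeled point) and both function names are hypothetical illustrations, not APT's or DLC's actual representation.

```python
def drop_partial_rows(rows):
    """Policy implied by the 'Not including ... partially-labeled rows' warning:
    keep only rows in which every landmark is labeled."""
    return [row for row in rows if all(pt is not None for pt in row)]


def mark_occluded(rows):
    """Keep every row; return (points, occluded_flags) per row.

    Missing points get a NaN placeholder coordinate and an occluded flag of True.
    """
    out = []
    for row in rows:
        flags = [pt is None for pt in row]
        points = [pt if pt is not None else (float("nan"), float("nan")) for pt in row]
        out.append((points, flags))
    return out


# Three frames, three landmarks each; the third landmark (for example an
# external object) is labeled in only one frame.
rows = [
    [(10.0, 12.0), (20.0, 22.0), None],
    [(11.0, 13.0), (21.0, 23.0), (40.0, 41.0)],
    [(12.0, 14.0), (22.0, 24.0), None],
]

print(len(drop_partial_rows(rows)))                  # rows surviving the drop policy
print([flags for _, flags in mark_occluded(rows)])   # per-row occlusion flags
```

With a sparsely labeled landmark, the drop policy discards most of the dataset, while the occlusion policy keeps every frame and leaves it to the network to handle the flagged points.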
Hi @allenleetc, we pulled the latest updates and updated the conda environment too, yet I got the same error as before ("Training stopped after NaN/500000 iterations. Save trained model to file?"). I am sending the new log entries from the Training Monitor below.
Also, only the fully labeled frames were included for training, just like before. If there is anything else we can do on our side, please let me know. Thanks a lot for your quick updates and responsiveness!
2021-07-02 19:14:54.302963: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:
@rl72 Ah interesting ok. For the partially-labeled row issue, I should clarify you will need to run the createFromDeeplabcut script again to regenerate the APT project from the DLC project. If you didn't do that then that would explain why nothing changed.
The other error is independent and, I'm guessing, should be easy to resolve. I will get to it soon when I come back for the additional testing. You can probably hold off for now, since even if you regenerate the project you will still hit this error.
Thanks for your patience with this startup pain! Lots of active development recently and Windows also doesn't get as much testing.
Hey @rl72, I pushed a fix for this new/latest issue. With this fix, I am able to train DLC successfully in the new Conda environment. Let us know if you can get running! As I mentioned, in addition to the new Conda environment, please regenerate the APT project using createFromDeeplabcut, as I have updated that utility to include the partially-labeled rows.