Training errors - Githubissues

rgilchri commented 1 year ago

I am using a computer with the following configurations to try and use APT to track the movement of dogs kenneled at an animal shelter.

NVIDIA GeForce RTX 2070
CUDA 11.6.2
Python 3.9.13
Tensorflow 1.15
Matlab R2022a

When I open the "Performance" tab in Task Manager and navigate to "GPU" it says

GPU memory: 24 GB (8GB dedicated, 16GB shared)
Hardware reserved memory: 161 MB

I’m using the “develop” branch of APT. I’m using a Local GPU backend, and tested the backend configuration with no issues (activated APT, found free GPUs). After labeling my frames in my test video, adjusting the tracking parameters to require 4.6 GB of GPU, and selecting MDN as the tracking algorithm, I clicked “Train.” The Training Monitor did appear, but no data points appeared and after a few minutes a popup appeared saying “Training stopped after NaN/60000 iterations. Save trained model to file?” I clicked “save” and went to the menu under the blank Training Monitor plot to see if I could get more information. Here’s what pressing “Go” on each of those options produced:

List all conda jobs: gives “no jobs running, no jobs queued”
Show training job's status: this shows “ID 5, started 31-Aug-2022 19:26:03: finished”
Show error messages: this shows “no error messages”
Show log files: this gives Job 1 :### C:\Users\arren\OneDrive\Documents.apt\tpc3156bc9_6d15_472e_9c69_c605383f4bea\Rachel tries again\20220831T192550view0_20220831T192559_new.log file does not exist

I then clicked “Stop Training” and went to Matlab to see what was produced there, and have entered the Matlab log below:

StartAPT;
Loading Java Customizations in UIExtrasTable.jar
Labeler GUI created.
Opening GUI took 10.447791 s
Untarring project into C:\Users\arren\OneDrive\Documents.apt\tpc3156bc9_6d15_472e_9c69_c605383f4bea ... done with untar.
Time to compute info statistic x = 0.111771
Warning: Label tags in previous frame not visualized.
Time to compute info statistic x = 0.035300
Time to compute info statistic x = 0.038273
Time to compute info statistic x = 0.032887
Finished loading project, took 18.517456 s.
Time to compute info statistic x = 0.040392
Time to compute info statistic x = 0.032052
Warning: Failed to update model iteration for model with net type mdn.
Warning: Failed to update model iteration for model with net type deeplabcut.
Tarring 5 model files into C:\Users\arren\OneDrive\Documents.apt\tpc3156bc9_6d15_472e_9c69_c605383f4bea
Project saved to C:\Users\arren\APT\Ethan test APT file.lbl
Warning: Failed to update model iteration for model with net type mdn.
Warning: Failed to update model iteration for model with net type deeplabcut.
Tarring 5 model files into C:\Users\arren\OneDrive\Documents.apt\tpc3156bc9_6d15_472e_9c69_c605383f4bea
Project saved to C:\Users\arren\APT\Ethan test APT file.lbl
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 6).
Training started at 31-Aug-2022 19:25:49...
Your deep net type is: mdn
Your training backend is: Conda
Your training vizualizer is: TrainMonitorViz
Tensorflow resnet pretrained weights http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC.tar.gz already downloaded.
Tensorflow resnet pretrained weights http://download.tensorflow.org/models/resnet_v1_50_2016_08_28.tar.gz already downloaded.
Adding 50 new rows to data...
ppdb addAndUpdate. 50/0/0 new/diff/same rows.
training with 50 rows.
training data summary:
Group (mov): 1. nfrm=50, nfrmlbled=50.
Stripped lbl preproc data cache: exporting 50/50 training rows.
Shuffling training rows. Your RNG seed is: 17
Saved stripped lbl file: C:\Users\arren\OneDrive\Documents.apt\tpc3156bc9_6d15_472e_9c69_c605383f4bea\Rachel tries again\20220831T192550_20220831T192559.lbl
Configuring background worker...
Warning: Increasing current parpool IdleTimeout to 6000 minutes.
activate APT&& set CUDA_DEVICE_ORDER=PCI_BUS_ID&& set CUDA_VISIBLE_DEVICES=0&& python "C:\Users\arren\APT\deepnet\APT_interface.py" -name 20220831T192550 -view 1 -cache "C:\Users\arren\OneDrive\Documents.apt\tpc3156bc9_6d15_472e_9c69_c605383f4bea" -err_file "C:\Users\arren\OneDrive\Documents.apt\tpc3156bc9_6d15_472e_9c69_c605383f4bea\Rachel tries again\20220831T192550view0_20220831T192559.err" -type mdn "C:\Users\arren\OneDrive\Documents.apt\tpc3156bc9_6d15_472e_9c69_c605383f4bea\Rachel tries again\20220831T192550_20220831T192559.lbl" train -use_cache > C:\Users\arren\OneDrive\Documents.apt\tpc3156bc9_6d15_472e_9c69_c605383f4bea\Rachel tries again\20220831T192550view0_20220831T192559_new.log 2>&1
Process job (movie 1, view 1) spawned, ID = 5:
No training jobs running. - Stopping.
Time to compute info statistic x = 0.039293
Time to compute info statistic x = 0.033359
Warning: Failed to update model iteration for model with net type mdn.
Warning: Failed to update model iteration for model with net type deeplabcut.
Tarring 4 model files into C:\Users\arren\OneDrive\Documents.apt\tpc3156bc9_6d15_472e_9c69_c605383f4bea
Project saved to C:\Users\arren\APT\Ethan test APT file.lbl
Directory C:\Users\arren\OneDrive\Documents.apt\tpc3156bc9_6d15_472e_9c69_c605383f4bea\Rachel tries again\mdn\view_0\20220831T192550 does not exist, creating.
Created KILLED token: C:\Users\arren\OneDrive\Documents.apt\tpc3156bc9_6d15_472e_9c69_c605383f4bea\Rachel tries again\mdn\view_0\20220831T192550\20220831T192559_new.KILLED.
Please wait for your training monitor to acknowledge the kill!

Although the Matlab log mentions an “err” file and a “log” file, these files do not exist in the generated folder “tpc3156bc9_6d15_472e_9c69_c605383f4bea” (I tried attaching the contents of this folder but it exceeds the 25MB upload limit). What jumps out at me in the Matlab log are the two lines that say “Warning: Failed to update model iteration for model with net type mdn/deeplabcut,” but I’m not sure what that means.

I would appreciate any guidance on this as I’m quite excited to see how APT could be used in an animal shelter setting! Thank you for reading this far, looking forward to troubleshooting together.
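
To check whether the training backend wrote any output at all, one option is to search the cache folder recursively for ".log" and ".err" files rather than browsing it by hand. The following is only a minimal Python sketch: the helper name find_backend_outputs is hypothetical and not part of APT, and the folder to search is whatever cache directory appears in your own Matlab log.

```python
from pathlib import Path


def find_backend_outputs(cache_dir, suffixes=(".log", ".err")):
    """Return a sorted list of files under cache_dir ending in one of suffixes.

    Hypothetical helper for illustration only; it is not part of APT.
    An empty list means no matching files were written anywhere below
    cache_dir (or that cache_dir itself does not exist).
    """
    root = Path(cache_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )
```

Calling find_backend_outputs on the cache folder named in the log and getting an empty list would confirm that the backend command never created its log or error file, which points at the launch step rather than at training itself.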

allenleetc commented 1 year ago

Hi @rgilchri! Sorry for the slow response.

I can see a couple things to try. The first is that I can try to reproduce the issue in the develop branch. If you open your project in APT and select File>Import/Export>Advanced>Export Training Data, you can export a MAT-file containing your project contents; I'm guessing this file will not be too large to zip and share here and then I can check things out.

Another option is to try out the latest version of APT in the multianimal branch. This option is a heavier lift, because on Windows you will need to set up a new Docker backend (still in beta, see below). The upside is that this code contains many updates, including a new default network (GRONe), and your issue may already be fixed with these updates.

To try this out, i) change your git branch to multianimal, e.g., via git checkout multianimal; ii) set up a Windows Docker backend as described in (beta)-Windows-&-Docker-on-WSL2-Setup-Instructions.

Let us know what you think! Happy to do a quick debug run on develop; then if you plan to use APT for at least a little while, the second option may be worth trying out in any case.

mkabra commented 1 year ago

@rgilchri, if the files are too large you can share them via Google Drive or other cloud storage service.

rgilchri commented 1 year ago

Thank you for the responses @allenleetc and @mkabra! It looks like I'm not able to attach the MAT-file (GitHub says file type not supported) but I've put that file as well as the files from my original comment into a Google Drive folder here: https://drive.google.com/drive/folders/1wz1zmPnOrBWSykFeIG0iTJhoYWzxCyrs?usp=sharing. I would really appreciate the quick debug option on the develop branch if possible before switching to multianimal, so please let me know if you need access to any other files in order to take a look!

mkabra commented 1 year ago

Hi @rgilchri, I was able to train a tracker on Linux using the files that you sent, so I think the issue is with the Conda environment or the file system on Windows.

Is C:\Users\arren\OneDrive\ local or on the cloud? If it is on the cloud, can you change the cache directory to point to a local directory and train again? To do this, first copy the Manifest.sample.txt file to Manifest.txt in the APT directory if Manifest.txt doesn't exist. Then change the "dltemproot,/path/to/dl/cachedir" entry to point to a local directory (e.g., "dltemproot,C:\Users\arren\APT_temp"). Make sure to create the directory if it doesn't exist. Once you have updated Manifest.txt, restart Matlab and train again.
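
The Manifest steps above can also be scripted. This is a rough Python sketch, not an APT utility: the function name set_dltemproot is made up for illustration, and it assumes Manifest.txt is a plain text file with one "key,value" entry per line, as the dltemproot example suggests.

```python
import shutil
from pathlib import Path


def set_dltemproot(apt_dir, cache_dir):
    """Point the dltemproot entry of Manifest.txt at cache_dir.

    Hypothetical helper, not part of APT. Assumes one "key,value"
    entry per line in the manifest.
    """
    apt_dir = Path(apt_dir)
    manifest = apt_dir / "Manifest.txt"
    if not manifest.exists():
        # Start from the sample file, as described in the steps above.
        shutil.copyfile(apt_dir / "Manifest.sample.txt", manifest)

    new_entry = f"dltemproot,{cache_dir}"
    lines = manifest.read_text().splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith("dltemproot,"):
            lines[i] = new_entry
            replaced = True
    if not replaced:
        # No existing entry: add one at the end.
        lines.append(new_entry)
    manifest.write_text("\n".join(lines) + "\n")

    # Create the cache directory if it does not exist yet.
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    return manifest
```

For example, set_dltemproot(r"C:\Users\arren\APT", r"C:\Users\arren\APT_temp") would apply the change suggested above; Matlab still needs to be restarted afterwards.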

allenleetc commented 1 year ago

Hey guys

I noticed something else that might be worth trying @rgilchri. Your project name is 'Rachel tries again', which contains spaces and can cause filesystem issues on Windows as @mkabra suggested. You can change this name to something that doesn't contain spaces:

lObj = StartAPT;
% Load your project in the GUI
lObj.projname = 'RachelTriesAgain';
% Save your project and try Training!

I experimented on Windows and was able to reproduce and then fix the problem in this way. Hope this gets you going; let us know!
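
For anyone who wants to check names before training, here is a small Python sketch of the idea. Both helpers are hypothetical and not part of APT; in APT itself the rename is done through lObj.projname as shown above. Note that in the logged backend command the redirect target after ">" contains the project name and appears unquoted, which may be why a space in the name breaks the launch on Windows.

```python
def has_whitespace(text):
    """Return True if text contains any whitespace character."""
    return any(ch.isspace() for ch in text)


def suggest_project_name(name):
    """Suggest a whitespace-free name by joining words in CamelCase.

    Hypothetical helper for illustration only. Only the first letter of
    each word is upper-cased; the rest of each word is left unchanged.
    """
    return "".join(word[:1].upper() + word[1:] for word in name.split())
```

With these definitions, suggest_project_name('Rachel tries again') gives 'RachelTriesAgain', the same name used in the Matlab snippet above.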

rgilchri commented 1 year ago

Thank you so much, this immediately fixed the issue! I am so appreciative of your support!

Best, Rachel

allenleetc commented 1 year ago

Sweet! Let us know how it goes! Allen

kristinbranson / APT

Training errors #405