kristinbranson / APT

Animal Part Tracker
GNU General Public License v3.0

"NaN" Training: git repo error? #365

Open PhillipsML opened 2 years ago

PhillipsML commented 2 years ago

Hi - I'm trying to train in APT and running into a weird issue: the training appeared to run with no errors in the log, but the loss graphs at the top had no data points. Upon further review, it appears the values are NaN, which could explain the blank graphs. Working my way up the log files, I found a "fatal: not a git repository" which leads to a "stopping at filesystem boundary". I'm not sure if this would cause the NaN issue, and I have spent some time verifying and re-verifying that the APT folder is indeed recognized by MATLAB as a git repository. I'm on Ubuntu 20.04, MATLAB 2021a. Docker backend: passes all APT tests for GPU access. I would appreciate any insight you have into this issue - really looking forward to getting APT up and running! I've attached the log files; as I mentioned, there were no error files. Best - Mary

LogCodes_NaN.odt
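
(For reference, one quick way to check from outside MATLAB whether a directory is seen as a git work tree; this is a generic sketch, not APT tooling, and the path below is a placeholder.)

```python
import subprocess

# Generic check (not APT code): ask git whether a directory is inside a
# work tree. Replace the placeholder path with your APT checkout.
apt_dir = "/path/to/APT"
result = subprocess.run(
    ["git", "-C", apt_dir, "rev-parse", "--is-inside-work-tree"],
    capture_output=True, text=True,
)
print(result.stdout.strip() or result.stderr.strip())
```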

allenleetc commented 2 years ago

Hey Mary, Good to hear from you! Hmm, that is strange. Yes we noticed the git repository message recently -- I am guessing this is not the cause of your issue though, as your optimization does start and run. This message looked a little concerning:

```
RuntimeWarning: Mean of empty slice
  label_mean = np.nanmean(val_dist)
```
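
(For context, this numpy warning fires when every value in the slice is NaN; a minimal illustration, not APT code:)

```python
import numpy as np

# Minimal illustration (not APT code): nanmean over an all-NaN slice emits
# "RuntimeWarning: Mean of empty slice" and returns nan.
val_dist = np.full(5, np.nan)
label_mean = np.nanmean(val_dist)
print(label_mean)  # nan
```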

Guessing you are using the MDN network? Are you able to share your project? I think just the .lbl file (without any movies) might be helpful. Mayank @mkabra any ideas?

PhillipsML commented 2 years ago

Hi Allen! Been a long time - hope all is well! I am using the MDN, but I recapitulated the same issue when I tried with DLC. Further detail: the training hangs at "building training dataset" for quite a long time before progressing to "training", and it will continue iterating until killed. Below is the link to the .lbl: https://maxplanckflorida-my.sharepoint.com/:u:/g/personal/phillipsm_mpfi_org/ESpOXtKt2qJCr8vJmcyhzJwB2YVg95wNgVW_AB5L18ObhA?e=fJQ2dO

allenleetc commented 2 years ago

Hey Mary, your link seems to require Max Planck credentials, does that sound right? I couldn't access the file.

Also, just FYI that HHMI has a holiday this coming week so we will be slower to respond.

PhillipsML commented 2 years ago

Ah - sorry about that. Better link below. And no worries about the holiday, appreciate any advice you guys can give https://drive.google.com/file/d/1VUQaVihKRe1OBwJVQXBXQzO3Q7ZL2t_0/view?usp=sharing

allenleetc commented 2 years ago

Huh, strange; so far I can't reproduce the NaN losses running with the latest code on develop. Will think some more, and Mayank will likely have ideas.

Slightly off topic but maybe relevant: your project cache suggests you have tried restarting the training (this would be via the blue "Restart" button on the Training Monitor). Did/do you press this button as part of your workflow?

allenleetc commented 2 years ago

@PhillipsML OK, a bit more digging, and so far we can't reproduce this issue with the cached state in the project. For the next iteration of debugging, any of the following would be helpful:

Thanks! Allen

PhillipsML commented 2 years ago

@allenleetc - I've re-pulled APT, so I recapitulated the error with the new updates:

  • I've tried with both the Linux terminal ("git clone") and with the MATLAB GitHub interface (this was when trying to get rid of the "not a git directory" error).
  • git describe: v2.0-3036-ga5329647
  • git status: Your branch is up to date with 'origin/develop'.
  • This drive has the files I've been working with: one project with one mouse and the other with 3. On the 3 mice, I've tried DLC and MDN to see if I can fix the error. With DLC, it did seem to train alright once, but has since begun hanging in the "building training image dataset" phase. MDN has not worked. I've tried to find the most helpful log files; I've been trying a lot of different variations to get around the issue. https://drive.google.com/drive/folders/1erJNK0Et3kdYDUguh3g1YyZOhy379O9b?usp=sharing
  • I'm not sure about the differences between the DLC and MDN training workflows, but when I was combing through the .apt log files I found a DLC training image set, whereas the MDN did not get to that stage.
  • I have restarted the training (in the OneMouse) when the log file failed to update for more than 5 minutes and the text suggested stopping and restarting. I have not restarted in some instances and still run into the same error.
  • The MATLAB command window output is saved in the 3 mouse folder.

I had APT working well on an older Linux machine, but since switching to our new data-processing computer I have had these issues... So excited to take our new GPU for a ride though! I really appreciate your help with this. I'll keep trying on my end to get around this; let me know if any additional files/information would be helpful for you.

Best - Mary

mkabra commented 2 years ago

Hi Mary,

Could you also add the trx file for the 3-mice project? We didn't realize that the project had trx files. In the meantime, I created a dummy trx file and was able to train an MDN tracker using the Docker backend without encountering any NaN issues. I also checked and trained using the training database file (train_TF.tfrecords) you sent, and that too looks fine.
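
(If it helps to double-check on your end, here is a minimal sketch, not APT's own tooling, that just verifies train_TF.tfrecords is readable by counting its records; the feature schema is APT-internal, so this does not parse the contents.)

```python
import tensorflow as tf

# Minimal sketch (not APT tooling): count records in a training database
# file to confirm it is readable. Assumes TF 2.x eager execution; under the
# TF 1.15 docker image you would use a session/iterator instead.
def count_records(path):
    n = 0
    for _ in tf.data.TFRecordDataset(path):
        n += 1
    return n

print(count_records("train_TF.tfrecords"))
```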

I'm at a loss why it works for us and not for you. Would it be possible to test the project on a different machine/GPU? What GPU are you using? I will also try to figure out some other ways to debug this.

Mayank


allenleetc commented 2 years ago

I wonder if GPU memory could be a factor. Mary have you ever tried reducing your 'Training batch size'? When setting Tracking Parameters, this is under DeepTrack>GradientDescent>Training batch size.

Just had an experience that may be relevant. I opened your OneMouse project and removed movies 2-5, as those were not in the Google Drive (this should be fine for testing). On my first train of MDN (on Docker/Ubuntu), I left all parameters unchanged. My GPU is an RTX 2080 Ti, which has 11GB of GPU RAM. The train hung on the "Building training database" stage, but without throwing any out-of-memory errors or the like. Everything just hung.

Then I ran the train directly from the commandline, still with batch size = 8. This time, I got out-of-memory errors! Not sure why it threw errors this time and not before.

Finally, I reduced the batch size to 2 within the APT GUI, and this time the train ran successfully. The Tracking Parameters resource estimator suggests that GPU memory could be an issue (for my card) with batch size set to 8. It is estimating a GPU memory requirement of ~19GB which would require a pretty chunky GPU card.

Another option to reduce the GPU memory requirement would be to increase the ImageProcessing > Downsample factor. Just some ideas, maybe worth trying if a resource constraint could be at play.
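
(One way to watch GPU memory while a train runs: a small generic helper, not part of APT, assuming nvidia-smi is on the PATH.)

```python
import subprocess, time

# Generic helper (not part of APT): poll nvidia-smi once per second and
# print used/total GPU memory, to see whether a train is close to the
# card's limit. Stop with Ctrl-C.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

while True:
    out = subprocess.run(QUERY, capture_output=True, text=True).stdout
    for i, line in enumerate(out.strip().splitlines()):
        used, total = (int(x) for x in line.split(","))
        print(f"GPU {i}: {used} / {total} MiB")
    time.sleep(1.0)
```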

PhillipsML commented 2 years ago

@allenleetc you may be on to something! I just set up a new .lbl to try some different parameters: I'm using DLC at the moment and it's running. I've checked my GPU usage and I'm about maxed out; my GeForce RTX 3090 has 24 GB. I wonder if, when I use the MDN without downsampling, I just hit a memory wall. I will try that next and see. @mkabra I've added the trx files for the 3-mouse project; the GPU is a GeForce RTX 3090. Let me look into the memory load of what I was trying to train and see if that's why I'm getting hung up at the building database stage. Thanks so much - I'll play some more and get back to you.

Hexusprime commented 2 years ago

Hi there! Nice to meet you all! :)

My name is Matthew, I’m part of the helpdesk staff here at MPFI.

Just wanted to add a bit on to what I've found regarding this current issue:

With a fresh install of Ubuntu 20.04, following the install instructions and then running the Docker backend test, everything tests well and it seems the computer should be ready to train.

[screenshot: Docker backend test output]

The issue arises when going to train: the training monitor appears, but sadly nothing ever happens afterwards.

I can confirm that the process has started by leaving a terminal window running in the background with the command "watch -d -n 0.5 nvidia-smi"; you can see the process appear and the GPU temperature/memory usage slowly go up.

[screenshot: nvidia-smi output showing the training process]

Unfortunately no blue line ever appears afterwards.

I've tried configuring the tracker to take smaller batches (2) and downsampling all the way to 4 at the same time, to see if there was some invisible limit being hit, but the same thing happens.

The training monitor will open and hang on the first job and never progress; despite iterations completing, I'll only ever see a single point, never a line.

[screenshot: Training Monitor stuck at a single data point]

Note that I've set this program up the same way on another machine using a GTX 1080, and that one starts training and works flawlessly with no issues; however, on this machine with the RTX 3090, it refuses to progress.

I don’t see any error messages in the training monitor either, and I’m currently using the latest pull of APT as well.

Mary has mentioned it's because it's training with NaNs, and when you go to stop training it even says as much:

[screenshot: stop-training dialog reporting NaN training loss]

Is there a way I could share some system logs or anything at all to help troubleshoot this? I’d also be open to zooming to show you the issue at hand if you’d be interested in taking a look.

Thank you so much for all your help, super excited to see the end results of using this powerful card in this process. :)

Matthew Morgan ITS Helpdesk Tech.

allenleetc commented 2 years ago

Hey Matthew,

Thanks for the detailed report! It certainly does look like you are getting NaNs during training. One way to confirm is to select "Show log files" in the Training Monitor and press "Go" to get a printout of the training log; if you can attach or cut and paste the entire log here, that would be useful.
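
(For a quick scan of a saved log, a generic snippet like this, not APT's own tooling, can flag any lines that mention NaN:)

```python
import re
import sys

# Generic sketch (not APT tooling): print every line of a training log that
# mentions NaN, with its line number. Assumes nothing about the log format.
def find_nan_lines(path):
    pattern = re.compile(r"\bnan\b", re.IGNORECASE)
    with open(path, errors="replace") as f:
        return [(i, line.rstrip()) for i, line in enumerate(f, 1)
                if pattern.search(line)]

for lineno, text in find_nan_lines(sys.argv[1]):
    print(f"{lineno}: {text}")
```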

Another useful thing would be to Save (or Save as...) the project after stopping the train (ideally, let it run for a brief while first). This is as prompted by the last dialog box in your report. Just the project (.lbl file) would be useful. Is the project basically similar to Mary's "OneMouse" and "ThreeMouse" projects she included above?

@mkabra This seems to be suggesting that the RTX 3090 is the culprit; do you agree? A quick search has turned up a bunch of similar reports.

Hexusprime commented 2 years ago

Hi Allen!

Thank you so much for the quick response! I'll get that running now and should have those files available for you tomorrow morning along with the log.

I'll also confirm with Mary tomorrow the similarity in the projects.

Thanks! Matthew

mkabra commented 2 years ago

Wow, TensorFlow continues to impress with how awfully it is managed! Reading online, it seems the 3090 is incompatible with all versions of TF earlier than 2.3. Updating TF in our workflow is no small task because we have to run tons of testing (which is something that TF apparently doesn't do enough of). So we won't be updating the image anytime soon.

Since you do have a powerful GPU that you would like to use, we would suggest using networks that are implemented in PyTorch. However, these networks are not available in the default "develop" branch of APT. You could use the latest "param" branch, but you need to understand the risks of using it. It is under heavy active development, and you are likely to encounter a lot of bugs. The networks themselves may also undergo changes, so if you ever pull a new version, it is likely that a previously trained tracker will not work and you might have to train again. It is extremely unlikely, but there could also be cases where you are not able to open a saved project. If that happens, we could help you recover the labels from the saved project, but it could take time.

The PyTorch-based networks you can use in the "param" branch are GRONe (recommended) and MSPN. You could try loading your old project and training one of these networks, but keep your old project around in case anything goes wrong. A suggestion is to keep saving projects at various stages (after labeling, after training) so that you can revert to an old version if something goes wrong.
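
(As a quick framework-level sanity check, independent of APT, a short PyTorch snippet can confirm the 3090 produces sane values under your driver setup:)

```python
import torch

# Quick GPU sanity check (not APT code): run a small matrix multiply on the
# GPU and confirm the result contains no NaNs.
assert torch.cuda.is_available(), "CUDA is not visible to PyTorch"
x = torch.randn(1024, 1024, device="cuda")
y = x @ x.t()
print("device :", torch.cuda.get_device_name(0))
print("any NaN:", torch.isnan(y).any().item())
```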

We apologize for this mess, and we have tried really hard to work around TF's brittleness but they still seem to manage to break our workflows. We have been moving to PyTorch and are adding better and faster networks. Hopefully, things will be more stable in the future.

HTH, Mayank


Hexusprime commented 2 years ago

Hi Mayank!

Thank you for the detailed reply! It's interesting to see how TensorFlow is affecting the process. I'll take a look at the param branch when we get a chance.

Per Allen's request, I'll still put up the log files for APT as well as the .lbl file, just in case anything can be learned and/or derived from them. Anything we can do to help or test, we'll do so gladly! :) I've also tossed in the log for MATLAB as well, if that is of any interest.

You should be able to just click this link and view/download anything in there you need, but let me know if it gives you any permission issues: https://maxplanckflorida-my.sharepoint.com/:f:/g/personal/morganm_mpfi_org/Eney4UbgsgdFjPB2NeFHvPwB4t6TB0-JOOOMQAc4yi4Xqw?e=xpCMsb

Thank you again for all your help, we'll give this a test and get back to you 👍

Matthew

mkabra commented 2 years ago

Yes, the training loss is NaN from the first iteration itself. It definitely looks like the TF bug that Allen had discovered: https://github.com/TRI-ML/KP2D/issues/20, https://forums.developer.nvidia.com/t/tensorflow-not-working-on-geforce-3090/166824/3
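
(For anyone who wants to confirm this outside of APT: a minimal check against the TF 1.x session API, on the assumption that the docker image ships TF 1.15, looks roughly like this; on an affected 3090 setup the result may come back True.)

```python
import tensorflow as tf

# Minimal check (not APT code), written against the TF 1.x session API on
# the assumption that the docker image uses TF 1.15. On an affected
# RTX 3090 + old-TF setup, the printed result may be True (NaNs produced).
x = tf.random.normal((1024, 1024))
y = tf.matmul(x, x, transpose_b=True)
has_nan = tf.reduce_any(tf.math.is_nan(y))

with tf.Session() as sess:  # use tf.compat.v1.Session() under TF 2.x
    print("any NaN:", sess.run(has_nan))
```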

Mayank


pascar4 commented 2 years ago

While using a 1080 Ti, the Docker job is created and then disappears in under a minute, cancelling the training (the backend test was successful), resulting in NaN/20000. I've checked the error log file and it is blank. Maybe this is the same issue with TF, or maybe a different issue? Any insight would help.

[screenshot: APT_NAN_Error]

mkabra commented 2 years ago

The 1080 Ti is a relatively old GPU and doesn't seem to be supported by CUDA 10.0, which is what is required for TF 1.15.

Mayank


evo11x commented 2 years ago

I have the same problem with Ray/PyTorch, CUDA 11.7, and an RTX 3060; it seems this problem comes from NVIDIA CUDA 11.x.