kristinbranson / APT

Animal Part Tracker

Training stops at NaN/no_of_iterations #331

Closed dnaligase closed 4 years ago

dnaligase commented 4 years ago

Hi everyone,

Recently I've been using the createFromDeepLabCut() function to import DeepLabCut training data into APT. Once that's done, I set the tracking algorithm to DeepLabCut, select the local backend (it passes both tests), and configure the tracking parameters. However, after hitting 'Train' I end up with the dialog shown in the screenshot. I've also tried selecting MDN, but that didn't work either.

[Screenshot: train_stop]

Log:

Training started at 25-Apr-2020 21:28:12...
Your deep net type is: deeplabcut
Your training backend is: Conda
Your training vizualizer is: TrainMonitorViz

Tensorflow resnet pretrained weights http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC.tar.gz already downloaded.
Tensorflow resnet pretrained weights http://download.tensorflow.org/models/resnet_v1_50_2016_08_28.tar.gz already downloaded.
Warning: Not including 27 partially-labeled rows. 
Adding 131 new rows to data...
Training with 131 rows.
Training data summary:
Group (mov): 1. nfrm=40, nfrmlbled=40.
Group (mov): 2. nfrm=17, nfrmlbled=17.
Group (mov): 3. nfrm=16, nfrmlbled=16.
Group (mov): 4. nfrm=15, nfrmlbled=15.
Group (mov): 5. nfrm=6, nfrmlbled=6.
Group (mov): 6. nfrm=18, nfrmlbled=18.
Group (mov): 7. nfrm=19, nfrmlbled=19.
Stripped lbl preproc data cache: exporting 131/131 training rows.
Saved stripped lbl file: C:\Users\name\Documents\.apt\tpe52f8f43_ef67_4801_b7d1_14213b3d6b4f\test\20200425T212812_20200425T213006.lbl
Configuring background worker...
activate APT&& set CUDA_DEVICE_ORDER=PCI_BUS_ID&& set CUDA_VISIBLE_DEVICES=1&& python "D:\JAABA_projects\APT\deepnet\APT_interface.py" -name 20200425T212812 -view 1 -cache "C:\Users\name\Documents\.apt\tpe52f8f43_ef67_4801_b7d1_14213b3d6b4f" -err_file "C:\Users\name\Documents\.apt\tpe52f8f43_ef67_4801_b7d1_14213b3d6b4f\test\20200425T212812view0_20200425T213006.err" -type deeplabcut "C:\Users\name\Documents\.apt\tpe52f8f43_ef67_4801_b7d1_14213b3d6b4f\test\20200425T212812_20200425T213006.lbl" train -use_cache  > C:\Users\name\Documents\.apt\tpe52f8f43_ef67_4801_b7d1_14213b3d6b4f\test\20200425T212812view0_20200425T213006_new.log 2>&1
Process job (movie 1, view 1) spawned, ID = 6:

No training jobs running. - Stopping.
Time to compute info statistic dx = 0.009664

It's worth mentioning that creating a project from scratch with some frames labeled also didn't work. I'm on Windows 10 with MATLAB R2018b. I'd be grateful for any help.

Thank you, Georgy

allenleetc commented 4 years ago

Hi Georgy,

Nice to hear from you, and thanks for the detailed issue report. Here are some questions/steps to get us going with debugging; please let us know what you find.

  1. When you click Train, does the Training Monitor pop up? This is a window that displays a plot of the deep-learning loss (the plot will be blank to start). At the bottom there is a pulldown menu with various options: "List all conda jobs", "Show log files", "Show error messages". If you are able to press "Go" with these selected, there may be useful information for us.

  2. Towards the bottom of your command-window log there is a line that starts with activate APT&&.... This is the raw system command sent to Windows to run the training. I copied and pasted this line into an editor so I could see it better. In particular, two files are listed: i) an .err file after the -err_file option; and ii) a .log file at the end of the command. If those files exist, their contents could be useful (see the sketch at the end of this comment for one way to dump them). These are the same log files and error messages as in Step 1, so if you already got their contents you can skip ahead. Note that if you click Train again these files will be different, so make sure to grab the command line from the newest/latest run if you re-try things.

  3. We can try running the raw command directly in Windows. Open a Command Prompt (e.g. type 'cmd' in the Windows search bar). From the command prompt, try running this line:

activate APT&& set CUDA_DEVICE_ORDER=PCI_BUS_ID&& set CUDA_VISIBLE_DEVICES=1&& python "D:\JAABA_projects\APT\deepnet\APT_interface.py" -name 20200425T212812 -view 1 -cache "C:\Users\name\Documents\.apt\tpe52f8f43_ef67_4801_b7d1_14213b3d6b4f" -err_file "C:\Users\name\Documents\.apt\tpe52f8f43_ef67_4801_b7d1_14213b3d6b4f\test\20200425T212812view0_20200425T213006.err" -type deeplabcut "C:\Users\name\Documents\.apt\tpe52f8f43_ef67_4801_b7d1_14213b3d6b4f\test\20200425T212812_20200425T213006.lbl" train -use_cache

If this command runs and TensorFlow starts printing losses etc., then we know that the backend training works.

  4. Finally, a raw smoke-test command may be worth trying, again in a Windows Command Prompt:

activate APT&& set CUDA_DEVICE_ORDER=PCI_BUS_ID&& set CUDA_VISIBLE_DEVICES=1&& python "D:\JAABA_projects\APT\deepnet\APT_interface.py" lbl test hello
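
If the training command exits quickly, the .err and .log files named on the command line are the first places to look. A minimal way to dump them from the same Command Prompt, using the paths copied from your command line above (the timestamps in the filenames change with every training run, so grab the paths from your latest command):

REM show the error file and the training log for this run
type "C:\Users\name\Documents\.apt\tpe52f8f43_ef67_4801_b7d1_14213b3d6b4f\test\20200425T212812view0_20200425T213006.err"
type "C:\Users\name\Documents\.apt\tpe52f8f43_ef67_4801_b7d1_14213b3d6b4f\test\20200425T212812view0_20200425T213006_new.log"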

dnaligase commented 4 years ago

Dear Allen,

Thank you for your reply!

1-2. The Training Monitor pops up without plots, but I'm able to click 'Go'. It gives me:

No jobs running
No jobs queued

Moreover, the .apt\ directory contains this .log file among other files:

C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Traceback (most recent call last):
  File "D:\JAABA_projects\APT\deepnet\APT_interface.py", line 17, in <module>
    import PoseUNet_dataset as PoseUNet
  File "D:\JAABA_projects\APT\deepnet\PoseUNet_dataset.py", line 1, in <module>
    from PoseCommon_dataset import PoseCommon, PoseCommonMulti, PoseCommonRNN, PoseCommonTime, conv_relu3, conv_shortcut
  File "D:\JAABA_projects\APT\deepnet\PoseCommon_dataset.py", line 9, in <module>
    import PoseTools
  File "D:\JAABA_projects\APT\deepnet\PoseTools.py", line 13, in <module>
    import math, h5py
  File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\h5py\__init__.py", line 45, in <module>
    from . import h5a, h5d, h5ds, h5f, h5fd, h5g, h5r, h5s, h5t, h5p, h5z
ImportError: DLL load failed: The specified procedure could not be found.

and also this .err file:

2020-04-26 21:45:24,146 D:\JAABA_projects\APT\deepnet\APT_interface.py train [ERROR] Out of GPU Memory. Either reduce the batch size or scale down the image
Traceback (most recent call last):
  File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
    return fn(*args)
  File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[9,96,128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node resnet_v1_50/block2/unit_1/bottleneck_v1/conv3/Conv2D-1-1-TransposeNCHWToNHWC-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[{{node GroupCrossDeviceControlEdges_0/absolute_difference/weighted_loss/num_present/broadcast_weights/assert_broadcastable/AssertGuard/Assert/data_7}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\JAABA_projects\APT\deepnet\APT_interface.py", line 2050, in train
    train_deepcut(conf,args, split_file=split_file)
  File "D:\JAABA_projects\APT\deepnet\APT_interface.py", line 1999, in train_deepcut
    deepcut_train(conf,name=args.train_name)
  File "D:\JAABA_projects\APT\deepnet\deepcut\train.py", line 209, in train
    feed_dict={learning_rate: current_lr})
  File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
    run_metadata_ptr)
  File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
    run_metadata)
  File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[9,96,128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node resnet_v1_50/block2/unit_1/bottleneck_v1/conv3/Conv2D-1-1-TransposeNCHWToNHWC-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[{{node GroupCrossDeviceControlEdges_0/absolute_difference/weighted_loss/num_present/broadcast_weights/assert_broadcastable/AssertGuard/Assert/data_7}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

3. The output is rather bulky, so I'm attaching it as a file: log2.txt. I'd also like to mention that both GPUs on the machine have 8 GB, so I reassigned minFreeMem = 7000; in DLBackEndClass.m; is this valid?

4. This is what I got:

C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Using TensorFlow backend.
['lbl', 'test', 'hello']
Hello this is APT!

Since this is a shared machine, I didn't notice at first that there are two Anaconda installations; could this cause conflicts? Thank you for your support. I hope I'm not missing something obvious.

Kind regards, Georgy

allenleetc commented 4 years ago

Hmm, interesting. It looks like you are running out of GPU memory.

  1. What value did you reduce batch_size to (batch_size=2)? When you set the batch_size in the Tracking Parameters, was there a visualization/estimate of GPU memory usage in the right-hand pane? This should give you an estimate of the GPU memory required. On a related note, what is the image size of your movies?

  2. Re reassigning minFreeMem = 7000 in DLBackEndClass.m, yes this is a fine thing to do. I do this sometimes as well depending on the machine. (Maybe there is a way to make this friendlier in a future enhancement.)

Note that your log contains the line "totalMemory: 8.00GiB freeMemory: 6.61GiB", so you may effectively be running with a little less GPU memory than expected. (The sketch at the end of this comment shows a quick way to see what else is using the GPUs.)

  3. In addition to the batch_size, you can try setting the ImageProcessing > Downsample factor to downsample the images and further reduce the GPU memory footprint. You could try something biggish like 4 (or even 8) just as a test to see if you can get things running. Again, there is a visualization/estimate in the right-hand pane of the amount of GPU memory that will be required.

(In general it's a good idea to save the Project after tweaking any parameters so that changes are retained.)

  4. Re the two Anacondas, I am not sure but maybe it is okay. The error about "DLL load failed" on import of h5py is familiar though. Are you starting MATLAB from within a Windows Command Shell where the APT Conda environment (or any Conda environment) is activated? That has acted poorly for us in the past. If so, we currently suggest starting MATLAB e.g. via its Desktop icon or the Start Menu so that it runs outside of any Conda environment. (The sketch at the end of this comment includes a quick h5py import test you can run from a fresh Command Prompt.)

It is possible you have two separate issues: the h5py import issue and the out-of-memory issue. I'd suggest we solve the out-of-memory issue first and make sure we can get training going from the raw commands.
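
Two quick checks from a fresh Command Prompt (started outside of any activated Conda environment) may also help narrow things down. This is just a sketch; it assumes the NVIDIA driver utilities are on your PATH and that the Conda environment is named APT as in your log:

REM how much GPU memory is actually free right now, and what else is using the GPUs
nvidia-smi

REM does h5py import cleanly inside the APT environment (the "DLL load failed" check)
activate APT&& python -c "import h5py; print(h5py.__version__)"

If nvidia-smi shows other processes holding GPU memory, that could also explain seeing only 6.61 GiB free out of 8 GiB.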

dnaligase commented 4 years ago

Dear Allen,

Sorry for the delayed reply. The video frame size is 1024 x 768; I adjusted the batch_size and downsample factor to 2 (memory required: 2.7 GB) and 4, respectively. It seems to be running now. I can see this in the log file monitor:

2020-04-28 00:19:31,595 D:\JAABA_projects\APT\deepnet\deepcut\train.py train [INFO ] iteration: 0 loss: 0.0158 dist: 11244.30  lr: 0.005
2020-04-28 00:24:10,498 D:\JAABA_projects\APT\deepnet\deepcut\train.py train [INFO ] iteration: 50 loss: 0.0205 dist: 10595.11  lr: 0.005
2020-04-28 00:28:46,802 D:\JAABA_projects\APT\deepnet\deepcut\train.py train [INFO ] iteration: 100 loss: 0.0000 dist: 10667.67  lr: 0.005
2020-04-28 00:33:24,412 D:\JAABA_projects\APT\deepnet\deepcut\train.py train [INFO ] iteration: 150 loss: 0.0000 dist: 10591.25  lr: 0.005

It seems the issue is solved; it came down to insufficient free GPU memory. I'll come back once training finishes.

Thank you, Georgy

allenleetc commented 4 years ago

Cool, glad you are running. Let us know how it goes. Using a larger downsample factor can impact tracking accuracy, so if that is important it might be worth trying a smaller downsample value (at some point) to increase the working resolution of the network.
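
For a rough sense of the memory trade-off (back-of-the-envelope only, since the exact footprint also depends on the network, the batch size, and fixed costs such as the weights): activation memory in a convolutional network scales roughly with the number of input pixels, so for these 1024 x 768 movies

downsample 4: working size 256 x 192 (the setting that currently fits)
downsample 2: working size 512 x 384, roughly 4x the activation memory
downsample 1: working size 1024 x 768, roughly 16x the activation memory

so when moving to a smaller downsample factor it is worth re-checking the GPU memory estimate in the right-hand pane of the Tracking Parameters (and possibly dropping batch_size) before committing to a long run.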