Closed: dnaligase closed this issue 4 years ago
Hi Georgy,
Nice to hear from you, and thanks for the detailed issue report. Here are some questions/steps to get us going with debugging; please let us know what you find.
1. When you click Train, does the Training Monitor pop up? This is a window that displays a plot of the deep-learning loss (the plot will be blank to start). At the bottom there is a pulldown menu with various options: "List all conda jobs", "Show log files", "Show error messages". If you are able to press "Go" with these selected, there may be useful information for us.
2. Towards the bottom of your command-window log, there is a line that starts with activate APT &&... This is the raw system command sent to Windows to run the training. I cut-and-pasted this line into an editor so I could see it better. In particular, two files are listed: i) an .err file after the -err_file option; and ii) a .log file at the end of the command. If those files exist, their contents could be useful. (These are the same log files and error messages from Step 1, so if you already got their contents you can skip ahead.) Note that if you click Train again, these files will be different, so make sure to grab the command line from the newest/latest run if you re-try things.
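If eyeballing the long command is error-prone, the file paths can also be pulled out programmatically; a minimal sketch, assuming the command line has been pasted into a Python string (the command below is illustrative, not your actual run):

```python
# Extract the .err path (after -err_file) and any trailing .log path
# from a pasted APT training command. The command string is illustrative.
import re

cmd = ('activate APT&& python "D:\\APT\\deepnet\\APT_interface.py" '
       '-err_file "C:\\Users\\name\\.apt\\run.err" train -use_cache')

err = re.search(r'-err_file\s+"([^"]+)"', cmd)
log = re.search(r'"([^"]+\.log)"', cmd)
print(err.group(1) if err else "no -err_file option found")
print(log.group(1) if log else "no .log path found")
```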
3. We can try running the raw command directly in Windows. Open a Command Prompt (e.g. type 'cmd' in the Windows search bar). From the command prompt, try running this line:
activate APT&& set CUDA_DEVICE_ORDER=PCI_BUS_ID&& set CUDA_VISIBLE_DEVICES=1&& python "D:\JAABA_projects\APT\deepnet\APT_interface.py" -name 20200425T212812 -view 1 -cache "C:\Users\name\Documents\.apt\tpe52f8f43_ef67_4801_b7d1_14213b3d6b4f" -err_file "C:\Users\name\Documents\.apt\tpe52f8f43_ef67_4801_b7d1_14213b3d6b4f\test\20200425T212812view0_20200425T213006.err" -type deeplabcut "C:\Users\name\Documents\.apt\tpe52f8f43_ef67_4801_b7d1_14213b3d6b4f\test\20200425T212812_20200425T213006.lbl" train -use_cache
If this command runs and tensorflow starts printing losses etc, then we know that the backend training works.
4. As a basic sanity check of the Python setup (this should just print a greeting), you can also try:
activate APT&& set CUDA_DEVICE_ORDER=PCI_BUS_ID&& set CUDA_VISIBLE_DEVICES=1&& python "D:\JAABA_projects\APT\deepnet\APT_interface.py" lbl test hello
Dear Allen,
Thank you for your reply!
1-2. The Training Monitor pops up without plots, but I'm able to click 'Go'. It gives me:
No jobs running
No jobs queued
Moreover, the .apt\ directory contains this .log file among other files:
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Traceback (most recent call last):
File "D:\JAABA_projects\APT\deepnet\APT_interface.py", line 17, in <module>
import PoseUNet_dataset as PoseUNet
File "D:\JAABA_projects\APT\deepnet\PoseUNet_dataset.py", line 1, in <module>
from PoseCommon_dataset import PoseCommon, PoseCommonMulti, PoseCommonRNN, PoseCommonTime, conv_relu3, conv_shortcut
File "D:\JAABA_projects\APT\deepnet\PoseCommon_dataset.py", line 9, in <module>
import PoseTools
File "D:\JAABA_projects\APT\deepnet\PoseTools.py", line 13, in <module>
import math, h5py
File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\h5py\__init__.py", line 45, in <module>
from . import h5a, h5d, h5ds, h5f, h5fd, h5g, h5r, h5s, h5t, h5p, h5z
ImportError: DLL load failed: The specified procedure could not be found.
and also this .err file:
2020-04-26 21:45:24,146 D:\JAABA_projects\APT\deepnet\APT_interface.py train [ERROR] Out of GPU Memory. Either reduce the batch size or scale down the image
Traceback (most recent call last):
File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
return fn(*args)
File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[9,96,128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node resnet_v1_50/block2/unit_1/bottleneck_v1/conv3/Conv2D-1-1-TransposeNCHWToNHWC-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node GroupCrossDeviceControlEdges_0/absolute_difference/weighted_loss/num_present/broadcast_weights/assert_broadcastable/AssertGuard/Assert/data_7}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\JAABA_projects\APT\deepnet\APT_interface.py", line 2050, in train
train_deepcut(conf,args, split_file=split_file)
File "D:\JAABA_projects\APT\deepnet\APT_interface.py", line 1999, in train_deepcut
deepcut_train(conf,name=args.train_name)
File "D:\JAABA_projects\APT\deepnet\deepcut\train.py", line 209, in train
feed_dict={learning_rate: current_lr})
File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
run_metadata_ptr)
File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
run_metadata)
File "C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[9,96,128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node resnet_v1_50/block2/unit_1/bottleneck_v1/conv3/Conv2D-1-1-TransposeNCHWToNHWC-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node GroupCrossDeviceControlEdges_0/absolute_difference/weighted_loss/num_present/broadcast_weights/assert_broadcastable/AssertGuard/Assert/data_7}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
3. The output is bulky, so I attach it as a link: log2.txt. I'd like to mention that both GPUs on the machine have 8 GB, so I reassigned minFreeMem = 7000; in DLBackEndClass.m. Is this valid?
4. This is what I got:
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\name\Anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Using TensorFlow backend.
['lbl', 'test', 'hello']
Hello this is APT!
As this is a shared machine, I didn't notice at first that there are two Anaconda installations; could that cause conflicts? Thank you for your support. I hope I'm not missing something obvious.
Kind regards, Georgy
Hmm, interesting. It looks like you are running out of GPU memory.
What value did you reduce batch_size to: batch_size=2? When you set the batch_size in the Tracking Parameters, was there a visualization/estimate of GPU memory usage in the right-hand pane? That should give you an estimate of the GPU memory required. On a related note, what is the image size of your movies?
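For a sense of scale, the single tensor named in your OOM message is already sizable; a rough back-of-envelope check (assuming float32, which is what TensorFlow's "type float" means here):

```python
# Size of the tensor in the OOM error: shape [9, 96, 128, 512], float32.
batch, h, w, c = 9, 96, 128, 512
size_gib = batch * h * w * c * 4 / 2**30  # 4 bytes per float32 element
print(f"{size_gib:.2f} GiB")  # ~0.21 GiB for this one activation tensor
```

Training keeps many such activations (plus gradients) alive at once, which is why reducing batch_size or downsampling the images helps: every one of those tensors shrinks proportionally.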
Re: reassigning minFreeMem = 7000 in DLBackEndClass.m: yes, this is a fine thing to do. I do this myself sometimes, depending on the machine. (Maybe there is a way to make this friendlier in a future enhancement.)
Note that your log contains the line "totalMemory: 8.00GiB freeMemory: 6.61GiB", so you may effectively be running with a little less GPU memory than expected.
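To see the actual free memory right before launching a run, nvidia-smi can be queried directly; a small sketch (it assumes the NVIDIA driver tools are on PATH, and falls back gracefully if not):

```python
# Report free memory per GPU via nvidia-smi (requires NVIDIA driver tools).
import shutil
import subprocess

exe = shutil.which("nvidia-smi")
if exe is None:
    print("nvidia-smi not found on PATH")
else:
    out = subprocess.run(
        [exe, "--query-gpu=index,memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    print(out.stdout)  # MiB free per GPU; compare against minFreeMem (in MB)
```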
(In general it's a good idea to save the Project after tweaking any parameters so that changes are retained.)
It is possible you have two separate issues: the h5py import issue and the out-of-memory issue. I'd suggest we solve the out-of-memory issue first and make sure we can get training going from the raw commands.
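On the possible h5py issue: a "DLL load failed" on Windows is often a PATH conflict, e.g. a second Anaconda install shadowing the env's HDF5 DLLs. A quick, purely diagnostic check, run inside the activated APT env:

```python
# Check which Python the APT env resolves to, and list conda-related PATH
# entries; any entry from a *different* Anaconda install is a conflict suspect.
import os
import sys

print(sys.executable)  # should live under ...\envs\APT\
for entry in os.environ.get("PATH", "").split(os.pathsep):
    if "conda" in entry.lower():
        print(entry)
```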
Dear Allen,
Sorry for the delayed replies. The video image size is 1024 x 768; I adjusted batch_size to 2 and downsample_factor to 4 (memory required: 2.7 GB). It seems to be running now. I can see in the log-file monitor:
2020-04-28 00:19:31,595 D:\JAABA_projects\APT\deepnet\deepcut\train.py train [INFO ] iteration: 0 loss: 0.0158 dist: 11244.30 lr: 0.005
2020-04-28 00:24:10,498 D:\JAABA_projects\APT\deepnet\deepcut\train.py train [INFO ] iteration: 50 loss: 0.0205 dist: 10595.11 lr: 0.005
2020-04-28 00:28:46,802 D:\JAABA_projects\APT\deepnet\deepcut\train.py train [INFO ] iteration: 100 loss: 0.0000 dist: 10667.67 lr: 0.005
2020-04-28 00:33:24,412 D:\JAABA_projects\APT\deepnet\deepcut\train.py train [INFO ] iteration: 150 loss: 0.0000 dist: 10591.25 lr: 0.005
It seems the problem is solved; the issue was insufficient free GPU memory. I'll report back when training finishes.
Thank you, Georgy
Cool, glad you are up and running. Let us know how it goes. Using a larger downsample factor can impact tracking accuracy, so if accuracy is important it might be worth trying a smaller downsample value (at some point) to increase the working resolution of the network.
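To make that trade-off concrete: activation memory scales roughly with pixel count, so each doubling of the downsample factor cuts memory about 4x while halving the working resolution in each dimension. A toy illustration using the 1024 x 768 frame size mentioned above:

```python
# Rough memory-vs-resolution trade-off for downsampling (illustrative only).
width, height = 1024, 768  # frame size from the project above
for factor in (1, 2, 4):
    w, h = width // factor, height // factor
    rel_mem = 1.0 / factor**2  # activations shrink ~quadratically
    print(f"downsample {factor}: {w}x{h}, ~{rel_mem:.2f}x activation memory")
```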
Hi everyone,
Recently, I've been using the createFromDeepLabCut() function to export DeepLabCut training data to APT. Once done, I set the tracking algorithm to DeepLabCut, select the local backend (it passes both tests), and configure the tracking parameters. However, after hitting 'Train' I end up with the dialog shown in the screenshot. I've also tried selecting MDN, but nothing worked. Log:
Worth mentioning that creating a project from scratch with some frames labeled also didn't work. I'm running Windows 10, and my MATLAB version is R2018b. I'd be grateful for any help.
Thank you, Georgy