Closed Junes94 closed 4 years ago
Hi June,
Did you recently pull/update your APT repo? And if so, was your previous successful training done before this code update?
My guess is that we need to update our conda image after some recent updates to DLC. Our mistake, this slipped through the cracks! I will do this soon. Let me know if you didn't do any updates, though, as then we should dig some more.
Thanks, Allen
Oh, I recently updated APT. I'll check whether it works well in the previous version. Thanks for your quick reply.
Cool, yes if you kept your old repo or happen to have an older repo on your local machine it may be worth trying it as this may get you going right away. (It looks like we need to use a version from 357c49b56b5797691363cd0f2e00b6301abd20fb or earlier.)
That said, we still need to update the conda environment in any case so I will do that soon and let you know.
Hi, Allen. I tried training and tracking again with the 357c49b version you mentioned. Training completed successfully. Unfortunately, tracking showed the same issue I posted earlier: it didn't do anything and just stopped at '0/500 frames tracked'. Do you have any idea?
Thanks, June.
Hi June,
I think I reproduced your error and pushed a fix! Please pull 49c43c082faf20b679bf99f64c3962b572723cae and try to track again. (After pulling the new code, it is safest to close your MATLAB and restart).
In the meantime I will still update the conda image so we can run with the latest.
Let me know if this gets you going. Thanks, Allen
Hi Allen,
Thanks for your comment. The 49c43c0 version tracked well! However, the 49c43c0 version did not train. Here's the log.
Training started at 24-Sep-2020 20:14:41...
Your deep net type is: deeplabcut
Your training backend is: Conda
Your training vizualizer is: TrainMonitorViz
Training new model 20200924T201441.
Tensorflow resnet pretrained weights http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC.tar.gz already downloaded.
Tensorflow resnet pretrained weights http://download.tensorflow.org/models/resnet_v1_50_2016_08_28.tar.gz already downloaded.
Training with 27 rows.
Training data summary:
Group (mov): 1. nfrm=27, nfrmlbled=27.
Stripped lbl preproc data cache: exporting 27/27 training rows.
Saved stripped lbl file: C:\Users\MyPC\Documents\.apt\tp2fae7f9f_54d8_42e2_90a2_8e136b36553b\raw1_기존\20200924T201441_20200924T201441.lbl
Configuring background worker...
activate APT&& set CUDA_DEVICE_ORDER=PCI_BUS_ID&& set CUDA_VISIBLE_DEVICES=0&& python "C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\APT_interface.py" -name 20200924T201441 -view 1 -cache "C:\Users\MyPC\Documents\.apt\tp2fae7f9f_54d8_42e2_90a2_8e136b36553b" -err_file "C:\Users\MyPC\Documents\.apt\tp2fae7f9f_54d8_42e2_90a2_8e136b36553b\raw1_기존\20200924T201441view0_20200924T201441.err" -type deeplabcut "C:\Users\MyPC\Documents\.apt\tp2fae7f9f_54d8_42e2_90a2_8e136b36553b\raw1_기존\20200924T201441_20200924T201441.lbl" train -use_cache > C:\Users\MyPC\Documents\.apt\tp2fae7f9f_54d8_42e2_90a2_8e136b36553b\raw1_기존\20200924T201441view0_20200924T201441_new.log 2>&1
Process job (movie 1, view 1) spawned, ID = 6:
Time to compute info statistic dx = 0.002138
Error occurred during train:
### C:\Users\MyPC\Documents\.apt\tp2fae7f9f_54d8_42e2_90a2_8e136b36553b\raw1_기존\20200924T201441view0_20200924T201441.err
2020-09-24 20:14:56,826 C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\APT_interface.py main [ERROR] UNKNOWN: APT_interface errored
Traceback (most recent call last):
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
return fn(*args)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node resnet_v1_50/conv1/Conv2D}}]]
[[{{node add}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\APT_interface.py", line 2571, in main
run(args)
File "C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\APT_interface.py", line 2320, in run
train(lbl_file, nviews, name, args)
File "C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\APT_interface.py", line 2164, in train
train_deepcut(conf,args, split_file=split_file)
File "C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\APT_interface.py", line 2113, in train_deepcut
deepcut_train(conf,name=args.train_name)
File "C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\deepcut\train.py", line 224, in train
feed_dict={learning_rate: current_lr})
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
run_metadata_ptr)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
run_metadata)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node resnet_v1_50/conv1/Conv2D (defined at C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\deepcut\nnet\pose_net.py:74) ]]
[[node add (defined at C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\deepcut\nnet\pose_net.py:137) ]]
Caused by op 'resnet_v1_50/conv1/Conv2D', defined at:
File "C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\APT_interface.py", line 2577, in <module>
main(sys.argv[1:])
File "C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\APT_interface.py", line 2571, in main
run(args)
File "C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\APT_interface.py", line 2320, in run
train(lbl_file, nviews, name, args)
File "C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\APT_interface.py", line 2164, in train
train_deepcut(conf,args, split_file=split_file)
File "C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\APT_interface.py", line 2113, in train_deepcut
deepcut_train(conf,name=args.train_name)
File "C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\deepcut\train.py", line 186, in train
losses = net.train(batch)
File "C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\deepcut\nnet\pose_net.py", line 112, in train
heads = self.get_net(batch[Batch.inputs])
File "C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\deepcut\nnet\pose_net.py", line 101, in get_net
net, end_points = self.extract_features(inputs)
File "C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\deepcut\nnet\pose_net.py", line 74, in extract_features
global_pool=False, output_stride=16,is_training=False)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\contrib\slim\python\slim\nets\resnet_v1.py", line 274, in resnet_v1_50
scope=scope)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\contrib\slim\python\slim\nets\resnet_v1.py", line 205, in resnet_v1
net = resnet_utils.conv2d_same(net, 64, 7, stride=2, scope='conv1')
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\contrib\slim\python\slim\nets\resnet_utils.py", line 146, in conv2d_same
scope=scope)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 1155, in convolution2d
conv_dims=2)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 182, in func_with_args
return func(*args, **current_args)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 1058, in convolution
outputs = layer.apply(inputs)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1227, in apply
return self.__call__(inputs, *args, **kwargs)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\layers\base.py", line 530, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 554, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\keras\layers\convolutional.py", line 194, in call
outputs = self._convolution_op(inputs, self.kernel)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 966, in __call__
return self.conv_op(inp, filter)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 591, in __call__
return self.call(inp, filter)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\ops\nn_ops.py", line 208, in __call__
name=self.name)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\ops\gen_nn_ops.py", line 1026, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op
op_def=op_def)
File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node resnet_v1_50/conv1/Conv2D (defined at C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\deepcut\nnet\pose_net.py:74) ]]
[[node add (defined at C:\Users\MyPC\Desktop\APT-49c43c082faf20b679bf99f64c3962b572723cae\deepnet\deepcut\nnet\pose_net.py:137) ]]
. You may need to manually kill any running DeepLearning process.
I can train with the 357c49b version and track with the 49c43c0 version, but I wanted to let you know about the bug in case it's helpful. Thank you.
Hey June,
So far I can't reproduce this issue with 49c43c082faf20b679bf99f64c3962b572723cae. I opened a test project and switched between Tracking and Training a few times and so far it has been working successfully.
That said, I have seen your error before and sometimes it seems to relate to deep-learning processes not getting cleared out of memory, various libraries (TensorFlow, CUDA etc) getting into a bad state, and so on. (Here's a thread with similar observations: https://stackoverflow.com/questions/53698035/failed-to-get-convolution-algorithm-this-is-probably-because-cudnn-failed-to-in)
Sometimes with my Linux machine I actually just reboot it and then things get working again. Could you give this a try? I know it's crude but a lot of the time this gets me going again.
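A less drastic check than a reboot is to look for leftover Python processes that may still be holding the GPU. The following is a minimal sketch, not part of APT: the helper name `python_pids` and the sample text are hypothetical, and it assumes the input has the quoted CSV layout that the Windows command `tasklist /FO CSV /NH` prints (image name, PID, session name, session number, memory usage).

```python
import csv
import io


def python_pids(tasklist_csv):
    """Return the PIDs of python.exe rows in `tasklist /FO CSV /NH` style text.

    Hypothetical helper for illustration only; it is not an APT function.
    """
    pids = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Each row is: image name, PID, session name, session number, memory usage
        if len(row) >= 2 and row[0].lower() == "python.exe":
            pids.append(int(row[1]))
    return pids


# Made-up sample text in the assumed format (not taken from the logs above)
sample = (
    '"python.exe","4321","Console","1","1,234,567 K"\n'
    '"MATLAB.exe","1000","Console","1","900,000 K"\n'
)
print(python_pids(sample))
```

On a Windows machine the text could be captured with `subprocess.run(["tasklist", "/FO", "CSV", "/NH"], capture_output=True, text=True).stdout`, and a stale process could then be ended with `taskkill /PID <pid> /F`. Check that a PID really belongs to a stuck training or tracking job before killing it.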
Hey June,
I also just pushed an update to the conda environment-- this is in 679c17f5e5a4331d4e398a23d2df57c29c01a055. This has not been comprehensively tested yet, but I trained and tracked with DLC on my Windows machine successfully.
To use the update, pull the latest from GitHub, then follow the instructions at https://github.com/kristinbranson/APT/wiki/Windows-&-Conda-Setup. Note, you will probably need to use the --force flag since you will be overwriting your old Conda environment.
Hopefully one of these solutions will work for you!
develop should work.
Hi Allen,
I used the 679c17f version and followed your instructions at https://github.com/kristinbranson/APT/wiki/Windows-&-Conda-Setup. But it didn't fix my error, and neither did limiting the GPU memory allocation. So I uninstalled CUDA, cuDNN, Python, and APT and reinstalled them. It seemed to work well, but when I increased the number of frames to be tracked, the error appeared again: tracking repeatedly stopped at 399/(the number of frames I set). Interestingly, I fixed my problem by changing line 2273 of 'APT-develop\deepnet\APT_interface.py' from
if cur_b % 400 == 399:
to
if cur_b % 10000 == 9999:
After this change, I could track about 10000 frames successfully, and the trk file also worked in JAABA. I was wondering why that line uses the specific number 400, and whether there is any problem with arbitrarily setting it to 10000 or some other value.
Thanks for your kind reply. Junesu
Hey June,
Great sounds like we are making some progress. Yes you are right about the code -- setting that number to 10000 should be fine, there is nothing special about the number 400. This number controls how frequently the Tracking Monitor updates and is not critical to the core tracking process itself.
I cannot reproduce this issue on my more powerful desktop machine (on Linux; but the codepath should be the same), but on my Windows laptop I do see this some of the time. If it is convenient it may be interesting to know your hardware specifications as it may be a resource issue. One option might be for us to make the Monitor updates configurable or less frequent etc so there is less overhead on the machine.
Couple side questions:
Thanks, Allen
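The effect of the modulus in the `if cur_b % 400 == 399:` line discussed above can be sketched as follows. This is a simplified illustration, not APT's actual tracking loop: the function name `intermediate_updates` and the parameter `update_every` are hypothetical, and the sketch assumes `cur_b` is a zero-based batch counter.

```python
def intermediate_updates(n_batches, update_every=400):
    """Return the batch counts at which a periodic monitor update would fire.

    Hypothetical sketch; only the `cur_b % N == N - 1` test comes from the thread.
    """
    fired = []
    for cur_b in range(n_batches):
        # (a real loop would run the network on batch cur_b here)
        if cur_b % update_every == update_every - 1:
            # Every update_every-th batch: report progress to the monitor
            fired.append(cur_b + 1)
    return fired


print(intermediate_updates(1000, update_every=400))    # [400, 800]
print(intermediate_updates(1000, update_every=10000))  # []
```

With a large interval and a short run the condition never becomes true, so under these assumptions no intermediate update happens and results appear only when the loop finishes.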
Yes, I agree it's a resource issue, because the number of frames to be tracked mattered in my bug. Here are my device specifications:
CPU: Intel i7-9700K
RAM: 16GB
OS: Windows 10, x64
GPU: NVIDIA GeForce RTX 2080 Ti
And for your questions: 1) I actually selected 8000 frames in the timeline. I changed the Neighborhood Radius to 4000 and used 'Within 4000 frames of current frame' at the 6000th frame, so I tracked frames 2000 to 10000. 2) There was no update during tracking. As you said, because I set the cur_b value to 10000, the trk did not seem to be updated or restored until the process ended.
Thank you, Junesu
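The frame range quoted in that reply follows from simple arithmetic on the current frame and the radius. A minimal sketch, assuming 1-based frame numbers and clamping to the movie length (32000 frames in this project); the function name `frames_within` is hypothetical and this is not how APT itself computes the range:

```python
def frames_within(current, radius, n_frames):
    """Return (first, last) frame of a window around `current`, clamped to the movie.

    Hypothetical helper; assumes frames are numbered 1..n_frames.
    """
    first = max(1, current - radius)
    last = min(n_frames, current + radius)
    return first, last


first, last = frames_within(6000, 4000, 32000)
print(first, last, last - first)  # 2000 10000 8000
```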
Hey Junesu,
You know what, my mistake, I think this is a regular bug specific to Windows. Your machine seems pretty powerful so it is probably not that. In your other bug report #342, had you already made the change to 10,000 in the "if cur_b" line of code?
OK so I believe we have resolved one set of issues related to the Conda environment being out-of-date. The remaining issue is a regular (pretty simple-looking) bug specific to Windows. I will push a fix for this soon and will message on the other issue if that works. I think there may be another issue or two I will fix at the same time.
Thanks for the reports! Allen
Hi Allen, Yes, I hadn't changed the "if cur_b" line of code in report #342. Thanks as always for your effort fixing bugs.
JuneSu
This is hopefully resolved, JuneSu please feel free to re-open or comment. I will leave #342 open but it should be fixed as well.
Hi, I have an unfamiliar issue with my project in APT. I am using Windows 10 and MATLAB 2019a, and my movie is 2.7GB (32000 frames in total).
First, I successfully loaded the movie in APT and trained a tracker with DLC. I also confirmed that the tracker worked well on 500 frames of the movie. However, after I closed my project and MATLAB to do other work, the issue appeared.
I reopened APT and tried to track other frames of the same movie, but the Tracking Monitor just stopped and did nothing, as shown below.
Here are the MATLAB logs.
After these commands, nothing happens. I also tried to train a tracker again: I created a new project, loaded the same movie, and trained again. And here's the training window screenshot. (It is probably not caused by a lack of memory, because I lowered the batch size and I have much more memory than required. Also, I passed the backend configuration.)
Here's my training window log file.
And here's my MATLAB log window.
It's very weird, because it happens with all of my movies, so it seems my APT directory has some error. I have no idea why APT abruptly has this issue after working perfectly.
I would really appreciate any help. Thank you.