Training stops after additional labeling

Junes94 commented 4 years ago

Hi, everyone. I have multiview video (2 cameras), 8 sets in total, in my project, and I've run into a problem.

First, I labeled about 100 frames across all movies (200 frames in total from both viewpoints), and MDN training worked well (iteration=10000). After tracking my movies, I wanted to label additional frames, so I labeled about 330 frames (660 in total). However, when I clicked Train, a dialog popped up after a few minutes, as shown below. [Screenshot: Annotation 2020-07-01 000347]

The Training Monitor said:

No jobs running 
No jobs queued

Here's my log:

Training started at 01-Jul-2020 00:02:24...
Your deep net type is: mdn
Your training backend is: Conda
Your training vizualizer is: TrainMonitorViz

Training new model 20200701T000224.
Tensorflow resnet pretrained weights http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC.tar.gz already downloaded.
Tensorflow resnet pretrained weights http://download.tensorflow.org/models/resnet_v1_50_2016_08_28.tar.gz already downloaded.
Training with 328 rows.
Training data summary:
Group (mov): 1. nfrm=46, nfrmlbled=46.
Group (mov): 2. nfrm=51, nfrmlbled=51.
Group (mov): 3. nfrm=50, nfrmlbled=50.
Group (mov): 4. nfrm=34, nfrmlbled=34.
Group (mov): 5. nfrm=51, nfrmlbled=51.
Group (mov): 6. nfrm=40, nfrmlbled=40.
Group (mov): 7. nfrm=56, nfrmlbled=56.
Stripped lbl preproc data cache: exporting 328/328 training rows.
Saved stripped lbl file: C:\Users\MyPC\Documents.apt\tp348c88a6_4d07_4710_ab57_ab64622e6906\APTproject\20200701T000224_20200701T000224.lbl
Configuring background worker...
activate APT&& set CUDA_DEVICE_ORDER=PCI_BUS_ID&& set CUDA_VISIBLE_DEVICES=0&& python "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\APT_interface.py" -name 20200701T000224 -cache "C:\Users\MyPC\Documents.apt\tp348c88a6_4d07_4710_ab57_ab64622e6906" -err_file "C:\Users\MyPC\Documents.apt\tp348c88a6_4d07_4710_ab57_ab64622e6906\APTproject\20200701T000224_20200701T000224.err" -type mdn "C:\Users\MyPC\Documents.apt\tp348c88a6_4d07_4710_ab57_ab64622e6906\APTproject\20200701T000224_20200701T000224.lbl" train -use_cache > C:\Users\MyPC\Documents.apt\tp348c88a6_4d07_4710_ab57_ab64622e6906\APTproject\20200701T000224_20200701T000224_new.log 2>&1
Process job (movie 1, view 1) spawned, ID = 8:

Time to compute info statistic dx = 0.000985
Error occurred during train:
C:\Users\MyPC\Documents.apt\tp348c88a6_4d07_4710_ab57_ab64622e6906\APTproject\20200701T000224_20200701T000224.err

2020-07-01 00:02:34,801 C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\APT_interface.py main [ERROR] UNKNOWN: APT_interface errored
Traceback (most recent call last):
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\ops.py", line 1659, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid reduction dimension 1 for input with 1 dimensions. for 'Sum_5' (op: 'Sum') with input shapes: [?], [] and with computed input tensors: input[1] = <1>.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\APT_interface.py", line 2421, in main
    run(args)
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\APT_interface.py", line 2178, in run
    train(lbl_file, nviews, name, args)
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\APT_interface.py", line 2038, in train
    train_mdn(conf, args, restore, split, split_file=split_file)
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\APT_interface.py", line 1971, in train_mdn
    self.train_umdn(restore=restore)
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\PoseUNet_resnet.py", line 639, in train_umdn
    learning_rate=learning_rate,restore=restore)
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\PoseCommon_dataset.py", line 588, in train
    self.cost = loss(self.inputs, self.pred)
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\PoseUNet_resnet.py", line 653, in loss
    dist_loss = self.dist_loss() / 10
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\PoseUNet_resnet.py", line 805, in dist_loss
    pp = ll[:,:, ndx] * tf.reduce_sum(sel_comp, axis=1)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1286, in reduce_sum_v1
    return reduce_sum(input_tensor, axis, keepdims, name)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\util\dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1334, in reduce_sum
    name=name))
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 9610, in _sum
    name=name)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op
    op_def=op_def)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\ops.py", line 1823, in __init__
    control_input_ops)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\ops.py", line 1662, in _create_c_op
    raise ValueError(str(e))
ValueError: Invalid reduction dimension 1 for input with 1 dimensions. for 'Sum_5' (op: 'Sum') with input shapes: [?], [] and with computed input tensors: input[1] = <1>.

You may need to manually kill any running DeepLearning process.

I'm using Windows 10 and MATLAB R2019a. The required memory for training with 100 frames and with 330 frames is the same (I checked in the Tracking Parameters window), so the error doesn't seem to be a memory issue. I'd appreciate any help.

Thank you, Junesu LEE
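
Editor's note: the ValueError in the traceback above says that a sum over axis 1 was requested on an input with only one dimension (the log lists the input shape as [?]). The following is a minimal plain-Python sketch of that rank check, not TensorFlow code and not APT's implementation; `reduced_shape` and the example shapes are hypothetical, and the thread does not show why the tensor had rank 1.

```python
def reduced_shape(shape, axis):
    """Return the shape left after summing over `axis`.

    Hypothetical helper for illustration: it mimics the rank check that
    produced the error in the log, it is not a TensorFlow function.
    """
    rank = len(shape)
    if not -rank <= axis < rank:
        raise ValueError(
            f"Invalid reduction dimension {axis} for input with {rank} dimensions."
        )
    axis %= rank  # normalise negative axes
    return shape[:axis] + shape[axis + 1:]


# A rank-2 input, e.g. (batch, components), can be summed over axis 1.
ok = reduced_shape((8, 5), axis=1)

# A rank-1 input only has axis 0, so axis=1 is rejected.
try:
    reduced_shape((8,), axis=1)
    message = None
except ValueError as err:
    message = str(err)
```

In other words, the reduction would only be valid if the tensor passed to it had at least two dimensions at graph-construction time.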

Junes94 commented 4 years ago

Before the dialog in the screenshot popped up, the Training Monitor showed the following status. You can ignore the timestamps below.

Jobs running:
ID 9, started 2020-07-01 00:50:27: running
ID 10, started 2020-07-01 00:50:27: running 
No jobs queued.

allenleetc commented 4 years ago

Hi June,

Whoa, this looks like a really interesting bug. I think I reproduced it.

A couple questions:

It sounds like you have managed to train and track successfully with this project in the past. Since then, have you changed any Tracking Parameters or the network? Or have you only added new labels?
In your Tracking Parameters, is Tracking Parameters>MDN>Predict confidence turned on? If so, did you turn this on or was it already/always on?

Thanks for the report; it's a "good" bug!

Junes94 commented 4 years ago

Hi, thanks for your quick support.

Yes. My first training run succeeded with 10000 iterations, and then I changed the parameter to 20000 iterations for better accuracy.
At the same time, I also switched 'Predict confidence' from OFF to ON.

After I switched 'Predict confidence' back from ON to OFF, training is now running (it seems to work well). I don't know whether it will be helpful, but here's my log.

Training started at 01-Jul-2020 10:14:40...
Your deep net type is: mdn
Your training backend is: Conda
Your training vizualizer is: TrainMonitorViz

Tensorflow resnet pretrained weights http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC.tar.gz already downloaded.
Tensorflow resnet pretrained weights http://download.tensorflow.org/models/resnet_v1_50_2016_08_28.tar.gz already downloaded.
Training with 328 rows.
Training data summary:
Group (mov): 1. nfrm=46, nfrmlbled=46.
Group (mov): 2. nfrm=51, nfrmlbled=51.
Group (mov): 3. nfrm=50, nfrmlbled=50.
Group (mov): 4. nfrm=34, nfrmlbled=34.
Group (mov): 5. nfrm=51, nfrmlbled=51.
Group (mov): 6. nfrm=40, nfrmlbled=40.
Group (mov): 7. nfrm=56, nfrmlbled=56.
Stripped lbl preproc data cache: exporting 328/328 training rows.
Saved stripped lbl file: C:\Users\MyPC\Documents.apt\tpfedf6176_ca55_4f12_af06_76d5e043bc3d\APTproject\20200701T101440_20200701T101441.lbl
Configuring background worker...
activate APT&& set CUDA_DEVICE_ORDER=PCI_BUS_ID&& set CUDA_VISIBLE_DEVICES=0&& python "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\APT_interface.py" -name 20200701T101440 -cache "C:\Users\MyPC\Documents.apt\tpfedf6176_ca55_4f12_af06_76d5e043bc3d" -err_file "C:\Users\MyPC\Documents.apt\tpfedf6176_ca55_4f12_af06_76d5e043bc3d\APTproject\20200701T101440_20200701T101441.err" -type mdn "C:\Users\MyPC\Documents.apt\tpfedf6176_ca55_4f12_af06_76d5e043bc3d\APTproject\20200701T101440_20200701T101441.lbl" train -use_cache > C:\Users\MyPC\Documents.apt\tpfedf6176_ca55_4f12_af06_76d5e043bc3d\APTproject\20200701T101440_20200701T101441_new.log 2>&1
Process job (movie 1, view 1) spawned, ID = 8:

I'll report back if I run into any more bugs. Thank you,

Junesu LEE
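
Editor's note: the "Training data summary" in the log can be cross-checked against the reported row count. The sketch below parses the summary text copied from the log with a regular expression and sums the labeled frames per movie group; the variable names and the regex are illustrative assumptions, not part of APT.

```python
import re

# Summary text copied from the training log above.
summary = (
    "Group (mov): 1. nfrm=46, nfrmlbled=46. "
    "Group (mov): 2. nfrm=51, nfrmlbled=51. "
    "Group (mov): 3. nfrm=50, nfrmlbled=50. "
    "Group (mov): 4. nfrm=34, nfrmlbled=34. "
    "Group (mov): 5. nfrm=51, nfrmlbled=51. "
    "Group (mov): 6. nfrm=40, nfrmlbled=40. "
    "Group (mov): 7. nfrm=56, nfrmlbled=56."
)

# One match per movie group: (movie index, nfrm, nfrmlbled).
pattern = re.compile(r"Group \(mov\): (\d+)\. nfrm=(\d+), nfrmlbled=(\d+)\.")

labeled_per_movie = {
    int(movie): int(labeled) for movie, _nfrm, labeled in pattern.findall(summary)
}
total_labeled = sum(labeled_per_movie.values())

print(len(labeled_per_movie), total_labeled)  # prints "7 328"
```

The total matches the "Training with 328 rows" line. The summary lists 7 movie groups, while the original post mentions 8 sets; the thread does not say why.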

mkabra commented 4 years ago

Hi June,

I've pushed a fix. Training with "Predict confidence" selected should now work.

Mayank

Junes94 commented 4 years ago

Hi Mayank, I'm grateful for your support and feedback.

Thank you, June

kristinbranson / APT

Training stops after additional labeling #337
