kristinbranson / APT

Animal Part Tracker
GNU General Public License v3.0

Training stops after additional labeling #337

Closed Junes94 closed 4 years ago

Junes94 commented 4 years ago

Hi, everyone. My project has multiview video (2 cameras), 8 sets in total, and I have run into a problem.

First, I labeled about 100 frames across all movies (200 frames in total from both viewpoints), and MDN training worked well (iteration=10000). After tracking my movies, I wanted to label additional frames, so I labeled about 330 frames (660 in total). However, when I clicked Train, a dialog popped up after a few minutes, as shown below.

[Screenshot: 주석 2020-07-01 000347]

The training monitor said:

No jobs running 
No jobs queued

Here's my log:


Training started at 01-Jul-2020 00:02:24...
Your deep net type is: mdn
Your training backend is: Conda
Your training vizualizer is: TrainMonitorViz

Training new model 20200701T000224.
Tensorflow resnet pretrained weights http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC.tar.gz already downloaded.
Tensorflow resnet pretrained weights http://download.tensorflow.org/models/resnet_v1_50_2016_08_28.tar.gz already downloaded.
Training with 328 rows.
Training data summary:
Group (mov): 1. nfrm=46, nfrmlbled=46.
Group (mov): 2. nfrm=51, nfrmlbled=51.
Group (mov): 3. nfrm=50, nfrmlbled=50.
Group (mov): 4. nfrm=34, nfrmlbled=34.
Group (mov): 5. nfrm=51, nfrmlbled=51.
Group (mov): 6. nfrm=40, nfrmlbled=40.
Group (mov): 7. nfrm=56, nfrmlbled=56.
Stripped lbl preproc data cache: exporting 328/328 training rows.
Saved stripped lbl file: C:\Users\MyPC\Documents.apt\tp348c88a6_4d07_4710_ab57_ab64622e6906\APTproject\20200701T000224_20200701T000224.lbl
Configuring background worker...
activate APT&& set CUDA_DEVICE_ORDER=PCI_BUS_ID&& set CUDA_VISIBLE_DEVICES=0&& python "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\APT_interface.py" -name 20200701T000224 -cache "C:\Users\MyPC\Documents.apt\tp348c88a6_4d07_4710_ab57_ab64622e6906" -err_file "C:\Users\MyPC\Documents.apt\tp348c88a6_4d07_4710_ab57_ab64622e6906\APTproject\20200701T000224_20200701T000224.err" -type mdn "C:\Users\MyPC\Documents.apt\tp348c88a6_4d07_4710_ab57_ab64622e6906\APTproject\20200701T000224_20200701T000224.lbl" train -use_cache > C:\Users\MyPC\Documents.apt\tp348c88a6_4d07_4710_ab57_ab64622e6906\APTproject\20200701T000224_20200701T000224_new.log 2>&1
Process job (movie 1, view 1) spawned, ID = 8:

Time to compute info statistic dx = 0.000985
Error occurred during train:

C:\Users\MyPC\Documents.apt\tp348c88a6_4d07_4710_ab57_ab64622e6906\APTproject\20200701T000224_20200701T000224.err

2020-07-01 00:02:34,801 C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\APT_interface.py main [ERROR] UNKNOWN: APT_interface errored
Traceback (most recent call last):
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\ops.py", line 1659, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid reduction dimension 1 for input with 1 dimensions. for 'Sum_5' (op: 'Sum') with input shapes: [?], [] and with computed input tensors: input[1] = <1>.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\APT_interface.py", line 2421, in main
    run(args)
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\APT_interface.py", line 2178, in run
    train(lbl_file, nviews, name, args)
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\APT_interface.py", line 2038, in train
    train_mdn(conf, args, restore, split, split_file=split_file)
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\APT_interface.py", line 1971, in train_mdn
    self.train_umdn(restore=restore)
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\PoseUNet_resnet.py", line 639, in train_umdn
    learning_rate=learning_rate,restore=restore)
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\PoseCommon_dataset.py", line 588, in train
    self.cost = loss(self.inputs, self.pred)
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\PoseUNet_resnet.py", line 653, in loss
    dist_loss = self.dist_loss() / 10
  File "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\PoseUNet_resnet.py", line 805, in dist_loss
    pp = ll[:,:, ndx] tf.reduce_sum(sel_comp, axis=1)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1286, in reduce_sum_v1
    return reduce_sum(input_tensor, axis, keepdims, name)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\util\dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1334, in reduce_sum
    name=name))
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 9610, in _sum
    name=name)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op
    op_def=op_def)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\ops.py", line 1823, in __init__
    control_input_ops)
  File "C:\Users\MyPC\anaconda3\envs\APT\lib\site-packages\tensorflow\python\framework\ops.py", line 1662, in _create_c_op
    raise ValueError(str(e))
ValueError: Invalid reduction dimension 1 for input with 1 dimensions. for 'Sum_5' (op: 'Sum') with input shapes: [?], [] and with computed input tensors: input[1] = <1>.

. You may need to manually kill any running DeepLearning process.
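
For readers unfamiliar with this class of error: the message reports that a sum was requested over axis 1 of an input that has only one dimension (input shape `[?]`, axis argument `<1>`). The snippet below is a minimal pure-Python sketch of that shape rule, not TensorFlow's actual shape-inference code. The function name `reduce_sum_output_shape` is hypothetical and exists only for illustration.

```python
def reduce_sum_output_shape(input_shape, axis):
    """Return the shape left after summing over `axis`, or raise ValueError.

    Hypothetical helper that mimics the rule behind the error message;
    it is not part of TensorFlow or APT.
    """
    rank = len(input_shape)
    # Valid axes run from -rank to rank - 1.
    if not -rank <= axis < rank:
        raise ValueError(
            f"Invalid reduction dimension {axis} for input with {rank} dimensions."
        )
    axis %= rank  # map negative axes onto their positive equivalent
    return input_shape[:axis] + input_shape[axis + 1:]


# A rank-2 input, e.g. (batch, components), can be summed over axis 1.
print(reduce_sum_output_shape((8, 5), 1))  # (8,)

# A rank-1 input, like the [?] shape in the log, has no axis 1.
try:
    reduce_sum_output_shape((8,), 1)
except ValueError as err:
    print(err)  # Invalid reduction dimension 1 for input with 1 dimensions.
```

In other words, the tensor reaching `tf.reduce_sum(sel_comp, axis=1)` had one dimension fewer than that call expects. The log alone does not say why.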


I used Windows 10 and MATLAB R2019a. The required memory for training with 100 frames and with 330 frames is the same (I checked in the tracking parameters window), so the error does not seem to be a memory issue. I would appreciate any help.

Thank you, Junesu LEE

Junes94 commented 4 years ago

Before the dialog in the screenshot popped up, the training monitor showed the following status. You can ignore the timestamps below.

Jobs running:
ID 9, started 2020-07-01 00:50:27: running
ID 10, started 2020-07-01 00:50:27: running 
No jobs queued.

allenleetc commented 4 years ago

Hi June,

Whoa, this looks like a really interesting bug. I think I reproduced it.

A couple questions:

Thanks for the report; it's a "good" bug!

Junes94 commented 4 years ago

Hi, thanks for your quick support.

  1. Yes, at first I succeeded with 10000 iterations, and then I changed the parameter to 20000 iterations for better accuracy.
  2. Also, at the same time, I changed 'Predict confidence' from OFF to ON.

After I changed 'Predict confidence' from ON back to OFF, training is now running (and seems to work well). I don't know whether it will be helpful, but here's my log.


Training started at 01-Jul-2020 10:14:40...
Your deep net type is: mdn
Your training backend is: Conda
Your training vizualizer is: TrainMonitorViz

Tensorflow resnet pretrained weights http://download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC.tar.gz already downloaded.
Tensorflow resnet pretrained weights http://download.tensorflow.org/models/resnet_v1_50_2016_08_28.tar.gz already downloaded.
Training with 328 rows.
Training data summary:
Group (mov): 1. nfrm=46, nfrmlbled=46.
Group (mov): 2. nfrm=51, nfrmlbled=51.
Group (mov): 3. nfrm=50, nfrmlbled=50.
Group (mov): 4. nfrm=34, nfrmlbled=34.
Group (mov): 5. nfrm=51, nfrmlbled=51.
Group (mov): 6. nfrm=40, nfrmlbled=40.
Group (mov): 7. nfrm=56, nfrmlbled=56.
Stripped lbl preproc data cache: exporting 328/328 training rows.
Saved stripped lbl file: C:\Users\MyPC\Documents.apt\tpfedf6176_ca55_4f12_af06_76d5e043bc3d\APTproject\20200701T101440_20200701T101441.lbl
Configuring background worker...
activate APT&& set CUDA_DEVICE_ORDER=PCI_BUS_ID&& set CUDA_VISIBLE_DEVICES=0&& python "C:\Users\MyPC\Desktop\forelimb실험신누\APT-develop\deepnet\APT_interface.py" -name 20200701T101440 -cache "C:\Users\MyPC\Documents.apt\tpfedf6176_ca55_4f12_af06_76d5e043bc3d" -err_file "C:\Users\MyPC\Documents.apt\tpfedf6176_ca55_4f12_af06_76d5e043bc3d\APTproject\20200701T101440_20200701T101441.err" -type mdn "C:\Users\MyPC\Documents.apt\tpfedf6176_ca55_4f12_af06_76d5e043bc3d\APTproject\20200701T101440_20200701T101441.lbl" train -use_cache > C:\Users\MyPC\Documents.apt\tpfedf6176_ca55_4f12_af06_76d5e043bc3d\APTproject\20200701T101440_20200701T101441_new.log 2>&1
Process job (movie 1, view 1) spawned, ID = 8:
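
The training data summary in the log above can be sanity-checked by adding up the labeled frames per movie and comparing the total with the "Training with 328 rows" line. The sketch below does this with Python's standard `re` module on an excerpt copied from the log. The regular expressions are an assumption based only on the lines shown here, not a documented APT log format.

```python
import re

# Excerpt copied from the training log in this thread.
log_text = (
    "Training with 328 rows. Training data summary: "
    "Group (mov): 1. nfrm=46, nfrmlbled=46. "
    "Group (mov): 2. nfrm=51, nfrmlbled=51. "
    "Group (mov): 3. nfrm=50, nfrmlbled=50. "
    "Group (mov): 4. nfrm=34, nfrmlbled=34. "
    "Group (mov): 5. nfrm=51, nfrmlbled=51. "
    "Group (mov): 6. nfrm=40, nfrmlbled=40. "
    "Group (mov): 7. nfrm=56, nfrmlbled=56. "
)

# Row count announced by the trainer.
declared_rows = int(re.search(r"Training with (\d+) rows", log_text).group(1))

# One (movie, nfrm, nfrmlbled) tuple per "Group (mov)" entry.
groups = re.findall(
    r"Group \(mov\): (\d+)\. nfrm=(\d+), nfrmlbled=(\d+)\.", log_text
)
labeled_per_movie = {int(mov): int(lbl) for mov, _nfrm, lbl in groups}
total_labeled = sum(labeled_per_movie.values())

print(len(labeled_per_movie), "movies listed")
print(total_labeled, declared_rows)  # 328 328
```

The per-movie counts add up to the declared 328 rows, and the summary lists seven movie groups.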


I'll post again if I find any more bugs. Thank you,

Junesu LEE

mkabra commented 4 years ago

Hi June,

I've pushed a fix. Training with "Predict confidence" selected should now work.

Mayank


Junes94 commented 4 years ago

Hi Mayank, I'm grateful for your support and feedback.

Thank you, June