Open rashidch opened 4 years ago
Me too.
Same question...
Why is the loss not decreasing? How do I get correct configuration parameters for building a custom dataset?
Hello, I have met the same problem, and the loss does not decrease. Have you solved it? I am looking forward to your reply, many thanks!
Hey, No!
Same here, sadly...
```
INFO:mmcv.runner.runner:workflow: [('train', 5), ('val', 1)], max: 65 epochs
INFO:mmcv.runner.runner:Epoch [1][100/125] lr: 0.10000, eta: 0:16:44, time: 0.125, data_time: 0.015, memory: 858, loss: 2.3419
[... per-epoch train lines abridged; the loss stays near 2.34 for all 65 epochs ...]
INFO:mmcv.runner.runner:Epoch(train) [5][6] loss: 2.3335, top1: 0.1458, top5: 0.5312
INFO:mmcv.runner.runner:Epoch(train) [10][6] loss: 2.3405, top1: 0.1250, top5: 0.5104
INFO:mmcv.runner.runner:Epoch(train) [15][6] loss: 2.3434, top1: 0.1250, top5: 0.4792
INFO:mmcv.runner.runner:Epoch(train) [20][6] loss: 2.3341, top1: 0.1250, top5: 0.5312
INFO:mmcv.runner.runner:Epoch [21][100/125] lr: 0.01000, eta: 0:04:37, time: 0.059, data_time: 0.016, memory: 859, loss: 2.3390
INFO:mmcv.runner.runner:Epoch(train) [25][6] loss: 2.3479, top1: 0.1250, top5: 0.5000
INFO:mmcv.runner.runner:Epoch(train) [30][6] loss: 2.3513, top1: 0.1250, top5: 0.4896
INFO:mmcv.runner.runner:Epoch [31][100/125] lr: 0.00100, eta: 0:03:30, time: 0.059, data_time: 0.017, memory: 859, loss: 2.3327
INFO:mmcv.runner.runner:Epoch(train) [35][6] loss: 2.3447, top1: 0.1458, top5: 0.5521
INFO:mmcv.runner.runner:Epoch(train) [40][6] loss: 2.3345, top1: 0.1458, top5: 0.5417
INFO:mmcv.runner.runner:Epoch [41][100/125] lr: 0.00010, eta: 0:02:27, time: 0.060, data_time: 0.015, memory: 859, loss: 2.3414
INFO:mmcv.runner.runner:Epoch(train) [45][6] loss: 2.3390, top1: 0.1042, top5: 0.5208
INFO:mmcv.runner.runner:Epoch(train) [50][6] loss: 2.3346, top1: 0.1354, top5: 0.5000
INFO:mmcv.runner.runner:Epoch [51][100/125] lr: 0.00001, eta: 0:01:26, time: 0.060, data_time: 0.016, memory: 859, loss: 2.3390
INFO:mmcv.runner.runner:Epoch(train) [55][6] loss: 2.3434, top1: 0.1458, top5: 0.5208
INFO:mmcv.runner.runner:Epoch(train) [60][6] loss: 2.3233, top1: 0.1458, top5: 0.5312
INFO:mmcv.runner.runner:Epoch(train) [65][6] loss: 2.3542, top1: 0.1250, top5: 0.5104
```
This is my training log: 10 categories, 10 samples per category. Is this training correct?
I found that the core training phase is done in the mmcv module (on my machine, at /xxxxxxx/miniconda3/lib/python3.7/site-packages/mmcv-0.4.3-py3.7-linux-x86_64.egg/mmcv/runner/runner.py):

```python
def train(self, data_loader, **kwargs):
    self.model.train()
    self.mode = 'train'
    self.data_loader = data_loader
    self._max_iters = self._max_epochs * len(data_loader)
    self.call_hook('before_train_epoch')
    for i, data_batch in enumerate(data_loader):
        self._inner_iter = i
        self.call_hook('before_train_iter')
        outputs = self.batch_processor(
            self.model, data_batch, train_mode=True, **kwargs)
        if not isinstance(outputs, dict):
            raise TypeError('batch_processor() must return a dict')
        if 'log_vars' in outputs:
            self.log_buffer.update(outputs['log_vars'],
                                   outputs['num_samples'])
        self.outputs = outputs
        self.optimizer.zero_grad()
        self.outputs['loss'].backward()
        self.optimizer.step()
        self.call_hook('after_train_iter')
        self._iter += 1
    self.call_hook('after_train_epoch')
    self._epoch += 1
```
The loss backward operation is supposed to be done by the hook function in /share/jiawenhao/miniconda3/lib/python3.7/site-packages/mmcv-0.4.3-py3.7-linux-x86_64.egg/mmcv/runner/hooks/optimizer.py:

```python
def after_train_iter(self, runner):
    runner.optimizer.zero_grad()
    runner.outputs['loss'].backward()
    if self.grad_clip is not None:
        self.clip_grads(runner.model.parameters())
    runner.optimizer.step()
```
I do not know why that function does not actually run. **So I manually added these operations in runner.py, and then the loss could decrease:**

```python
self.optimizer.zero_grad()
self.outputs['loss'].backward()
self.optimizer.step()
```
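For context, here is a minimal stand-alone sketch (not the project's code; the linear model and random data are placeholders for ST_GCN_18 and the skeleton loader) of why those three calls matter: if `zero_grad`/`backward`/`step` never run in an iteration, the weights are never updated and the loss stays flat, exactly as in the log above.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)            # placeholder for the real backbone
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn(16, 4)                   # placeholder batch
y = torch.randint(0, 2, (16,))

initial = criterion(model(x), y).item()
for _ in range(50):
    loss = criterion(model(x), y)
    optimizer.zero_grad()                # clear gradients from the last iteration
    loss.backward()                      # compute fresh gradients
    optimizer.step()                     # apply the update -- the missing piece
final = criterion(model(x), y).item()
print(initial, final)
```

With the three calls in place the final loss is clearly below the initial one; commenting them out leaves it unchanged.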
Thanks, I modified the code according to your suggestion; after that, training is completely correct.
Finally, I modified the training_hooks configuration of the train.yaml file, and the changes are as follows:
```yaml
training_hooks:
  lr_config:
    policy: 'step'
    step: [20, 30, 40, 50]
  log_config:
    interval: 100
    hooks:
      - type: TextLoggerHook
  checkpoint_config:
    interval: 5
  optimizer_config:
    grad_clip:
```
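One plausible reading of why adding the `optimizer_config` entry helps (a simplified, hypothetical sketch of the hook-dispatch pattern, not mmcv's actual code): the runner only invokes hooks that were registered from the config, so if no optimizer hook is registered, nothing ever calls `backward()` and `step()`.

```python
# Hypothetical, stripped-down version of the hook mechanism. The real
# OptimizerHook does zero_grad/backward/step; here a flag stands in for it.
class OptimizerHook:
    def __init__(self, grad_clip=None):
        self.grad_clip = grad_clip          # None simply disables clipping

    def after_train_iter(self, runner):
        runner.stepped = True               # stands in for zero_grad/backward/step

class Runner:
    def __init__(self, hooks):
        self.hooks = hooks
        self.stepped = False

    def train_iter(self):
        for hook in self.hooks:             # only registered hooks ever run
            hook.after_train_iter(self)

no_opt = Runner(hooks=[])                   # config without optimizer_config
no_opt.train_iter()

with_opt = Runner(hooks=[OptimizerHook(grad_clip=None)])
with_opt.train_iter()
print(no_opt.stepped, with_opt.stepped)     # False True
```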
Hey, can you share your complete train.yaml and the final values of the loss, training accuracy, and test accuracy?
```
INFO:mmcv.runner.runner:workflow: [('train', 5), ('val', 1)], max: 65 epochs
INFO:mmcv.runner.runner:Epoch [1][100/116] lr: 0.10000, eta: 0:17:31, time: 0.141, data_time: 0.007, memory: 456, loss: 2.2317
[... per-epoch train lines abridged; the loss falls from 2.23 to ~0.003 ...]
INFO:mmcv.runner.runner:Epoch(train) [5][5] loss: 1.4931, top1: 0.3500, top5: 1.0000
INFO:mmcv.runner.runner:Epoch(train) [10][5] loss: 1.2552, top1: 0.5125, top5: 1.0000
INFO:mmcv.runner.runner:Epoch(train) [15][5] loss: 0.1492, top1: 1.0000, top5: 1.0000
INFO:mmcv.runner.runner:Epoch(train) [20][5] loss: 0.0752, top1: 1.0000, top5: 1.0000
INFO:mmcv.runner.runner:Epoch [21][100/116] lr: 0.01000, eta: 0:05:45, time: 0.074, data_time: 0.011, memory: 456, loss: 0.0129
INFO:mmcv.runner.runner:Epoch(train) [25][5] loss: 0.0363, top1: 1.0000, top5: 1.0000
INFO:mmcv.runner.runner:Epoch(train) [30][5] loss: 0.0447, top1: 1.0000, top5: 1.0000
INFO:mmcv.runner.runner:Epoch(train) [35][5] loss: 0.0343, top1: 1.0000, top5: 1.0000
INFO:mmcv.runner.runner:Epoch(train) [40][5] loss: 0.0408, top1: 1.0000, top5: 1.0000
INFO:mmcv.runner.runner:Epoch(train) [45][5] loss: 0.0346, top1: 1.0000, top5: 1.0000
INFO:mmcv.runner.runner:Epoch(train) [50][5] loss: 0.0390, top1: 1.0000, top5: 1.0000
INFO:mmcv.runner.runner:Epoch(train) [55][5] loss: 0.0326, top1: 1.0000, top5: 1.0000
INFO:mmcv.runner.runner:Epoch(train) [60][5] loss: 0.0372, top1: 1.0000, top5: 1.0000
INFO:mmcv.runner.runner:Epoch(train) [65][5] loss: 0.0310, top1: 1.0000, top5: 1.0000

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 93/93, 89.6 task/s, elapsed: 1s, ETA: 0s
Top 1: 100.00%  Top 5: 100.00%
```
Hey, Did you use the same training configuration file as example_dataset?
train.yaml
```yaml
argparse_cfg:
  gpus:
    bind_to: processor_cfg.gpus
    help: number of gpus
  work_dir:
    bind_to: processor_cfg.work_dir
    help: the dir to save logs and models
  batch_size:
    bind_to: processor_cfg.batch_size
  resume_from:
    bind_to: processor_cfg.resume_from
    help: the checkpoint file to resume from

processor_cfg:
  type: 'processor.recognition.train'
  workers: 16

  # model setting
  model_cfg:
    type: 'models.backbones.ST_GCN_18'
    in_channels: 3
    num_class: 10
    edge_importance_weighting: True
    graph_cfg:
      layout: 'coco'
      strategy: 'spatial'
  loss_cfg:
    type: 'torch.nn.CrossEntropyLoss'

  # dataset setting
  dataset_cfg:
    # training set
    - type: "datasets.DataPipeline"
      data_source:
        type: "datasets.SkeletonLoader"
        data_dir: ./data/actions_as_space_time_shapes
        num_track: 2
        num_keypoints: 17
        repeat: 20
      pipeline:
        - {type: "datasets.skeleton.normalize_by_resolution"}
        - {type: "datasets.skeleton.mask_by_visibility"}
        - {type: "datasets.skeleton.pad_zero", size: 150}
        - {type: "datasets.skeleton.random_crop", size: 150}
        - {type: "datasets.skeleton.simulate_camera_moving"}
        - {type: "datasets.skeleton.transpose", order: [0, 2, 1, 3]}
        - {type: "datasets.skeleton.to_tuple"}
    # validation set
    - type: "datasets.DataPipeline"
      data_source:
        type: "datasets.SkeletonLoader"
        data_dir: ./data/actions_as_space_time_shapes
        num_track: 2
        num_keypoints: 17
      pipeline:
        - {type: "datasets.skeleton.normalize_by_resolution"}
        - {type: "datasets.skeleton.mask_by_visibility"}
        - {type: "datasets.skeleton.pad_zero", size: 300}
        - {type: "datasets.skeleton.random_crop", size: 300}
        - {type: "datasets.skeleton.transpose", order: [0, 2, 1, 3]}
        - {type: "datasets.skeleton.to_tuple"}

  # dataloader setting
  batch_size: 16
  gpus: 4

  # optimizer setting
  optimizer_cfg:
    type: 'torch.optim.SGD'
    lr: 0.1
    momentum: 0.9
    nesterov: true
    weight_decay: 0.0001

  # runtime setting
  workflow: [['train', 5], ['val', 1]]
  work_dir: ./work_dir/recognition/st_gcn/actions_as_space_time_shapes
  total_epochs: 65
  training_hooks:
    lr_config:
      policy: 'step'
      step: [20, 30, 40, 50]
    log_config:
      interval: 100
      hooks:
        - type: TextLoggerHook
    checkpoint_config:
      interval: 5
    optimizer_config:
      grad_clip:
  resume_from:
  load_from:
```
Thanks! I will try my training again and show my results.
Great, thank you 👍 So the only diff is adding the option grad_clip?
Yes
I can't thank the people on this thread who found the error enough!
Hey,
Did anyone get this error after adding code to runner.py?
`RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.`
Hey,
Does it work for you?
The RuntimeError looks like you are calling backward twice? I did not meet it before...
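That diagnosis can be reproduced in isolation: if both the manually added lines in runner.py and the hook's after_train_iter run, backward() is called twice over the same graph, which raises this exact error. A minimal sketch, independent of mmcv:

```python
import torch

x = torch.ones(1, requires_grad=True)
loss = (x * x).sum()
loss.backward()          # first backward frees the graph's intermediate buffers
raised = False
try:
    loss.backward()      # second backward over the same graph
except RuntimeError:
    raised = True        # the same RuntimeError as reported above
print(raised)            # True
```

If that is the cause, the fix is to make sure backward() runs exactly once per iteration: keep either the manual zero_grad/backward/step lines or the OptimizerHook, not both.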
The process is still running, and the loss function is decreasing, which was not the case before the following modification:
- Add grad_clip: under optimizer_config: in the training.yaml file

I did not change anything under /share/jiawenhao/miniconda3/lib/python3.7/site-packages/mmcv-0.4.3-py3.7-linux-x86_64.egg/mmcv/runner/hooks/optimizer.py
@CamilleMaurice @jiawenhao2015 Ok. Thank you.
Hey,
Does anyone have an idea how to get the result on a single video for a trained model?
@rashidch Have you tried to create a configuration file similar to test.yaml ?
Yeah.
@rashidch Then you are able to get the result on a single video for a trained model through using test.yaml but you are looking for a more flexible way ?
Right now, I only get test accuracy on the test data. I have not implemented single-video inference yet; I want to, but I have been a little busy.
I want to implement single-video inference that shows, frame by frame, the actions recognized by the system in the video.
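A minimal sketch of such frame-by-frame inference, assuming a trained recognizer, a pose-extraction function, and a label list (`model`, `extract_skeleton`, and `class_names` are placeholders here, not part of mmskeleton's API):

```python
import torch

def classify_windows(model, frames, extract_skeleton, class_names, window=30):
    """Run a trained recognizer over sliding windows of per-frame skeletons.

    Returns a list of (frame_index, action_name) pairs, one per full window.
    """
    model.eval()
    buffer, results = [], []
    with torch.no_grad():
        for idx, frame in enumerate(frames):
            buffer.append(extract_skeleton(frame))   # (C, V) joints for one frame
            if len(buffer) == window:
                # (1, C, T, V, M=1), the layout an ST-GCN style model expects
                clip = torch.stack(buffer, dim=1).unsqueeze(0).unsqueeze(-1)
                pred = model(clip).argmax(dim=1).item()
                results.append((idx, class_names[pred]))
                buffer.clear()
    return results
```

With this, each recognized action can be overlaid on the corresponding frames when rendering the video.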
Finally, I modified the training_hooks configuration of the train.yaml file, and the changes are as follows:
```yaml
training_hooks:
  lr_config:
    policy: 'step'
    step: [20, 30, 40, 50]
  log_config:
    interval: 100
    hooks:
      - type: TextLoggerHook
  checkpoint_config:
    interval: 5
  optimizer_config:
    grad_clip:
```
Thank you very much. It worked for me.
Out of curiosity what is grad_clip?
@vivek87799 Did you get your answer now? I want to know what is grad_clip, too
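As far as I understand, grad_clip configures gradient clipping: mmcv's OptimizerHook forwards the grad_clip keys to torch.nn.utils.clip_grad_norm_, which rescales the gradients so their total norm does not exceed max_norm before the optimizer step. A standalone sketch of the same three-step pattern (the model and loss here are just stand-ins):

```python
import torch
import torch.nn as nn

# Stand-in model and loss, only to demonstrate the clipping step.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(4, 10)).sum()
optimizer.zero_grad()
loss.backward()
# Rescale gradients so their combined L2 norm is at most max_norm;
# returns the (pre-clipping) total norm.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Clipping mainly helps when gradients occasionally explode; it does not by itself explain why the loss was stuck.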
Have you fixed it?
@rashidch @jiawenhao2015 After training the model and obtaining the test results, how do I print out the classification result for a single video?
Have you fixed it? Why does this happen? Can you tell me?
Thank you! It's running!
Can you tell me how big your data set is?
Hey, I want to prepare a custom dataset from videos with the following actions:

- CALL: answer phone call
- COUG: cough
- DRIN: drink water
- SCRA: scratch head
- SNEE: sneeze
- STRE: stretch arms
- WAVE: wave hand
- WIPE: wipe glasses
I am using this dataset: https://web.bii.a-star.edu.sg/~chengli/FluRecognition.html
Can you explain to me the following terms from build_dataset_example.yaml?
How should I calculate image_size, pixel_std, image_mean, and image_std for this video dataset?
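One common way to estimate image_mean and image_std is to average per-channel statistics over frames sampled from the videos. A sketch (this is not mmskeleton's own tooling; the random frames below are stand-ins for images loaded with e.g. `cv2.imread(...) / 255.0`):

```python
import numpy as np

def channel_stats(frames):
    """Per-channel mean and std over a list of HxWx3 arrays scaled to [0, 1]."""
    pixels = np.concatenate([f.reshape(-1, 3) for f in frames], axis=0)
    return pixels.mean(axis=0), pixels.std(axis=0)

# Stand-in frames; replace with frames sampled from your video dataset.
frames = [np.random.rand(8, 8, 3) for _ in range(4)]
mean, std = channel_stats(frames)
```

image_size and pixel_std, by contrast, come from the pose-estimation config (input resolution and keypoint normalization), not from the data itself.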
I tried preparing the dataset with the default parameters and started training, but the training loss does not decrease and the accuracy is 0.000.
```
INFO:mmcv.runner.runner:Epoch [11][100/840] lr: 0.10000, eta: 0:43:07, time: 0.060, data_time: 0.026, memory: 2344, loss: 2.4426
INFO:mmcv.runner.runner:Epoch [11][200/840] lr: 0.10000, eta: 0:43:02, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4421
INFO:mmcv.runner.runner:Epoch [11][300/840] lr: 0.10000, eta: 0:42:58, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4406
INFO:mmcv.runner.runner:Epoch [11][400/840] lr: 0.10000, eta: 0:42:53, time: 0.059, data_time: 0.025, memory: 2344, loss: 2.4408
INFO:mmcv.runner.runner:Epoch [11][500/840] lr: 0.10000, eta: 0:42:49, time: 0.059, data_time: 0.025, memory: 2344, loss: 2.4415
INFO:mmcv.runner.runner:Epoch [11][600/840] lr: 0.10000, eta: 0:42:44, time: 0.059, data_time: 0.025, memory: 2344, loss: 2.4418
INFO:mmcv.runner.runner:Epoch [11][700/840] lr: 0.10000, eta: 0:42:40, time: 0.058, data_time: 0.023, memory: 2344, loss: 2.4420
INFO:mmcv.runner.runner:Epoch [11][800/840] lr: 0.10000, eta: 0:42:35, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4426
INFO:mmcv.runner.runner:Epoch [12][100/840] lr: 0.10000, eta: 0:42:18, time: 0.061, data_time: 0.027, memory: 2344, loss: 2.4422
INFO:mmcv.runner.runner:Epoch [12][200/840] lr: 0.10000, eta: 0:42:14, time: 0.059, data_time: 0.026, memory: 2344, loss: 2.4419
INFO:mmcv.runner.runner:Epoch [12][300/840] lr: 0.10000, eta: 0:42:10, time: 0.059, data_time: 0.026, memory: 2344, loss: 2.4410
INFO:mmcv.runner.runner:Epoch [12][400/840] lr: 0.10000, eta: 0:42:05, time: 0.058, data_time: 0.025, memory: 2344, loss: 2.4407
INFO:mmcv.runner.runner:Epoch [12][500/840] lr: 0.10000, eta: 0:42:01, time: 0.059, data_time: 0.026, memory: 2344, loss: 2.4412
INFO:mmcv.runner.runner:Epoch [12][600/840] lr: 0.10000, eta: 0:41:57, time: 0.059, data_time: 0.025, memory: 2344, loss: 2.4424
INFO:mmcv.runner.runner:Epoch [12][700/840] lr: 0.10000, eta: 0:41:52, time: 0.059, data_time: 0.025, memory: 2344, loss: 2.4421
INFO:mmcv.runner.runner:Epoch [12][800/840] lr: 0.10000, eta: 0:41:47, time: 0.059, data_time: 0.024, memory: 2344, loss: 2.4425
INFO:mmcv.runner.runner:Epoch [13][100/840] lr: 0.10000, eta: 0:41:31, time: 0.060, data_time: 0.025, memory: 2344, loss: 2.4422
INFO:mmcv.runner.runner:Epoch [13][200/840] lr: 0.10000, eta: 0:41:27, time: 0.059, data_time: 0.025, memory: 2344, loss: 2.4421
INFO:mmcv.runner.runner:Epoch [13][300/840] lr: 0.10000, eta: 0:41:22, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4399
INFO:mmcv.runner.runner:Epoch [13][400/840] lr: 0.10000, eta: 0:41:17, time: 0.058, data_time: 0.025, memory: 2344, loss: 2.4418
INFO:mmcv.runner.runner:Epoch [13][500/840] lr: 0.10000, eta: 0:41:13, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4417
INFO:mmcv.runner.runner:Epoch [13][600/840] lr: 0.10000, eta: 0:41:08, time: 0.059, data_time: 0.025, memory: 2344, loss: 2.4422
INFO:mmcv.runner.runner:Epoch [13][700/840] lr: 0.10000, eta: 0:41:03, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4426
INFO:mmcv.runner.runner:Epoch [13][800/840] lr: 0.10000, eta: 0:40:59, time: 0.059, data_time: 0.024, memory: 2344, loss: 2.4421
INFO:mmcv.runner.runner:Epoch [14][100/840] lr: 0.10000, eta: 0:40:43, time: 0.060, data_time: 0.025, memory: 2344, loss: 2.4422
INFO:mmcv.runner.runner:Epoch [14][200/840] lr: 0.10000, eta: 0:40:39, time: 0.058, data_time: 0.025, memory: 2344, loss: 2.4421
INFO:mmcv.runner.runner:Epoch [14][300/840] lr: 0.10000, eta: 0:40:34, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4413
INFO:mmcv.runner.runner:Epoch [14][400/840] lr: 0.10000, eta: 0:40:29, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4411
INFO:mmcv.runner.runner:Epoch [14][500/840] lr: 0.10000, eta: 0:40:24, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4424
INFO:mmcv.runner.runner:Epoch [14][600/840] lr: 0.10000, eta: 0:40:19, time: 0.058, data_time: 0.025, memory: 2344, loss: 2.4417
INFO:mmcv.runner.runner:Epoch [14][700/840] lr: 0.10000, eta: 0:40:15, time: 0.059, data_time: 0.024, memory: 2344, loss: 2.4424
INFO:mmcv.runner.runner:Epoch [14][800/840] lr: 0.10000, eta: 0:40:10, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4424
INFO:mmcv.runner.runner:Epoch [15][100/840] lr: 0.10000, eta: 0:39:55, time: 0.060, data_time: 0.025, memory: 2344, loss: 2.4422
INFO:mmcv.runner.runner:Epoch [15][200/840] lr: 0.10000, eta: 0:39:50, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4419
INFO:mmcv.runner.runner:Epoch [15][300/840] lr: 0.10000, eta: 0:39:45, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4410
INFO:mmcv.runner.runner:Epoch [15][400/840] lr: 0.10000, eta: 0:39:40, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4410
INFO:mmcv.runner.runner:Epoch [15][500/840] lr: 0.10000, eta: 0:39:36, time: 0.059, data_time: 0.024, memory: 2344, loss: 2.4421
INFO:mmcv.runner.runner:Epoch [15][600/840] lr: 0.10000, eta: 0:39:31, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4411
INFO:mmcv.runner.runner:Epoch [15][700/840] lr: 0.10000, eta: 0:39:26, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4417
INFO:mmcv.runner.runner:Epoch [15][800/840] lr: 0.10000, eta: 0:39:21, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4419
INFO:mmcv.runner.runner:Epoch(train) [15][18] loss: 2.2971, top1: 0.0000, top5: 0.0000
INFO:mmcv.runner.runner:Epoch [16][100/840] lr: 0.10000, eta: 0:39:07, time: 0.060, data_time: 0.025, memory: 2344, loss: 2.4426
INFO:mmcv.runner.runner:Epoch [16][200/840] lr: 0.10000, eta: 0:39:02, time: 0.058, data_time: 0.025, memory: 2344, loss: 2.4419
INFO:mmcv.runner.runner:Epoch [16][300/840] lr: 0.10000, eta: 0:38:57, time: 0.058, data_time: 0.024, memory: 2344, loss: 2.4407
INFO:mmcv.runner.runner:Epoch [16][400/840] lr: 0.10000, eta: 0:38:52, time: 0.058, data_time: 0.025, memory: 2344, loss: 2.4410
```
I have used the following train.yaml:
```yaml
argparse_cfg:
  gpus:
    bind_to: processor_cfg.gpus
    help: number of gpus
  work_dir:
    bind_to: processor_cfg.work_dir
    help: the dir to save logs and models
  batch_size:
    bind_to: processor_cfg.batch_size
  resume_from:
    bind_to: processor_cfg.resume_from
    help: the checkpoint file to resume from

processor_cfg:
  type: 'processor.recognition.train'
  workers: 2

  # model setting
  model_cfg:
    type: 'models.backbones.ST_GCN_18'
    in_channels: 3
    num_class: 8
    edge_importance_weighting: True
    graph_cfg:
      layout: 'coco'
      strategy: 'spatial'
  loss_cfg:
    type: 'torch.nn.CrossEntropyLoss'

  # dataset setting
  dataset_cfg:
    # training set

  # dataloader setting
  batch_size: 32
  gpus: 3

  # optimizer setting
  optimizer_cfg:
    type: 'torch.optim.SGD'
    lr: 0.1
    momentum: 0.9
    nesterov: true
    weight_decay: 0.0001

  # runtime setting
  workflow: [['train', 5], ['val', 1]]
  work_dir: ./work_dir/recognition/st_gcn/symptoms_data
  total_epochs: 65
  training_hooks:
    lr_config:
      policy: 'step'
      step: [20, 30, 40, 50]
    log_config:
      interval: 100
      hooks:
```
and build_dataset_example.yaml:
```yaml
processor_cfg:
  type: "processor.skeleton_dataset.build"
  gpus: 1
  worker_per_gpu: 2
  video_dir: data/symptoms_data/videos
  out_dir: "data/symptoms_data/dataset"
  category_annotation: resource/category_annotations_symptoms.json
  detection_cfg:
    model_cfg: configs/mmdet/cascade_rcnn_r50_fpn_1x.py
    checkpoint_file: mmskeleton://mmdet/cascade_rcnn_r50_fpn_20e
    bbox_thre: 0.8
  estimation_cfg:
    model_cfg: configs/pose_estimation/hrnet/pose_hrnet_w32_256x192_test.yaml
    checkpoint_file: mmskeleton://pose_estimation/pose_hrnet_w32_256x192
  data_cfg:
    image_size:

argparse_cfg:
  gpus:
    bind_to: processor_cfg.gpus
    help: number of gpus
  worker_per_gpu:
    bind_to: processor_cfg.worker_per_gpu
    help: number of workers for each gpu
  video_dir:
    bind_to: processor_cfg.video_dir
    help: folder for videos
  category_annotation:
    bind_to: processor_cfg.category_annotation
    help: a json file recording video category annotation
  out_dir:
    bind_to: processor_cfg.out_dir
    help: folder for storing output dataset
  skeleton_model:
    bind_to: processor_cfg.estimation_cfg.model_cfg
  skeleton_checkpoint:
    bind_to: processor_cfg.estimation_cfg.checkpoint_file
  detection_model:
    bind_to: processor_cfg.detection_cfg.model_cfg
  detection_checkpoint:
    bind_to: processor_cfg.detection_cfg.checkpoint_file
```
Why is the loss not decreasing? How do I get the correct configuration parameters for building a custom dataset?