Hi @amjltc295, I agree that printing the training time would be good, but there are some points that could be improved.
```
Train Epoch: 1 [45056/54000 (83%)] Loss: 0.523107 Time: 0.02s
    epoch          : 1
    epoch_time     : 3.8233487606048584
    loss           : 1.0713483376323052
    my_metric      : 0.642586723663522
    my_metric2     : 0.8484682095125786
    val_epoch_time : 2.062150239944458
    val_loss       : 0.2415582835674286
    val_my_metric  : 0.9347896852355072
    val_my_metric2 : 0.9883803215579711
Saving checkpoint: saved/Mnist_LeNet/1026_042054/checkpoint-epoch1.pth ...
Saving current best: model_best.pth ...
```
Since it is displayed on each logging step and just says 'Time', I think the batch time could easily be mistaken for the step time.
What do you think about using this format:

```
Train Epoch: 1 [45056/54000 (83%) 3.2s/4s eta] Loss: 0.523107
```

and just skipping the epoch time?
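To make the proposal concrete, this is roughly how I imagine the two numbers being computed (a minimal sketch; the names and values here are illustrative, not from the template):

```python
import time

# Sketch of "elapsed / projected total" progress for one epoch.
# n_batches and the sleep are placeholders for the real training loop.
n_batches = 844
epoch_start = time.time()
for batch_idx in range(n_batches):
    time.sleep(0.001)  # stands in for forward/backward/optimizer step
    elapsed = time.time() - epoch_start
    eta = elapsed / (batch_idx + 1) * n_batches  # projected epoch duration
    if batch_idx % 100 == 0:
        print(f'Train Epoch: 1 [{batch_idx}/{n_batches} '
              f'{elapsed:.1f}s/{eta:.0f}s eta] Loss: 0.523107')
```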
I also think that using datetime.now() would be better for consistency (with base_trainer), but this is a minor point compared with the above.
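For example, something along these lines (a rough sketch, not the actual base_trainer code):

```python
from datetime import datetime

# Rough sketch only: timing a span with datetime.now(), the call
# base_trainer already relies on, instead of mixing in time.time().
start = datetime.now()
# ... run one training step here ...
elapsed = datetime.now() - start  # a datetime.timedelta
print(f'Time: {elapsed.total_seconds():.2f}s')
```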
@SunQpark Sorry, I don't understand the meaning of `3.2s/4s eta`. Could you explain it a bit?
I think the purpose of printing the time is to let users estimate the total training time and check whether there is a bug that slows down the process. That's why I included both the batch time and the epoch time. It wouldn't be hard to tell the batch time from the step time; alternatively, we could change the label to `Batch time` or move it inside the `[ ]`.
Sorry for the late response, @amjltc295.
Previously I thought that printing the time would be helpful for tracking the progress of training and estimating when it will end. My comment was aimed at that, displaying `time passed / total time`, but now I understand that this is what you intended.
However, I'm still not sure that this feature is suitable for this project. Checking the training time for debugging makes sense, but I don't think we should check it for every batch. I think using `torch.autograd.profiler` before training is a more suitable solution. What do you think about this?
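For instance, something like the following could be run once before training starts (a rough sketch; the model and batch are placeholders, not the template's code):

```python
import torch
import torch.autograd.profiler as profiler

model = torch.nn.Linear(784, 10)  # placeholder for the real model
data = torch.randn(64, 784)       # placeholder batch

# Profile a few representative iterations once, up front, instead of
# timing every batch during the actual training run.
with profiler.profile() as prof:
    for _ in range(5):
        loss = model(data).sum()
        loss.backward()

# Per-operator timings aggregated over the profiled iterations.
print(prof.key_averages().table(sort_by='cpu_time_total'))
```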
`torch.autograd.profiler` is a new thing for me. It took some time for me to figure out how it works, but it seems to be a good solution.
I think it would be nice to know the training time. This could also be surfaced by changing the logging configuration, but the delta time may be hard to find if there are too many batches.
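For example, just a sketch assuming the standard logging module is in play (`%(relativeCreated)d` is the time in milliseconds since logging was loaded, so the difference between consecutive lines gives the step time):

```python
import logging

# Sketch: expose timing via the log format alone. The delta between
# consecutive relativeCreated values is the per-step time.
logging.basicConfig(
    format='%(relativeCreated)8dms %(message)s',
    level=logging.INFO,
)
logging.info('Train Epoch: 1 [45056/54000 (83%)] Loss: 0.523107')
```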