Here are example logs. There is apparently a weird discrepancy in the reported accuracy caused by a change introduced in cutorch a few months ago. I haven't had the chance to get to the bottom of it, but I think the reported loss is still consistent. It might vary a bit depending on the training/validation split.
I train for 100 epochs, then run it for a few more epochs with the learning rate cut down to 5e-5. After that, I found more training doesn't help validation performance, though train performance will continue to go up.
Hi @anewell I trained the network with the default settings. However, the training and validation accuracy are low. Here is the command used to train the model:
CUDA_VISIBLE_DEVICES=0 th main.lua -dataset mpii -expID default -dataDir ../data -expDir ../exp
and the train/valid logs are:
I ran the model on a single TITAN X GPU with CUDA 7.5, using the latest Torch and cuDNN v5.1.3. It takes more than three days for 100 epochs. Would you give me some advice on training the model? Thank you.
@bearpaw Did you achieve comparable results later? My results are very similar to yours.
@anewell According to your log, you used the same learning rate the whole time, as opposed to 'drop the learning rate once by a factor of 5 after validation accuracy plateaus'. Are we missing anything?
Have you pulled the latest version of the repository? The behavior of a Torch function changed, which messed up the on-the-fly accuracy calculation my code was doing. I pushed a fix pretty recently.
I usually drop the learning rate by 'branching' from the experiment, which starts a separate training log. The example I put up just showed the first stage of training.
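For concreteness, the branching step might look something like the following (a hypothetical sketch: the -branch and -LR flag names are my assumptions about the repo's options, not confirmed in this thread):

# Hypothetical: branch off the finished 'default' experiment and drop the learning rate to 5e-5
CUDA_VISIBLE_DEVICES=0 th main.lua -dataset mpii -branch default -expID default-lr-drop -LR 5e-5 -dataDir ../data -expDir ../exp

If the options work as assumed, this would start a fresh training log under the new expID while initializing from the old experiment's model.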
@anewell Did you mean these lines? It seems that it just masks out those uncertain predictions such that those preds are 1. I don't understand why this could improve performance, since a random guess should not be worse than just predicting 1 (unless the ground truth is also 1). Could you elaborate on that? Thanks!
There are many joints in MPII that are not annotated, often because they are cut out of the image. These cases should be ignored during evaluation; the bug that came up mistakenly included them, thus reporting misleadingly lower performance.
The accuracy function we call during training is just an approximation of the real evaluation function. Instead of the original annotation, we use the ground truth heatmap to get a joint's location. This is useful because if there is no ground truth annotation, or data augmentation crops a joint out of the input, we get a heatmap of all zeros, which serves as a clear signal to ignore that joint.
It used to be that when you called the max function in Torch you would get the first index where the max occurred, meaning that if the ground truth heatmap consisted of all zeros you would get (1,1). That was easy enough to check for and ignore. But at some point the max function changed to return any index with the max value, which broke my simple check that ignored joints at location (1,1). Suddenly all the experiments were reporting much lower accuracies. Fortunately, the fix was simple: manually check that the max value of the heatmap is zero, which is what you are seeing in those lines. That change was not meant to modify the network's predictions, but to correctly recover the ground truth values here.
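In code, the idea looks roughly like this (a minimal Lua/Torch sketch of my own, not the repo's actual implementation; the function name and the nJoints x h x w heatmap layout are assumptions):

local function getPredsWithMask(hm)
   -- hm: nJoints x h x w tensor of ground-truth heatmaps (assumed layout)
   local nJoints, w = hm:size(1), hm:size(3)
   local preds = torch.zeros(nJoints, 2)
   local valid = torch.ones(nJoints)
   for i = 1, nJoints do
      if hm[i]:max() == 0 then
         -- All-zero map: no annotation, or augmentation cropped the joint
         -- out. Checking max() == 0 is robust to whichever index torch.max
         -- returns; the old check for location (1,1) broke when that
         -- behavior changed.
         valid[i] = 0
      else
         -- Argmax over the flattened map, converted back to (x, y)
         local _, idx = hm[i]:view(-1):max(1)
         local j = idx[1]
         preds[i][1] = (j - 1) % w + 1              -- x (column)
         preds[i][2] = math.floor((j - 1) / w) + 1  -- y (row)
      end
   end
   return preds, valid
end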
Hope that clears it up!
@anewell I get it. It makes sense now. Thanks.
@anewell Would you mind telling me why adding 0.5 to all preds in validation would increase performance? Is there a systematic error in drawing the gaussian, or some other concern?
When drawing the gaussian, we transform a pixel location from the original image space to a much lower resolution value in the heatmap space. We round this value down to get an integer index that will serve as the center of the gaussian. So, for example, in our heatmap a wrist might appear at the location (24, 30), but the reality is that when doing that transformation the x value fell somewhere between 24.000 and 24.999. On average, the true x coordinate is in fact going to be closer to 24.5 (the center of the pixel) than 24.0 (the left edge of the pixel).
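As a tiny numeric illustration (the values and the 4x scale are my own example, assuming a 256x256 input and a 64x64 heatmap):

-- Illustrative numbers only: mapping an image x-coordinate into heatmap space.
local xImage = 98.7                 -- wrist x in the original image
local xHeatmap = xImage / 4         -- 24.675 in 64x64 heatmap space
local xIndex = math.floor(xHeatmap) -- 24: the gaussian's center pixel
-- Every true x in [24.0, 25.0) floors to 24, so the quantization error is
-- uniform on [0, 1) with mean 0.5. Reporting the pixel center rather than
-- its left edge halves the expected error.
local xPred = xIndex + 0.5          -- 24.5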
Hi, would you share the train.log and valid.log for reference? It could be a great help for monitoring the training process.
P.S. Did you train the model for only 100 epochs? The paper mentions that you drop the learning rate several times. If so, at which epochs did you drop the learning rate? Thank you.