Here are example logs. There is apparently a weird discrepancy in the reported accuracy caused by a change introduced in cutorch a few months ago. I haven't had the chance to get to the bottom of it, but I think the reported loss is still consistent. It might vary a bit depending on the training/validation split.
I train for 100 epochs, then run it for a few more epochs with the learning rate cut down to 5e-5. After that, I found more training doesn't help validation performance, though train performance will continue to go up.
Hi @anewell I trained the network with the default settings. However, the training and validation accuracy are low. Here is the command used to train the model:
CUDA_VISIBLE_DEVICES=0 th main.lua -dataset mpii -expID default -dataDir ../data -expDir ../exp
and the train/valid logs are:
I ran the model on a single TITAN X GPU with CUDA 7.5, using the latest Torch and cuDNN v5.1.3. It takes more than three days for 100 epochs. Would you give me some advice on training the model? Thank you.
@bearpaw Did you achieve comparable results later? My results are very similar to yours.
@anewell According to your log, you used the same learning rate the whole time, as opposed to 'drop the learning rate once by a factor of 5 after validation accuracy plateaus'. Are we missing anything?
Have you pulled the latest version of the repository? The behavior of a Torch function changed, which messed up the on-the-fly accuracy calculation my code was doing. I pushed a fix pretty recently.
I usually drop the learning rate by 'branching' from the experiment, which starts a separate training log. The example I put up just showed the first stage of training.
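For concreteness, the branching step might look something like the following (a hypothetical sketch: the -branch and -LR flag names are my assumptions about the repo's options, not confirmed in this thread):

# Hypothetical: branch off the finished 'default' experiment and drop the learning rate to 5e-5
CUDA_VISIBLE_DEVICES=0 th main.lua -dataset mpii -branch default -expID default-lr-drop -LR 5e-5 -dataDir ../data -expDir ../exp

If the options work as assumed, this would start a fresh training log under the new expID while initializing from the old experiment's model.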
@anewell Did you mean these lines? It seems that it just masks out those uncertain predictions such that those preds are 1. I don't understand why this could improve performance, since a random guess should not be worse than just predicting 1 (unless the ground truth is also 1). Could you elaborate on that? Thanks!
There are many joints in MPII that are not annotated, often because they are cut out of the image. These cases should be ignored during evaluation; the bug that came up mistakenly included them, thus reporting misleadingly lower performance.
The accuracy function we call during training is just an approximation of the real evaluation function. Instead of the original annotation, we use the ground truth heatmap to get a joint's location. This is useful because if there is no ground truth annotation, or data augmentation crops a joint out of the input, we get a heatmap of all zeros, which serves as a clear signal to ignore that joint.
It used to be that when you called the max function in Torch you would get the first index where the max occurred, meaning that if the ground truth heatmap consisted of all zeros you would get (1,1). That was easy enough to check for and ignore. But at some point the max function changed to return any index with the max value, which broke my simple check that ignored joints at location (1,1). Suddenly all the experiments were reporting much lower accuracies. Fortunately, the fix was simple: manually check that the max value of the heatmap is zero, which is what you are seeing in those lines. That change was not meant to modify the network's predictions, but to correctly recover the ground truth values here.
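In code, the idea looks roughly like this (a minimal Lua/Torch sketch of my own, not the repo's actual implementation; the function name and the nJoints x h x w heatmap layout are assumptions):

local function getPredsWithMask(hm)
   -- hm: nJoints x h x w tensor of ground-truth heatmaps (assumed layout)
   local nJoints, w = hm:size(1), hm:size(3)
   local preds = torch.zeros(nJoints, 2)
   local valid = torch.ones(nJoints)
   for i = 1, nJoints do
      if hm[i]:max() == 0 then
         -- All-zero map: no annotation, or augmentation cropped the joint
         -- out. Checking max() == 0 is robust to whichever index torch.max
         -- returns; the old check for location (1,1) broke when that
         -- behavior changed.
         valid[i] = 0
      else
         -- Argmax over the flattened map, converted back to (x, y)
         local _, idx = hm[i]:view(-1):max(1)
         local j = idx[1]
         preds[i][1] = (j - 1) % w + 1              -- x (column)
         preds[i][2] = math.floor((j - 1) / w) + 1  -- y (row)
      end
   end
   return preds, valid
end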
Hope that clears it up!
@anewell I get it. It makes sense now. Thanks.
@anewell Would you mind telling me why adding 0.5 to all preds in validation would increase performance? Is there a systematic error in drawing the gaussian, or some other concern?
When drawing the gaussian, we transform a pixel location from the original image space to a much lower resolution value in the heatmap space. We round this value down to get an integer index that will serve as the center of the gaussian. So, for example, in our heatmap a wrist might appear at the location (24, 30), but the reality is that when doing that transformation the x value fell somewhere between 24.000 and 24.999. On average, the true x coordinate is in fact going to be closer to 24.5 (the center of the pixel) than 24.0 (the left edge of the pixel).
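As a tiny numeric illustration (the values and the 4x scale are my own example, assuming a 256x256 input and a 64x64 heatmap):

-- Illustrative numbers only: mapping an image x-coordinate into heatmap space.
local xImage = 98.7                 -- wrist x in the original image
local xHeatmap = xImage / 4         -- 24.675 in 64x64 heatmap space
local xIndex = math.floor(xHeatmap) -- 24: the gaussian's center pixel
-- Every true x in [24.0, 25.0) floors to 24, so the quantization error is
-- uniform on [0, 1) with mean 0.5. Reporting the pixel center rather than
-- its left edge halves the expected error.
local xPred = xIndex + 0.5          -- 24.5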
Hi, would you share the train.log and valid.log for reference? It could be a great help for monitoring the training process.
P.S. Did you train the model for only 100 epochs? The paper mentions that you drop the learning rate several times. If so, at which epochs did you drop the learning rate? Thank you.