lorenmt / mtan

The implementation of "End-to-End Multi-Task Learning with Attention" [CVPR 2019].
https://shikun.io/projects/multi-task-attention-network
MIT License

The evaluation computation may not be stable #28

Closed · minygd closed this issue 4 years ago

minygd commented 4 years ago

Hello, sorry to bother you, but I found that using different batch_size settings can produce different evaluation results. For example, the following are the metrics for SGNet-MTAN, where batch_size refers to test_batch_size. (RMSE is modified according to the equation.)

| Batch_size | RMSE | Abs_Rel | M-IOU | Pixel_ACC | MEAN | MED | <11.25 | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.7974 | 0.2550 | 0.1993 | 0.5536 | 30.9889 | 27.0419 | 0.1986 | 49s |
| 2 | 0.8138 | 0.2545 | 0.1993 | 0.5536 | 30.9630 | 26.9051 | 0.1988 | 47s |
| 4 | 0.8241 | 0.2536 | 0.1993 | 0.5532 | 30.9538 | 26.8244 | 0.1988 | 47s |
| 8 | 0.8321 | 0.2526 | 0.1993 | 0.5531 | 30.9388 | 26.7398 | 0.1989 | 46s |
lorenmt commented 4 years ago

Hi @minygd,

I believe this issue is possibly due to the BatchNorm parameters in the shared network oscillating between different tasks, and the sampling of the test dataset not being in a fixed order.

But from your table above, except for the RMSE, the rest of the metrics look quite stable to me.

As a sanity check, could you

  1. Turn off shuffle=True in the test dataloader and see whether it produces a more stable result? (A small sketch of what I mean follows this list.)
  2. Use Conditional BatchNorm in the shared network.
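
A minimal sketch for item 1, assuming the test split is wrapped in a standard torch.utils.data.DataLoader (the variable names and sizes here are placeholders, not the repo's code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# placeholder dataset standing in for the real test split
test_dataset = TensorDataset(torch.randn(16, 3, 288, 384))

# evaluate in a fixed, deterministic order: no shuffling at test time
test_loader = DataLoader(test_dataset, batch_size=2, shuffle=False)
```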

Here is my conditional BN code:

```python
import torch
import torch.nn as nn


class ConditionalBatchNorm2d(nn.Module):
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.num_features = num_features
        # normalise without learnable affine parameters; the scale and shift
        # are looked up from an embedding indexed by the class/task id
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.embed = nn.Embedding(num_classes, num_features * 2)
        self.embed.weight.data[:, :num_features].normal_(1, 0.02)  # initialise scale at N(1, 0.02)
        self.embed.weight.data[:, num_features:].zero_()  # initialise bias at 0

    def forward(self, x, y):
        out = self.bn(x)
        # per-class scale (gamma) and shift (beta)
        gamma, beta = self.embed(y).chunk(2, 1)
        out = gamma.view(-1, self.num_features, 1, 1) * out + beta.view(-1, self.num_features, 1, 1)
        return out
```
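
For example, a hypothetical usage of the class above (reusing torch and the class just defined; the feature size, number of tasks, and task ids are made-up values for illustration):

```python
cbn = ConditionalBatchNorm2d(num_features=64, num_classes=3)  # e.g. 3 tasks

x = torch.randn(8, 64, 32, 32)                   # a batch of shared feature maps
task_id = torch.full((8,), 1, dtype=torch.long)  # normalise every sample with task 1's parameters
out = cbn(x, task_id)                            # same shape as x: [8, 64, 32, 32]
```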

Let me know the result.

Thanks.

minygd commented 4 years ago

Hi @lorenmt, I think the issue may be caused by the different evaluation perspectives. A metric like RMSE should be aggregated sample by sample, but it is confusing that Abs_Rel also shows some difference. I changed the test DataLoader to shuffle=False, and the results reported above were obtained that way. Thanks for your advice; I will switch the model to Conditional BatchNorm and report the results as soon as possible. By the way, would you mind telling me why Conditional BatchNorm matters here? Thanks again for your kind and quick reply.

lorenmt commented 4 years ago

Hi,

I am not sure I understand your comment on the "one by one aggregation" for the RMSE metric; I thought we just compute the RMSE for each batch and then report the average error across batches? Also, could you confirm whether this issue comes from this model specifically, or whether it is a general problem in all PyTorch-based evaluations? Can you observe a similar phenomenon with a standard image classification model, say VGG-16 on CIFAR-100?

I am sorry, I just realised that it is not suitable to use Conditional BN as I suggested in my last comment. For dense prediction problems, we have multiple labels for each specific input, so there is nothing to condition on.

I misunderstood the setting as being like the Visual Decathlon dataset, where we are trying to solve multiple datasets in a single network; in that case, conditional BN is essential, since it conditions on the correct dataset to compute the corresponding normalisation parameters.

minygd commented 4 years ago

Hi,

Yes, you are right. Thank you for introducing conditional BN to me. As I mentioned above, with Batch_size = 1 the RMSE is calculated as $\mathrm{RMSE} = \sqrt{\frac{1}{HW} \sum_{i=1}^{HW} (\hat{y}_{i} - y_{i})^2}$, which is correct. But when we change the Batch_size, the equation becomes $\mathrm{RMSE} = \sqrt{\frac{1}{BHW} \sum_{i=1}^{BHW} (\hat{y}_{i} - y_{i})^2}$. That looks fine on its own, but when we then take the mean over all groups ($\mathrm{Group} = \mathrm{len}(\mathrm{TestData}) / \mathrm{Batch\_size}$), the discrepancy appears.

For example, when we set Batch_size = 2, the former is calculated as $\mathrm{RMSE} = \frac{1}{2\,\mathrm{Group}} \sum_{n=1}^{2\,\mathrm{Group}} \sqrt{\frac{1}{HW} \sum_{i=1}^{HW} (\hat{y}_{n,i} - y_{n,i})^2}$ and the latter as $\mathrm{RMSE} = \frac{1}{\mathrm{Group}} \sum_{g=1}^{\mathrm{Group}} \sqrt{\frac{1}{2HW} \sum_{i=1}^{2HW} (\hat{y}_{g,i} - y_{g,i})^2}$. From the equations we can see the difference: the former is $\sqrt{\frac{1}{2}}$ times the latter. Please let me know whether my derivation explains it. Thanks!
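
A toy numerical check of the two aggregation schemes (random tensors purely for illustration, not the repo's evaluation code):

```python
import torch

torch.manual_seed(0)
pred = torch.rand(2, 1, 4, 4)     # two depth predictions in one batch
target = torch.rand(2, 1, 4, 4)

sq_err = (pred - target) ** 2

# scheme A: RMSE per sample, then averaged over the 2 samples
rmse_per_sample = torch.sqrt(sq_err.mean(dim=[1, 2, 3])).mean()

# scheme B: a single RMSE over all 2*H*W pixels of the batch
rmse_per_batch = torch.sqrt(sq_err.mean())

print(rmse_per_sample.item(), rmse_per_batch.item())  # generally not equal
```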

lorenmt commented 4 years ago

Yes, thanks for the explanation. I think I understand the issue now. Since it is a mathematical problem, I believe an easy fix is to compute the average across H*W inside the root (a pixel-wise average per sample), and then average across the batch outside the root; the result should then be consistent for any batch size.
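
A minimal sketch of that fix, assuming depth maps of shape (B, 1, H, W) and ignoring any invalid-pixel masking done in the actual evaluation:

```python
import torch

def depth_rmse(pred, target):
    # pixel-wise mean inside the root (one RMSE per sample),
    # batch mean outside the root, so the value is batch-size independent
    per_sample = torch.sqrt(((pred - target) ** 2).mean(dim=[1, 2, 3]))
    return per_sample.mean()
```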

The rest of the metrics, for example Abs Rel, having small variations is fine to me. If you really want to dig deeper, you can confirm whether the prediction for the same data is identical under different batch sizes (see the sketch below); otherwise, it is probably just another numerical issue.
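
For instance, a self-contained toy check of batch-size independence in eval mode (a stand-in model, not the repo's SegNet-MTAN):

```python
import torch
import torch.nn as nn

# toy model standing in for the shared network, just to illustrate the comparison
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU())
model.eval()  # BatchNorm now uses running statistics, so batch size should not matter

x = torch.randn(8, 3, 16, 16)
with torch.no_grad():
    single = model(x[0:1])     # sample 0 evaluated on its own
    batched = model(x)[0:1]    # the same sample inside a batch of 8

print(torch.allclose(single, batched, atol=1e-6))  # expected: True
```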

Best, Sk.

minygd commented 4 years ago

Thanks, my advice for calculating a metric like RMSE is just as you mentioned. It was a nice discussion for me. Best Regards, H-X.