lzccccc / SMOKE

SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation

Questions about horizontal flipping in augmentation #13

Open adrigrillo opened 4 years ago

adrigrillo commented 4 years ago

Hello!

First, I want to thank you for releasing the code of the architecture and for its quality.

Over the last days, I have been going through the code and testing it with my own dataset (I created my own data loader). However, I have some doubts regarding the augmentation techniques, specifically with respect to the flipping of the images:

First, with a certain probability, the images are flipped horizontally, and the corresponding label data (e.g., the rotation) is modified in the process.

This modified data is saved as the label in the PyTorch dataset, as can be seen in `__getitem__(self, idx)`, along with a flag indicating whether the image has been flipped.

Then, when the network is calculating the loss during training, there is a method that decodes the rotation. This method uses the flipping flag to modify the rotation of the flipped values, in the inverse way of the data loader.

In summary, the data loader multiplies the rotation (of the unmodified data) by -1 and saves it as the label (which seems correct to me), but then the decoder applies the same transformation to the prediction, so the prediction is mapped back to the rotation of the unmodified data and is therefore incorrect with respect to the label.

As I see it, the network should not have to worry about resetting the rotation to the original value, since the data loader has already taken care of it. Am I wrong?

Second, with respect to the rotation transformation, I believe that multiplying by -1 is not correct in the case of horizontal flipping (it is correct for vertical flipping).

The reasoning is that, for a car with a rotation of 90 degrees (in KITTI, a vehicle facing forward), flipping the image horizontally does not change the rotation of the car, so it remains 90 degrees. In Python, the formula is:

```python
def flip_rotation(rotation):
    # Horizontal flip mirrors the heading about the vertical image axis:
    # rotation (degrees) -> 180 - rotation, kept within [-180, 180].
    if rotation < 0:
        rotation = -180 - rotation
    else:
        rotation = 180 - rotation
    return rotation
```
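For instance, a quick sanity check of the formula (using the flip_rotation helper above):

```python
# A horizontal flip leaves a forward-facing heading unchanged, while a
# 45-degree heading becomes 135 degrees.
assert flip_rotation(90) == 90     # 180 - 90
assert flip_rotation(-90) == -90   # -180 - (-90)
assert flip_rotation(45) == 135    # 180 - 45
```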

Thanks in advance.

lzccccc commented 4 years ago

Hi,

Thanks for your kind words and for the detailed description of the horizontal-flip problems. I will start with your second question.

I double-checked the rotation transformation in the dataloader. Indeed, the way I modified it under horizontal flip is incorrect. The formula you provided is the right one. This might be the reason why the training loss converges so slowly.

In the orientation decoding process, what we regress is the angle between the camera ray and the heading orientation (as can be seen in Fig. 4 of our paper). We then change it back to the KITTI form by ±pi/2 (https://github.com/lzccccc/SMOKE/blob/master/smoke/modeling/smoke_coder.py#L220). Since the rotation augmentation is incorrect, the orientation decoder also needs to be modified.
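For reference, the standard KITTI relation between the observation angle and the global yaw that this conversion relies on; a minimal sketch with an illustrative function name, not the repo's exact code:

```python
import numpy as np

def alpha_to_roty(alpha, x, z):
    # KITTI convention: global yaw = observation angle + ray angle of the
    # object center (x, z) in the camera frame.
    roty = alpha + np.arctan2(x, z)
    # wrap back into [-pi, pi)
    return (roty + np.pi) % (2 * np.pi) - np.pi
```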

Thanks for pointing the mistake out. I would really appreciate it if you could make a pull request to correct this. Otherwise, I will do it later, since I am occupied with other projects at the moment.

adrigrillo commented 4 years ago

> In the orientation decoding process, what we regress is the angle between the camera ray and the heading orientation (as can be seen in Fig. 4 of our paper). We then change it back to the KITTI form by ±pi/2 (https://github.com/lzccccc/SMOKE/blob/master/smoke/modeling/smoke_coder.py#L220). Since the rotation augmentation is incorrect, the orientation decoder also needs to be modified.

That I understand. The part I am referring to is the following, where the flipping mask is used (https://github.com/lzccccc/SMOKE/blob/master/smoke/modeling/smoke_coder.py#L236-L247).

I do not understand why the angles are being modified back to the original ones, since the transformed ones are saved in the targets.

In

https://github.com/lzccccc/SMOKE/blob/bc5d2bba66e2d66fa56b7b599d55457cb1a05b33/smoke/data/datasets/kitti.py#L156

the angle is modified, and in

https://github.com/lzccccc/SMOKE/blob/bc5d2bba66e2d66fa56b7b599d55457cb1a05b33/smoke/data/datasets/kitti.py#L181

it is saved in the targets.

So, for example, if the original angle (before flipping) is 45 degrees, the dataset loader saves the transformed value, 135°. The network will learn to predict 135°, as that is the heading visible in the flipped image. However, during decoding, we transform it back to 45° (https://github.com/lzccccc/SMOKE/blob/master/smoke/modeling/smoke_coder.py#L236-L247) because the mask says the object was flipped. Then, when it comes to calculating the loss, we use the decoded 45° (https://github.com/lzccccc/SMOKE/blob/bc5d2bba66e2d66fa56b7b599d55457cb1a05b33/smoke/modeling/heads/smoke_head/loss.py#L83-L87), but it is compared against the target, which holds 135° (https://github.com/lzccccc/SMOKE/blob/master/smoke/modeling/heads/smoke_head/loss.py#L130-L133).
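To make the double transformation concrete (a sketch using the flip_rotation helper from my first comment, which is its own inverse; the repo's decoder uses a different expression, but the concern is the same):

```python
stored_target = flip_rotation(45)       # the data loader writes 135 into the targets
decoded = flip_rotation(stored_target)  # undoing the flip yields 45 again
# the loss then compares the decoded 45 against the stored target of 135
```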

Is that correct, or am I missing something?

> Thanks for pointing the mistake out. I would really appreciate it if you could make a pull request to correct this. Otherwise, I will do it later, since I am occupied with other projects at the moment.

I will try to do it as soon as possible. I am quite busy, but I guess I will have some time this weekend.

lzccccc commented 4 years ago

> I do not understand why the angles are being modified back to the original ones, since the transformed ones are saved in the targets.

The angles are not modified back to the original ones; they are modified to the original KITTI format. What causes confusion is that we have three orientations here (let's not consider the horizontal flip for now):

  1. alpha in KITTI label file represents the observation angle.
  2. roty in KITTI label file represents the global orientation.
  3. the modified observation angle as described in Sec. 4.2 of the paper (Under Eqn. 4).

The camera coordinate frame defined in KITTI has positive x pointing right and positive z pointing forward. We define the object coordinate frame (the body frame of each object) to be the same as the camera frame. We then modify the observation angle by ±pi/2 to follow this coordinate design.

For example, you can take a look at 000001.txt and the corresponding image. The truck is facing the same direction as the reference car (the recording platform). However, both alpha and roty are near -pi/2, which does not match the previous definition.
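As a quick numeric check of that example (a sketch; the sign of the pi/2 shift is assumed here):

```python
import numpy as np

alpha = -np.pi / 2            # the truck's observation angle in 000001.txt
modified = alpha + np.pi / 2  # ~0: "facing forward" in the object frame above
```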

We force the network to predict the modified observation angle (3) and change it back to roty or alpha when needed.

To sum up, we have three angles. (1) and (2) are what we need; (3) is what the network predicts.

During training, the ground-truth roty (2) is given. The network predicts the modified observation angle (3). The orientation decoder transforms the modified observation angle (3) to roty (2) and then computes the loss.

During testing, the network predicts the modified observation angle (3). The orientation decoder transforms the modified observation angle (3) to both alpha (1) and roty (2) for prediction.
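Putting it together, the decode path looks roughly like this; a sketch under the description above, not the repo's exact code, and the direction of the pi/2 shift is an assumption:

```python
import numpy as np

def decode_three_angles(modified, x, z):
    # Map the network output, the modified observation angle (3), back to
    # the two KITTI quantities: alpha (1) and roty (2).
    alpha = modified - np.pi / 2      # (3) -> (1): undo the +/- pi/2 shift
    roty = alpha + np.arctan2(x, z)   # (1) -> (2): roty = alpha + ray angle
    # wrap roty back into [-pi, pi)
    roty = (roty + np.pi) % (2 * np.pi) - np.pi
    return alpha, roty
```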

adrigrillo commented 4 years ago

I understand; therefore, all these transformations are done in https://github.com/lzccccc/SMOKE/blob/master/smoke/modeling/smoke_coder.py#L212-L234. Is that correct?

However, my question concerns the lines https://github.com/lzccccc/SMOKE/blob/master/smoke/modeling/smoke_coder.py#L236-L245. This block is only executed during training, when we have the flip_mask.

What I do not understand is why the angle is changed by ±pi when the sample has been flipped (i.e., flip_mask == 1) and not for the samples that were not flipped.

lzccccc commented 4 years ago

The ±pi operation was found empirically to transform the modified observation angle (3) to roty (2) under the horizontal-flip condition.
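For readers following along, the linked block does roughly the following (paraphrased; variable names follow the repo, but treat the details as a sketch):

```python
import numpy as np
import torch

def apply_flip_correction(rotys, flip_mask):
    # For samples whose image was flipped (flip_mask == 1), shift the decoded
    # roty by -pi if positive and +pi if negative; unflipped samples pass
    # through unchanged.
    fm = flip_mask.flatten().float()
    rotys_flip = fm * rotys
    rotys_flip[rotys_flip > 0] -= np.pi
    rotys_flip[rotys_flip < 0] += np.pi
    return fm * rotys_flip + (1.0 - fm) * rotys
```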

ZhxJia commented 4 years ago

> I understand; therefore, all these transformations are done in https://github.com/lzccccc/SMOKE/blob/master/smoke/modeling/smoke_coder.py#L212-L234. Is that correct?
>
> However, my question concerns the lines https://github.com/lzccccc/SMOKE/blob/master/smoke/modeling/smoke_coder.py#L236-L245. This block is only executed during training, when we have the flip_mask.
>
> What I do not understand is why the angle is changed by ±pi when the sample has been flipped (i.e., flip_mask == 1) and not for the samples that were not flipped.

Hi @adrigrillo, your answers resolved many of my doubts, thanks. But I am facing the same problem: since the target rotation_y has already been flipped, why is rotation_y changed again at the training stage?

https://github.com/lzccccc/SMOKE/blob/bc5d2bba66e2d66fa56b7b599d55457cb1a05b33/smoke/modeling/smoke_coder.py#L242

Hoping for your answer, thanks.

adrigrillo commented 4 years ago

For my use case, I removed that flipping part from the training code and it worked perfectly. If you need to use the code later, I recommend doing the same.

ZhxJia commented 4 years ago

Thanks for your reply. I have one more question to consult you about. During the training stage, the training loss keeps decreasing, but the validation loss is basically unchanged after 30 epochs. Has this ever happened to you?

adrigrillo commented 4 years ago

Yes, that can happen; the validation loss may even start to increase. I have tried stopping the training and starting the process again from the last saved weights, restarting the optimizer.

However, you will not obtain big improvements once that point is reached.

jzstudent commented 2 years ago

Thanks for your reply. I have another question, about the regression mask. During training, it seems that the affine augmentation does not affect the final loss because of reg_mask; am I right?

```python
if self.reg_loss == "DisL1":
    reg_loss_ori = F.l1_loss(
        predict_boxes3d["ori"] * reg_mask,
        targets_regression * reg_mask,
        reduction="sum") / (self.loss_weight[1] * self.max_objs)

    reg_loss_dim = F.l1_loss(
        predict_boxes3d["dim"] * reg_mask,
        targets_regression * reg_mask,
        reduction="sum") / (self.loss_weight[1] * self.max_objs)

    reg_loss_loc = F.l1_loss(
        predict_boxes3d["loc"] * reg_mask,
        targets_regression * reg_mask,
        reduction="sum") / (self.loss_weight[1] * self.max_objs)
```
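On the reg_mask point: multiplying both the prediction and the target by a 0/1 mask zeroes the masked entries on both sides, so they contribute nothing to the L1 sum. A minimal, self-contained illustration with made-up values:

```python
import torch
import torch.nn.functional as F

pred   = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
target = torch.tensor([[1.5, 2.5], [9.0, 9.0]])
mask   = torch.tensor([[1.0, 1.0], [0.0, 0.0]])  # second object masked out

loss = F.l1_loss(pred * mask, target * mask, reduction="sum")
print(loss)  # tensor(1.) -- only the first object's error contributes
```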