valeoai / SLidR

Official PyTorch implementation of "Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data"

The train loss fluctuates at 7 but does not decrease #4

Closed: ruomingzhai closed this issue 2 years ago

ruomingzhai commented 2 years ago

As the title says, I pretrained the model with the superpixel-driven InfoNCE loss, but neither the training nor the validation loss decreases; both fluctuate around 7 (e.g., 7.105, 7.25, 7.15). Is this normal for self-supervised learning? Has anyone encountered the same issue? I hope to get some clues from you, thanks!
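For reference, this is roughly the superpixel-level InfoNCE loss I am computing (my own simplified re-implementation, not the exact code of this repo; the temperature value is only illustrative):

```python
import torch
import torch.nn.functional as F

def superpixel_infonce(q, k, temperature=0.07):
    """InfoNCE over pooled superpoint/superpixel features.

    q: (N, D) pooled 3D features, one row per matched superpixel.
    k: (N, D) pooled 2D features in the same row order, so positives lie on the
       diagonal and every other row acts as a negative.
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature                     # (N, N) similarity matrix
    targets = torch.arange(q.shape[0], device=q.device)  # positive index = diagonal
    return F.cross_entropy(logits, targets)
```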

CSautier commented 2 years ago

Hi, that doesn't seem right: the loss should normally decrease rapidly to around 3.5 and then decrease further slowly, with the validation loss typically 1 point above the training loss. Can you specify exactly what you tried to run?

Corentin

ruomingzhai commented 2 years ago

> Hi, that doesn't seem right: the loss should normally decrease rapidly to around 3.5 and then decrease further slowly, with the validation loss typically 1 point above the training loss. Can you specify exactly what you tried to run?
>
> Corentin

Hi, CSautier! Thank you for patiently replying.

Actually, I applied your superpixel-driven self-supervised method to the ScanNet indoor dataset, but two details differ from your setup:

(1) For the 2D images, I use a pretrained encoder-only 2D network, followed by a linear layer and a batch-norm layer producing 64-dimensional features, but without a decoder. Meanwhile, the 3D points are processed by the Minkowski SR-UNet.

(2) Since most 3D points appear in more than one image in the ScanNet RGB-D dataset, I compute the loss for each image, i.e., the query and positive samples are stacked as [i, sp_num, sp_dim] matrices, and I accumulate all image-level losses for each scene (see the sketch after this list).
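Concretely, the per-image accumulation looks roughly like this (a simplified sketch; the names and shapes are placeholders for my actual code):

```python
import torch
import torch.nn.functional as F

def scene_loss(q_stack, k_stack, temperature=0.07):
    """q_stack, k_stack: (num_images, sp_num, sp_dim) per-image feature stacks,
    rows aligned so that the positive for row j of one stack is row j of the other."""
    loss = 0.0
    for q, k in zip(q_stack, k_stack):                    # one InfoNCE term per image
        q = F.normalize(q, dim=1)
        k = F.normalize(k, dim=1)
        logits = q @ k.t() / temperature                  # (sp_num, sp_num)
        targets = torch.arange(q.shape[0], device=q.device)
        loss = loss + F.cross_entropy(logits, targets)
    return loss                                           # accumulated over the scene's images
```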

Now my loss quickly decreases to around 4.9 and then decreases slowly, but it plateaus around 4.7. I am not sure this counts as effective training, so I hope I can get some constructive advice from you. For example, is it necessary to add a proper 2D decoder before the linear layer, or should the loss be computed at scene level for each scene?

CSautier commented 2 years ago

Hi, We found it important to have a sufficiently dense output feature map, since we average-pool the features per superpixel and superpixels can capture quite fine details. In practice we used dilated convolutions in the ResNet encoder to keep a high spatial resolution without requiring fine-tuning, together with an upsampling layer. The concurrent work PPKT, which has experiments on indoor datasets, reached a similar conclusion about the resolution, although they only used upsampling. At the opposite extreme, if you only have a single feature vector per image, your method resembles DepthContrast, which we found requires a much bigger batch size and is less well suited for semantic segmentation.
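For illustration, keeping a dense feature map with a dilated ResNet can look roughly like this (a sketch using torchvision's `replace_stride_with_dilation` option, not the exact code of this repo):

```python
import torch
import torchvision

# Replace the strides of the last two ResNet stages with dilation so the output
# feature map is at 1/8 of the input resolution instead of 1/32.
backbone = torchvision.models.resnet50(
    pretrained=True,  # or the `weights=` argument on recent torchvision versions
    replace_stride_with_dilation=[False, True, True],
)
encoder = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc

with torch.no_grad():
    feats = encoder(torch.randn(1, 3, 224, 416))
print(feats.shape)  # torch.Size([1, 2048, 28, 52]): dense enough to pool per superpixel
```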

As for averaging the loss per image instead of per scene, we have tried it as well on nuScenes; it worked a bit less well but still gave quite good results. It might indeed depend on the dataset.

ruomingzhai commented 2 years ago

> Hi, We found it important to have a sufficiently dense output feature map, since we average-pool the features per superpixel and superpixels can capture quite fine details. In practice we used dilated convolutions in the ResNet encoder to keep a high spatial resolution without requiring fine-tuning, together with an upsampling layer. The concurrent work PPKT, which has experiments on indoor datasets, reached a similar conclusion about the resolution, although they only used upsampling. At the opposite extreme, if you only have a single feature vector per image, your method resembles DepthContrast, which we found requires a much bigger batch size and is less well suited for semantic segmentation.
>
> As for averaging the loss per image instead of per scene, we have tried it as well on nuScenes; it worked a bit less well but still gave quite good results. It might indeed depend on the dataset.

Thanks. Do you mean that both the 3D and 2D networks use a dilated ResNet, or just the 3D one?

By the way, can the loss decrease to nearly 0.1? I cannot get it anywhere near zero with lr=0.1 or 0.01.

CSautier commented 2 years ago

I meant the 2D network. The 3D network is a UNet that outputs in the same coordinate system (same resolution) as its input. This loss cannot decrease to zero, especially if the encoder is frozen; this is expected and shouldn't be a problem.
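As a rough rule of thumb (a back-of-the-envelope interpretation, not something from the paper): the InfoNCE loss is a cross-entropy over N candidates, so as long as the similarities stay close to uniform it sits near ln(N). A loss stuck around 7 therefore suggests the features are still roughly uninformative when about a thousand superpixels are contrasted per step, and even a well-trained model plateaus well above zero.

```python
import math

# ln(N) for a few plausible numbers of contrasted superpixels per step;
# the values of N are illustrative, not taken from either codebase.
for n in (256, 1024, 4096):
    print(f"N = {n:4d}  ->  ln(N) = {math.log(n):.2f}")  # 5.55, 6.93, 8.32
```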