zjhthu opened this issue 5 years ago
@tobiasploetz
We also observed training instability (even without the N3 block) in the later stages of training on the St. Peters dataset.
Since accuracy on the validation set peaked early anyway (<150k iterations), we did not bother investigating this issue too deeply.
Hi @tobiasploetz ,
we conducted more experiments; the results are below. We observed that training failed after some iterations on both datasets (~150k for St. Peters, ~50k for Brown). On the brown_bm_3_05 dataset we got a good result, acc_qt_auc20_ours_ransac=0.5111 (0.5100 in the paper). However, on the St. Peters dataset we got acc_qt_auc20_ours_ransac=0.5263 (0.5740 in the paper). We used the default training configuration from https://github.com/vcg-uvic/learned-correspondence-release. Could you provide more details about your training? Do we need to run more than one time and select the best run on St. Peters, or is there anything else we need to modify?
Figure 1: training loss on St. Peters dataset
Figure 2: val_acc and test_acc on St. Peters dataset; test-set acc_qt_auc20_ours_ransac=0.5263
Figure 3: training loss on brown_bm_3_05 dataset
Figure 4: val_acc and test_acc on brown_bm_3_05 dataset; test-set acc_qt_auc20_ours_ransac=0.5111
Hi @sundw2014,
I will look into this shortly. For the time being, here are the training curves that we got on StPeters:
Fig. 1: training loss on St. Peters dataset
Fig. 2: val_acc and test_acc on St. Peters dataset
For us, training broke down roughly at iteration 250k.
Bests, Tobias
Hi @sundw2014,
just a quick update on this issue. I ran some experiments and here is what I found:
1) Running the code on CUDA 9 + GTX 1080 or Titan X works most of the time (I observed one training run that crashed after ~130k iterations; the other runs went fine and reached comparable numbers).
2) Running the code on CUDA 10 + RTX 2080 always failed after a varying number of epochs :(
So it seems to be an issue with the CUDA/GPU/cuDNN version being used. Can you provide specifics about your system?
Bests, Tobias
Hi @tobiasploetz ,
I am sorry for getting back to you so late. We ran the code with CUDA 9.2, a Tesla M40 24GB, and Python 3.5.4 (from Anaconda).
Best Regards
Hi @sundw2014,
I think I found the culprit that causes the unstable training. The original implementation of the classification loss contains this line.
classif_losses = -tf.log(tf.nn.sigmoid(c * self.logits))
This produces infs when the argument of the sigmoid is strongly negative: the sigmoid underflows to zero and the log returns -inf. Changing the above line to the numerically stable log_sigmoid function solves the numerical problems.
classif_losses = -tf.log_sigmoid(c * self.logits)
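For illustration, here is a minimal NumPy sketch of the underflow (illustrative only, not the repo's code; the names are made up):

```python
import numpy as np

# x stands for c * logits of one confidently misclassified correspondence.
x = np.float32(-1000.0)

# Naive form, as in the original loss: the sigmoid underflows to 0.0 in
# float32, so -log(0) evaluates to inf (and backprop gives non-finite gradients).
naive_loss = -np.log(1.0 / (1.0 + np.exp(-x)))   # inf, with overflow warnings

# Stable form: -log_sigmoid(x) = log(1 + exp(-x)), computed without overflow.
stable_loss = np.logaddexp(0.0, -x)              # 1000.0

print(naive_loss, stable_loss)
```

tf.log_sigmoid evaluates essentially this softplus form, which is why it stays finite.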
However, things are more complex: with this change the training now diverges pretty quickly. My current guess is that the numerically unstable implementation was implicitly filtering out some outlier predictions, since training batches with non-finite parameter gradients are skipped. With the numerically stable log-sigmoid implementation these batches are no longer skipped, and hence the outliers affect the parameter updates.
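To make that implicit filtering concrete, here is a hypothetical sketch at the training-loop level (the toy loss, variable names, and batches are made up; the repo may implement the skipping differently):

```python
import numpy as np
import tensorflow as tf  # TF 1.x API, as in the correspondence repo

# Toy stand-in for the classification loss: `margin` plays the role of
# c * logits and is fed from the host.
margin = tf.placeholder(tf.float32, shape=[None])
w = tf.get_variable("w", shape=[], initializer=tf.ones_initializer())
loss = tf.reduce_mean(-tf.log(tf.nn.sigmoid(w * margin)))   # the unstable form

optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars)
grad_tensors = [g for g, _ in grads_and_vars if g is not None]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batches = [np.array([2.0, 3.0], np.float32),       # well-behaved batch
               np.array([2.0, -1000.0], np.float32)]   # contains an extreme outlier
    for batch in batches:
        grads = sess.run(grad_tensors, feed_dict={margin: batch})
        if all(np.all(np.isfinite(g)) for g in grads):
            sess.run(train_op, feed_dict={margin: batch})    # normal parameter update
        else:
            # With the unstable loss this branch fires on outlier batches,
            # so they never influence the parameters.
            print("skipping batch with non-finite gradients")
```

Computing the gradients and the update in two session calls is wasteful, but it makes the effect explicit: under the unstable loss, outlier batches trip the non-finite check and are silently dropped, whereas with log_sigmoid they go through and can pull the parameters around.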
I am currently investigating this further :)
Bests, Tobias
Cool! Thanks for your help. I look forward to hearing from you again.
I find that training is unstable when using n3net in the correspondence experiments: the training loss increases suddenly and the validation accuracy drops at the same time. The model falls into a bad local minimum.
So, has anyone else encountered this problem? I used the default config for training.