zjhthu opened this issue 5 years ago
@tobiasploetz
We also observed training instability (even without the N3 block) in the later stages of training on the St. Peters dataset.
Since accuracy on the validation set peaked early anyway (<150k iterations), we did not bother investigating this issue too deeply.
Hi @tobiasploetz ,
we conducted more experiments; the results are below. We observed that training failed after some iterations on both datasets (~150k for St. Peters, ~50k for Brown). On the brown_bm_3_05 dataset we got a good result, acc_qt_auc20_ours_ransac=0.5111 (0.5100 in the paper). However, on the St. Peters dataset we got acc_qt_auc20_ours_ransac=0.5263 (0.5740 in the paper). We used the default training configuration from https://github.com/vcg-uvic/learned-correspondence-release. Could you provide more details about your training? Do we need to run more than one time and select the best run on St. Peters, or is there anything else we need to modify?
Figure 1: training loss on St. Peters dataset
Figure 2: val_acc and test_acc on St. Peters dataset; test-set acc_qt_auc20_ours_ransac=0.5263
Figure 3: training loss on brown_bm_3_05 dataset
Figure 4: val_acc and test_acc on brown_bm_3_05 dataset; test-set acc_qt_auc20_ours_ransac=0.5111
Hi @sundw2014,
I will look into this shortly. For the time being, here are the training curves that we got on StPeters:
Fig. 1: training loss on St. Peters dataset
Fig. 2: val_acc and test_acc on St. Peters dataset
For us, training broke down roughly at iteration 250k.
Bests, Tobias
Hi @sundw2014,
just a quick update on this issue. I ran some experiments and here is what I found:
1) Running the code on CUDA 9 + GTX 1080 or Titan X works most of the time (I observed one training run that crashed after ~130k iterations; the other runs went fine and reached comparable numbers).
2) Running the code on CUDA 10 + RTX 2080 always failed after a varying number of epochs :(
So it seems to be an issue with the CUDA/GPU/cuDNN version being used. Can you provide specifics about your system?
Bests, Tobias
Hi @tobiasploetz ,
I am sorry for getting back to you so late. We ran the code with CUDA 9.2, a Tesla M40 24GB, and Python 3.5.4 (from Anaconda).
Best Regards
Hi @sundw2014,
I think I found the culprit that causes the unstable training. The original implementation of the classification loss contains this line.
classif_losses = -tf.log(tf.nn.sigmoid(c * self.logits))
This produces infs when the argument of the sigmoid is strongly negative: the sigmoid underflows to zero and the log returns -inf. Changing the above line to the numerically stable log_sigmoid function solves the numerical problems.
classif_losses = -tf.log_sigmoid(c * self.logits)
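For illustration, here is a minimal NumPy sketch of the underflow (illustrative only, not the repo's code; the names are made up):

```python
import numpy as np

# x stands for c * logits of one confidently misclassified correspondence.
x = np.float32(-1000.0)

# Naive form, as in the original loss: the sigmoid underflows to 0.0 in
# float32, so -log(0) evaluates to inf (and backprop gives non-finite gradients).
naive_loss = -np.log(1.0 / (1.0 + np.exp(-x)))   # inf, with overflow warnings

# Stable form: -log_sigmoid(x) = log(1 + exp(-x)), computed without overflow.
stable_loss = np.logaddexp(0.0, -x)              # 1000.0

print(naive_loss, stable_loss)
```

tf.log_sigmoid evaluates essentially this softplus form, which is why it stays finite.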
However, things are more complex: with this change the training now diverges pretty quickly. My current guess is that the numerically unstable implementation was implicitly filtering out some outlier predictions, since training batches with non-finite parameter gradients are skipped. With the numerically stable log-sigmoid implementation these batches are no longer skipped, and hence the outliers affect the parameter updates.
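To make that implicit filtering concrete, here is a hypothetical sketch at the training-loop level (the toy loss, variable names, and batches are made up; the repo may implement the skipping differently):

```python
import numpy as np
import tensorflow as tf  # TF 1.x API, as in the correspondence repo

# Toy stand-in for the classification loss: `margin` plays the role of
# c * logits and is fed from the host.
margin = tf.placeholder(tf.float32, shape=[None])
w = tf.get_variable("w", shape=[], initializer=tf.ones_initializer())
loss = tf.reduce_mean(-tf.log(tf.nn.sigmoid(w * margin)))   # the unstable form

optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars)
grad_tensors = [g for g, _ in grads_and_vars if g is not None]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batches = [np.array([2.0, 3.0], np.float32),       # well-behaved batch
               np.array([2.0, -1000.0], np.float32)]   # contains an extreme outlier
    for batch in batches:
        grads = sess.run(grad_tensors, feed_dict={margin: batch})
        if all(np.all(np.isfinite(g)) for g in grads):
            sess.run(train_op, feed_dict={margin: batch})    # normal parameter update
        else:
            # With the unstable loss this branch fires on outlier batches,
            # so they never influence the parameters.
            print("skipping batch with non-finite gradients")
```

Computing the gradients and the update in two session calls is wasteful, but it makes the effect explicit: under the unstable loss, outlier batches trip the non-finite check and are silently dropped, whereas with log_sigmoid they go through and can pull the parameters around.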
I am currently investigating this further :)
Bests, Tobias
Cool! Thanks for your help. I look forward to hearing from you again.
I find that training is unstable when using n3net in the correspondence experiments: the training loss increases suddenly and the validation accuracy drops at the same time. The model falls into a bad local minimum.
So, has anyone else encountered this problem? I used the default config for training.