Thanks for the great work, Lahav and team. Iterative refinement has been missing in stereo matching, and the multi-level correlation lookup volume is a great contribution.
I have been using this repo in my application with great success. However, training is not very stable: the model seems to suffer from a kind of mode collapse and predicts the same output for all inputs.
While checking loose ends, I came across the normalisation code: https://github.com/princeton-vl/RAFT-Stereo/blob/5c13878b617177da139cfeba79ac15b39b351963/train_stereo.py#L151

1. Why is batch norm frozen during training? Doesn't this defeat the purpose of adding batch norm in the first place?
2. In the paper, instance norm is used instead of batch norm for the context encoder; can you expand on this implementation detail? How will this affect the model when we use a shared encoder for speed-up?
The running mean and variance in batch norm can sometimes be unstable when used with small batch sizes (e.g. < 6). Freezing batch norm just fixes these to zero and one, respectively. The weight and bias terms are still learned.
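For reference, a minimal sketch (not the repo's exact code) of what "freezing" batch norm typically looks like in PyTorch:

```python
# Minimal sketch (assumed helper, not the repo's exact code): freezing batch norm
# means putting BatchNorm layers in eval mode so the running mean/variance stop
# updating (they stay at their initial values of 0 and 1), while the affine
# weight and bias remain ordinary trainable parameters.
import torch.nn as nn

def freeze_bn(model: nn.Module) -> None:
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()  # use fixed statistics; weight/bias still receive gradients

# typical usage, each time the model is switched to training mode:
# model.train()
# freeze_bn(model)
```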
Instance normalization in the feature encoder makes more sense than batch norm because we want to derive each set of correlation features from only a single image. Although the shared encoder does not use instance normalization for this task, we didn't observe a decrease in performance.
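To illustrate the single-image point with a toy example (not from the repo): instance norm computes statistics per image and per channel, so its output for one image does not depend on the rest of the batch, whereas batch norm in training mode shares statistics across the whole batch.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 64, 32, 32)      # a batch of 4 feature maps

inorm = nn.InstanceNorm2d(64)
bnorm = nn.BatchNorm2d(64)          # training mode by default

# Instance norm: the first image is normalised the same way alone or in a batch.
print(torch.allclose(inorm(x[:1]), inorm(x)[:1], atol=1e-6))   # True

# Batch norm: statistics are computed over the whole batch, so the same image
# comes out differently depending on its batch-mates.
print(torch.allclose(bnorm(x[:1]), bnorm(x)[:1], atol=1e-6))   # typically False
```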
I wouldn't expect the type of normalization in RAFT-Stereo to make or break training. If you could describe the training loss you observed, or detail what you changed from the provided source code, I might be able to help more.
Thanks for the tip, I'm using a small batch size and there are definitely some interactions I don't fully understand here.
I did a run with instance norm instead; it seems to be stable and converges much faster.
Thanks for the tip, closing this now.
@ppyht2 I saw model collapse too. I changed the batch size per GPU to 16 for speed. I am wondering whether the scale mask ratio should be set to 0.125 for raftstereo-realtime: https://github.com/princeton-vl/RAFT-Stereo/blob/f1fa15abd34187d101806f65b813f4d9d6f93ab0/core/update.py#L137 As you described, did you add more instance norm layers to the shared backbone (https://github.com/princeton-vl/RAFT-Stereo/blob/f1fa15abd34187d101806f65b813f4d9d6f93ab0/core/raft_stereo.py#L29)?
@zhujiagang Our bug was extremely silly in the end: we didn't realise the network already includes normalisation in its forward pass, so we were normalising twice in our custom dataloader, which caused numerical instability.
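For anyone who runs into the same thing, a hypothetical sketch of the pitfall (the transform below is illustrative, not our actual dataloader):

```python
# Illustrative only (not our actual dataloader): the network already rescales
# images in its forward pass, roughly
#     image = 2 * (image / 255.0) - 1.0
# so it expects raw [0, 255] inputs. Normalising again in the dataloader, e.g.
from torchvision import transforms

transform = transforms.Compose([
    transforms.ToTensor(),                             # already scales to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
# feeds tiny, roughly zero-mean values through a second rescaling,
# which is what broke our training.
```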
We didn't add more instance norm layers, but rather replaced batch norm with instance norm using the norm_fn argument.
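In case it's useful, a generic sketch of what a norm_fn-style argument typically selects inside the encoders (not the repo's exact code):

```python
# Generic sketch, not the repo's exact encoder code: the same backbone is built
# with whichever normalisation layer the norm_fn string asks for.
import torch.nn as nn

def make_norm(norm_fn: str, num_channels: int) -> nn.Module:
    if norm_fn == "batch":
        return nn.BatchNorm2d(num_channels)
    if norm_fn == "instance":
        return nn.InstanceNorm2d(num_channels)
    if norm_fn == "none":
        return nn.Identity()
    raise ValueError(f"unknown norm_fn: {norm_fn}")

# so "replacing batch norm with instance norm" is just constructing the shared
# encoder with norm_fn="instance" instead of norm_fn="batch".
```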
Hope this helps.
@ppyht2 Thank you very much. By adding more instance norm layers I meant exactly that: replacing batch norm with instance norm via the norm_fn argument for self.cnet.