bkmi opened this issue 8 months ago
@janfb wanted to see a side-by-side comparison of LayerNorm and BatchNorm before making the change.
To comment on that issue: with BatchNorm, one has to be very careful to always feed batches that represent the "modes" of the data in the same proportion as the original distribution. For example, for a binary classifier it is invalid to feed the positive and negative examples in separate batches during training, because it then becomes enough to identify the class of a single element in the batch to decide the class of the entire batch.
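To make the coupling concrete, here is a minimal PyTorch sketch (my own illustration, not code from this repository): the same input gets a different normalized value depending on which batch it is placed in, so an element's representation leaks information about its batch mates.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(1)
bn.train()  # training mode: normalization uses the statistics of the current batch

positives = torch.randn(64, 1) + 3.0  # toy "positive" class
negatives = torch.randn(64, 1) - 3.0  # toy "negative" class
x = torch.tensor([[0.0]])             # one and the same query element

out_with_pos = bn(torch.cat([x, positives]))[0]
out_with_neg = bn(torch.cat([x, negatives]))[0]

# Identical input, different outputs: the normalized value of x depends on the
# class composition of its batch, which is exactly the leakage described above.
print(out_with_pos, out_with_neg)
```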
**Is your feature request related to a problem? Please describe.**
BatchNorm breaks the i.i.d. assumption across elements of a batch. That assumption is at the core of all the objective functions we use in this library.
**Describe the solution you'd like**
We should switch to LayerNorm or GroupNorm wherever possible.
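For contrast, a small sketch (again just an illustration, not library code) of why LayerNorm avoids the problem: it normalizes over the feature dimension of each example independently, so an example's output does not depend on the rest of the batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(8)

x = torch.randn(1, 8)                  # one example with 8 features
batch_a = torch.randn(15, 8)           # arbitrary batch mates
batch_b = torch.randn(15, 8) + 10.0    # very different batch mates

out_a = ln(torch.cat([x, batch_a]))[0]
out_b = ln(torch.cat([x, batch_b]))[0]

# The normalized value of x is identical regardless of batch composition.
assert torch.allclose(out_a, out_b)
```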
**Describe alternatives you've considered**
Give people a choice, but it's a niche issue and, to be honest, I see no reason to use BatchNorm other than legacy concerns.
**Additional context**
Typically, classifiers expose the option in this format: `use_batch_norm: bool = False`. We should simply change that to `use_layer_norm`.
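As a rough sketch of what the change could look like (the builder function, its signature, and the layer sizes below are hypothetical and only illustrate the proposed flag, not the library's actual API):

```python
import torch.nn as nn

def build_classifier(
    input_dim: int,
    hidden_dim: int = 50,
    use_layer_norm: bool = False,  # replaces the previous use_batch_norm flag
) -> nn.Sequential:
    layers = [nn.Linear(input_dim, hidden_dim)]
    if use_layer_norm:
        # LayerNorm acts per example, so it does not couple batch elements
        # the way nn.BatchNorm1d does.
        layers.append(nn.LayerNorm(hidden_dim))
    layers += [nn.ReLU(), nn.Linear(hidden_dim, 1)]
    return nn.Sequential(*layers)
```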