lucidrains / routing-transformer

Fully featured implementation of Routing Transformer
MIT License

Batch size 1 #12

Closed matthew-jurewicz closed 4 years ago

matthew-jurewicz commented 4 years ago

Can't find batch normalization in the code, so presumably this works just fine with batch size 1?

lucidrains commented 4 years ago

oh hey! Transformers don't use batch norm, they use layer norm, so any batch size is fine. I would aim for an effective batch size (batch size × gradient accumulation steps) of at least 32.
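For reference, a minimal PyTorch sketch of that gradient accumulation pattern with batch size 1. The tiny linear model and random data are stand-ins (not part of this repo or its RoutingTransformerLM API); only the accumulation loop is the point.

```python
import torch
import torch.nn as nn

# Placeholder model/data: swap in your own model and dataloader.
model = nn.Linear(16, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 32  # batch size 1 x 32 accumulation steps = effective batch of 32

model.train()
optimizer.zero_grad()
for step in range(320):
    x = torch.randn(1, 16)                     # micro-batch of size 1
    y = torch.randint(0, 4, (1,))
    loss = loss_fn(model(x), y) / accum_steps  # scale so accumulated grads average
    loss.backward()                            # gradients accumulate across iterations
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one optimizer update per 32 micro-batches
        optimizer.zero_grad()
```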

matthew-jurewicz commented 4 years ago

Oh wow, I never knew that!