Closed: ad8e closed this 6 months ago
@ad8e actually, new papers have shown the bias on the projection to logits is not necessary at all
i should probably just remove it altogether and bump a minor version
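for concreteness, the change is just dropping the bias term on the final projection to logits. a minimal sketch (the names and sizes here are illustrative, not the library's actual code):

```python
import torch.nn as nn

dim, num_tokens = 512, 50257  # illustrative sizes

# with bias: logits = x @ W.T + b  (one learned scalar per vocab token)
to_logits_biased = nn.Linear(dim, num_tokens, bias=True)

# without bias: logits = x @ W.T
to_logits = nn.Linear(dim, num_tokens, bias=False)
```

the removed parameter is a single learned offset per vocabulary token, added to every logit regardless of input.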
Sounds fine to me; should be similar to removing bias on FF layers.
@ad8e FF is debatable actually
there are people on the google brain vision team who swear by it
but i think anywhere else, biases are not needed
@ad8e ok, i went ahead and removed a bunch of biases in 1.27.0
thanks for bringing this up. been meaning to get around to it
as they say, "perfection is when there is nothing left to remove"
@ad8e yeah i've seen one other paper that does this, but still waiting for a bigger model to adopt it before switching.
i understand the question of 'whether it scales' is kind of chicken and egg, but it is better to be cautious.
Got it; I deleted my comment on removing scaling on layernorm before receiving your message because I found an error in my setup.
@ad8e oh no problem, yea rerun your experiments and share your results
curious what you see on your end
i've also seen papers claiming multiplicative gamma helps (in the vision transformer literature), even though it can be fused with the next projection, so beats me whether to keep it or not
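for reference, the fusion works because gamma is a per-channel scale applied right before the next linear layer, so at inference it can be folded into that layer's weight (and beta into its bias). a toy sketch of the equivalence, with made-up sizes:

```python
import torch
import torch.nn as nn

dim = 64
ln = nn.LayerNorm(dim)              # learnable gamma (ln.weight), beta (ln.bias)
proj = nn.Linear(dim, dim, bias=False)

# randomize gamma/beta so the check is non-trivial (defaults are 1 and 0)
with torch.no_grad():
    ln.weight.uniform_(0.5, 1.5)
    ln.bias.uniform_(-0.1, 0.1)

x = torch.randn(2, dim)
out = proj(ln(x))

# fold: W @ (gamma * norm(x) + beta) == (W * gamma) @ norm(x) + W @ beta
fused = nn.Linear(dim, dim, bias=True)
with torch.no_grad():
    fused.weight.copy_(proj.weight * ln.weight)  # scale each input column by gamma
    fused.bias.copy_(proj.weight @ ln.bias)      # absorb beta into the bias
ln_plain = nn.LayerNorm(dim, elementwise_affine=False)

out_fused = fused(ln_plain(x))
assert torch.allclose(out, out_fused, atol=1e-5)
```

so keeping gamma buys no extra expressivity at inference; the question is only whether it helps optimization during training.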
Initial loss curve is slightly worse with layernorm scaling removed: https://wandb.ai/team-ad8e/tinystories2?workspace=user-ad8e. Later, the curves approach each other.
No difference in PPL between 1.27.0 and 1.26.0.
@ad8e nice
let your experiments run for a bit longer though; 1.5k steps is nothing
On my non-xtransformers char transformers, bias in to_logits provides a marginal but consistent boost. I assume this is because bytes have a wide range of probabilities, which isn't the case for BPE.
Still only 1.6k steps, comparing "from scratch" to "from scratch, bias on to_logits disabled": https://wandb.ai/team-ad8e/bisect?workspace=user-ad8e
I tried to reproduce x-transformers from scratch by matching architecture and init, but wasn't successful. My x-transformers model has a hump in its early loss curve that I couldn't reproduce with my other model. I eventually gave up and dumped my progress here, and I don't expect anyone to spend effort on it. In any case, that's why the above link has both non-xtransformers and x-transformers models. The hump is temporary; the loss curves equalize after step 400.
My model runs 10% faster than x-transformers on an Nvidia T4, maybe because it's less flexible.
A randomly initialized bias skews the initial output distribution, which serves no productive purpose.
Tested minor improvement in PPL with this change, but improvement was not apples-to-apples. Might also be noise.
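To make the skew concrete, a toy illustration (sizes and bias scale are made up): with all-zero pre-bias logits, a randomly initialized bias already pulls the softmax away from uniform before any training happens.

```python
import torch

torch.manual_seed(0)
num_tokens = 8
logits = torch.zeros(num_tokens)      # pretend the pre-bias logits are all zero

bias = torch.randn(num_tokens) * 0.1  # a small random bias init
probs_no_bias = torch.softmax(logits, dim=-1)
probs_with_bias = torch.softmax(logits + bias, dim=-1)

# without the bias, every token gets probability exactly 1/8;
# with it, the initial distribution is already skewed
```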