lucidrains / x-transformers

A simple but complete full-attention transformer with a set of promising experimental features from various papers
MIT License

Init bias=0 in to_logits #220

Closed · ad8e closed this 6 months ago

ad8e commented 6 months ago

This bias is randomly initialized, so it skews the initial output distribution without serving any productive purpose.

I measured a minor improvement in PPL with this change, but the comparison was not apples-to-apples, so it might also be noise.
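
For reference, a minimal sketch of the proposed change, assuming a plain PyTorch linear head (the names below are illustrative, not the actual x-transformers internals):

```python
import torch.nn as nn

dim, num_tokens = 512, 256  # placeholder sizes

# hypothetical projection from the model dimension to vocabulary logits
to_logits = nn.Linear(dim, num_tokens)

# proposed change: zero the bias at init so it no longer skews the
# initial output distribution (PyTorch draws it uniformly from
# [-1/sqrt(dim), 1/sqrt(dim)] by default)
nn.init.zeros_(to_logits.bias)
```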

lucidrains commented 6 months ago

@ad8e actually, new papers have shown that the bias on the logits projection is not necessary at all

i should probably just remove it altogether and bump a minor version

ad8e commented 6 months ago

Sounds fine to me; should be similar to removing bias on FF layers.
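
A sketch of what that would look like for a feedforward block, assuming the usual two-layer MLP shape (illustrative, not the library's actual FeedForward module):

```python
import torch.nn as nn

def feedforward(dim, mult=4, bias=False):
    # standard transformer FF block; bias=False drops the additive
    # biases on both projections, mirroring the to_logits change
    inner_dim = dim * mult
    return nn.Sequential(
        nn.Linear(dim, inner_dim, bias=bias),
        nn.GELU(),
        nn.Linear(inner_dim, dim, bias=bias),
    )
```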

lucidrains commented 6 months ago

@ad8e FF is debatable actually

there are people on the google brain vision team who swear by it

lucidrains commented 6 months ago

but i think anywhere else, biases are not needed

lucidrains commented 6 months ago

@ad8e ok, i went ahead and removed a bunch of biases in 1.27.0

thanks for bringing this up. been meaning to get around to it

lucidrains commented 6 months ago

as they say, "perfection is when there is nothing left to remove"

lucidrains commented 6 months ago

@ad8e yeah i've seen one other paper that does this, but still waiting for a bigger model to adopt it before switching.

i understand the question of 'whether it scales' is kind of a chicken-and-egg problem, but it's better to be cautious.

ad8e commented 6 months ago

Got it; I deleted my comment on removing scaling on layernorm before receiving your message because I found an error in my setup.

lucidrains commented 6 months ago

@ad8e oh no problem, yeah rerun your experiments and share your results

curious what you see on your end

lucidrains commented 6 months ago

i've also seen papers (vision transformers literature) that claim the multiplicative gamma helps, even though it can be fused with the next projection, so beats me whether to keep it or not
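
To make the fusion point concrete, a sketch assuming a pre-norm layer followed by a bias-free linear projection (illustrative only):

```python
import torch
import torch.nn as nn

dim = 512

# layernorm without the learned multiplicative gamma (or beta)
norm = nn.LayerNorm(dim, elementwise_affine=False)

# if gamma were kept, it could be absorbed into the next projection,
# since W @ (gamma * x) == (W * gamma) @ x when gamma broadcasts
# over the input dimension
gamma = torch.ones(dim)                 # the learned scale in the affine variant
proj = nn.Linear(dim, dim, bias=False)
with torch.no_grad():
    proj.weight.mul_(gamma)             # fold gamma into the weight columns
```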

ad8e commented 6 months ago

Initial loss curve is slightly worse with layernorm scaling removed: https://wandb.ai/team-ad8e/tinystories2?workspace=user-ad8e. Later, the curves approach each other.

No difference in PPL between 1.27.0 and 1.26.0.

lucidrains commented 6 months ago

@ad8e nice

let your experiments run for a bit longer though; 1.5k steps is nothing

ad8e commented 6 months ago

On my own character-level transformers (not x-transformers), a bias on to_logits provides a marginal but consistent boost. I assume this is because bytes have a wide range of probabilities, which isn't the case for BPE tokens.
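
One way to picture that (my assumption, purely illustrative): a logit bias can absorb a skewed unigram distribution, e.g. by initializing it to the log token frequencies of the corpus:

```python
import torch
import torch.nn as nn

num_tokens, dim = 256, 512              # byte-level vocab, placeholder width

# hypothetical corpus byte counts; replace with real statistics
token_counts = torch.ones(num_tokens)
log_freq = torch.log(token_counts / token_counts.sum())

to_logits = nn.Linear(dim, num_tokens)
with torch.no_grad():
    to_logits.bias.copy_(log_freq)      # bias now encodes the unigram prior
```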

Still only 1.6k steps, comparing "from scratch" to "from scratch, bias on to_logits disabled": https://wandb.ai/team-ad8e/bisect?workspace=user-ad8e

I tried to reproduce x-transformers from scratch by matching architecture and init, but wasn't successful. My x-transformers model has a hump in its early loss curve that I couldn't reproduce with my other model. I eventually gave up and dumped my progress here, and I don't expect anyone to spend effort on it. In any case, that's why the above link has both non-xtransformers and x-transformers models. The hump is temporary; the loss curves equalize after step 400.

My model runs 10% faster than x-transformers on an Nvidia T4, maybe because it's less flexible.