Open GarrettMerz opened 11 months ago
Hi! I'm running an enc-dec transformer with RoPE in the first self-attention layer of the encoder and decoder. I'm noticing that in the eval stage of my model, it hangs until my job times out after about 7 epochs; when running without this package, i.e. with standard learnable positional encoding or Shaw-style relative position encoding, I do not notice this behavior. Are there any obvious places in this package that might lead to a memory leak or similar (e.g. does register_buffer play nicely with torch.no_grad()?)
I'm wondering if this issue may be related: https://github.com/pytorch/pytorch/issues/20275
hey Garrett at Madison! beautiful city, still have fond memories of it (worked at Epic Systems for a year right out of college)
yup, i think i may have an idea of what's wrong, will throw in a fix in an hour
@GarrettMerz basically i'm incorrectly caching by the sequence length, but it should cache the longest sequence length and slice out any subsequent calls with shorter ones
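A minimal sketch of the caching pattern being described, not the library's actual code (FreqCache and its attribute names are made up for illustration): rebuild the freqs only when a longer sequence arrives, and slice the cached tensor for anything shorter.

```python
import torch
from torch import nn

class FreqCache(nn.Module):
    # illustrative only: cache freqs for the longest length seen so far,
    # and slice out what shorter calls need instead of recomputing
    def __init__(self, dim, theta=10000):
        super().__init__()
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)
        self.register_buffer('cached_freqs', torch.empty(0, dim // 2), persistent=False)

    def forward(self, seq_len):
        if seq_len > self.cached_freqs.shape[0]:
            # a longer sequence arrived: rebuild the cache at the new max length
            t = torch.arange(seq_len, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
            self.cached_freqs = torch.einsum('i,j->ij', t, self.inv_freq)
        # shorter (or equal) calls just slice the existing cache
        return self.cached_freqs[:seq_len]

cache = FreqCache(dim=64)
freqs = cache(seq_len=128)  # builds the cache at length 128
freqs = cache(seq_len=32)   # reuses the cached tensor and slices it
```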
@GarrettMerz want to give 0.5.0 a try and see if it still hangs?
Updating this with results: this largely seems to fix things. I still see hanging behavior in cases where the max output length is large and the model does not produce an EOS token before hitting the max length, i.e. if it gets stuck outputting something nonsensical like "++++++++++...", which can happen in early epochs. The length of this bad output is then cached, which slows RoPE down a lot. Capping the max output length at a reasonable size generally seems to mitigate this, which is a good enough fix for now.
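For anyone wanting to reproduce the mitigation above, a rough sketch assuming a typical greedy decode loop (greedy_decode, bos_id, eos_id and the model signature are all hypothetical, not part of this package):

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, bos_id, eos_id, max_output_len=256):
    # hypothetical decode loop: capping max_output_len keeps a degenerate output
    # (e.g. endless "+" tokens) from growing the cached rotary length unboundedly
    ys = torch.full((src.size(0), 1), bos_id, dtype=torch.long, device=src.device)
    for _ in range(max_output_len):
        logits = model(src, ys)                      # (batch, tgt_len, vocab)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=-1)
        if (next_tok == eos_id).all():               # every sequence hit EOS
            break
    return ys
```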
@GarrettMerz sounds good, as long as it does not hang anymore
best with your research and life out in the midwest
May need to reopen this- it seems that things are still hanging! I'll investigate more to figure out when specifically it happens; for now I'm going to use the relative position encoding in the encoder only (not the decoder) and see if that helps at all.
hmm, yea, i'll wait for more info from your end
you are the only one reporting this
@GarrettMerz could you try turning off cache altogether? https://github.com/lucidrains/rotary-embedding-torch/blob/main/rotary_embedding_torch/rotary_embedding_torch.py#L82 just to confirm that it is indeed caused by the freqs caching and not something on your end?
Hi, I was also encountering this issue with the latest commit (v0.5.3). On my end, I can confirm that this is caused by caching. Setting cache_if_possible=False worked for me.
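For anyone else hitting this, a hedged usage sketch of disabling the cache (the cache_if_possible flag is taken from this thread; check it against your installed version before relying on it):

```python
import torch
from rotary_embedding_torch import RotaryEmbedding

# disable freqs caching entirely to rule it out as the source of the hang
# (flag name taken from this thread; verify it exists in your version)
rotary_emb = RotaryEmbedding(dim=32, cache_if_possible=False)

q = torch.randn(1, 8, 1024, 64)  # (batch, heads, seq len, head dim)
k = torch.randn(1, 8, 1024, 64)
q = rotary_emb.rotate_queries_or_keys(q)
k = rotary_emb.rotate_queries_or_keys(k)
```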
I still seem to be encountering this issue in some cases- I'm investigating more to confirm it's not an implementation problem.
I'm seeing consistent hangs with 8-GPU accelerate-based DDP. It hangs on NCCL comms after the first step if I'm using 3+ GPUs. My sequence lengths are different both per step and per accelerator. No issue if the cache is disabled, or with 2 GPUs, or if I use a fixed sequence length.
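As a stopgap, the fixed-sequence-length workaround mentioned above can be approximated by padding every batch to a constant length so all ranks see identical shapes; a rough sketch, with all names hypothetical:

```python
import torch
import torch.nn.functional as F

def pad_to_fixed_len(token_ids, fixed_len=512, pad_id=0):
    # pad (or truncate) every batch to the same length on every rank,
    # so cached tensor shapes never diverge across processes
    token_ids = token_ids[:, :fixed_len]
    pad = fixed_len - token_ids.size(1)
    if pad > 0:
        token_ids = F.pad(token_ids, (0, pad), value=pad_id)
    return token_ids
```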
@lunixbochs I see! thank you for this info
I'll try standardizing the cache to the same tensor shape across devices and ping you to give it a try when it is done
the rope cache in fairseq seems to work fine: https://github.com/facebookresearch/fairseq/blob/920a548ca770fb1a951f7f4289b4d3a0c1bc226f/fairseq/modules/rotary_positional_embedding.py#L28
@lunixbochs sounds good, I'll take a look only if I can't figure it out
@lunixbochs want to try 0.8.2 and see if it resolves your issue?
just updated, not hanging for me anymore. thanks!
if you see anything wrong with the loss curve, let me know, made some risky changes