Multi-GPU generation using `hf.generate` with `device_map='auto'` performs pipeline parallelism, placing different modules on different GPUs. For some reason, the rope embedding's `inv_freq` buffer does not get moved to the right device. In this PR we explicitly move it to the correct device. This should not slow down training, because during training this tensor movement is a no-op.

Related to this: https://github.com/mosaicml/llm-foundry/pull/1332
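A minimal sketch of the fix described above. The module and method names here are illustrative, not the actual llm-foundry code: the key line is the explicit `.to(x.device)` on `inv_freq`, which is a no-op when the buffer is already on the input's device (the single-device training case).

```python
import torch


class RotaryEmbedding(torch.nn.Module):
    """Illustrative RoPE module showing the explicit device move."""

    def __init__(self, dim: int, base: float = 10000.0):
        super().__init__()
        # inv_freq is a non-persistent buffer; under device_map='auto' it
        # may not land on the same GPU as the inputs flowing through it.
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def forward(self, x: torch.Tensor, seq_len: int) -> torch.Tensor:
        # Explicitly move inv_freq to the input's device. When the devices
        # already match (e.g. during training), this is a no-op.
        inv_freq = self.inv_freq.to(x.device)
        t = torch.arange(seq_len, device=x.device, dtype=inv_freq.dtype)
        freqs = torch.outer(t, inv_freq)
        return torch.cat((freqs, freqs), dim=-1)
```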