Multi-GPU generation using `hf.generate` with `device_map='auto'` performs pipeline parallelism, placing different modules on different GPUs. For some reason, the rope embedding's `inv_freq` buffer does not get moved to the right device. In this PR we explicitly move it to the correct device. This should not slow down training, because during training this tensor movement is a no-op.

Related to this: https://github.com/mosaicml/llm-foundry/pull/1332
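A minimal sketch of the fix described above. The module and method names here are illustrative, not the actual llm-foundry code: the key line is the explicit `.to(x.device)` on `inv_freq`, which is a no-op when the buffer is already on the input's device (the single-device training case).

```python
import torch


class RotaryEmbedding(torch.nn.Module):
    """Illustrative RoPE module showing the explicit device move."""

    def __init__(self, dim: int, base: float = 10000.0):
        super().__init__()
        # inv_freq is a non-persistent buffer; under device_map='auto' it
        # may not land on the same GPU as the inputs flowing through it.
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def forward(self, x: torch.Tensor, seq_len: int) -> torch.Tensor:
        # Explicitly move inv_freq to the input's device. When the devices
        # already match (e.g. during training), this is a no-op.
        inv_freq = self.inv_freq.to(x.device)
        t = torch.arange(seq_len, device=x.device, dtype=inv_freq.dtype)
        freqs = torch.outer(t, inv_freq)
        return torch.cat((freqs, freqs), dim=-1)
```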