microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Multi node multi GPU sharding for inference / training Llama 405B #5814

Open avianion opened 1 month ago

avianion commented 1 month ago

Hello

We are trying to use DeepSpeed to load Llama 405B across 2 nodes of 8 × H100 SXM each.

We want to shard the model across all 16 GPUs, so that each GPU holds a ~50.6 GB shard, and then run inference and training with the model.

Please provide a guide on exactly how to do this, as we have been unable to figure it out.
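Not a maintainer, but a commonly used recipe for this is ZeRO stage 3, which shards parameters, gradients, and optimizer state across all ranks on all nodes. A minimal sketch follows; the hostnames, script name, and config values are placeholders, and this has not been verified on 405B specifically:

```shell
# Hedged sketch of a two-node DeepSpeed launch (8 GPUs per node, 16 ranks).
# Hostnames "node1"/"node2" and "train.py" are placeholders.

# hostfile -- one line per node, "slots" = GPUs on that node:
#   node1 slots=8
#   node2 slots=8

# ds_config.json fragment -- ZeRO-3 shards the model state across all 16 ranks:
#   {
#     "train_micro_batch_size_per_gpu": 1,
#     "bf16": { "enabled": true },
#     "zero_optimization": { "stage": 3 }
#   }

deepspeed --hostfile=hostfile --num_nodes=2 --num_gpus=8 \
    train.py --deepspeed --deepspeed_config ds_config.json
```

Passwordless SSH between the nodes and an identical environment (CUDA, NCCL, Python packages, checkpoint path) on both are prerequisites for the `deepspeed` launcher to spawn ranks on the second node.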

echo-yi commented 1 month ago

Can you shard the model across 8 GPUs in a single node? I'm having trouble even using a single node.
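For the single-node inference case, `deepspeed.init_inference` with tensor parallelism is the usual starting point. A minimal sketch, assuming a smaller Llama checkpoint as a stand-in (swap in whatever checkpoint you actually use); this needs 8 GPUs and is launched with `deepspeed --num_gpus 8 infer.py`:

```python
# Hedged sketch: shard one model across 8 GPUs on a single node using
# DeepSpeed tensor-parallel inference. The checkpoint name is a placeholder.
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# tp_size must match the number of GPUs the launcher starts:
#   deepspeed --num_gpus 8 infer.py
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 8},
    dtype=torch.bfloat16,
)

inputs = tokenizer("Hello", return_tensors="pt").to(engine.module.device)
outputs = engine.module.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```

Each of the 8 ranks runs this same script; DeepSpeed splits the weight matrices across them, so per-GPU memory drops to roughly model size / 8 plus activations.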

huichentt commented 1 month ago

Hi @avianion, have you solved this issue? I have the same issue with Llama 405B multi-node inference on an HPC cluster.