redotvideo / haven

LLM fine-tuning and eval
https://haven.run
Apache License 2.0

How many GPUs are needed for full training of 70B Llama 2? #81

Open alphanlp opened 1 year ago

alphanlp commented 1 year ago

How many GPUs are needed for full training of the 70B Llama 2 model?

uniAIDevs commented 8 months ago

Training or running inference on large language models like LLaMA 2 70B requires substantial computational resources, particularly in terms of GPU memory (VRAM). The LLaMA 2 70B model, with its 70 billion parameters, presents specific challenges due to its size. Here's a detailed breakdown of the requirements and considerations for running this model:

GPU Memory Requirements

The LLaMA 2 70B model, when fully loaded into memory, requires a significant amount of VRAM. A single fp16 parameter (the common data type used for such models) requires 2 bytes of memory. Therefore, loading the entire LLaMA 2 70B model requires approximately 140 GB of memory (70 billion parameters * 2 bytes)[2]. This far exceeds the capacity of any single consumer-grade GPU, and even of 80 GB data-center cards such as the A100 or H100.
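
As a rough illustration, the arithmetic behind these figures can be sketched in a few lines of Python. The per-parameter byte counts for full training are the standard mixed-precision Adam estimates, not measurements from this repo:

```python
# Back-of-the-envelope VRAM estimates for Llama 2 70B (illustrative only).
NUM_PARAMS = 70e9

def weights_gb(bits_per_param: float) -> float:
    """Memory needed just to hold the weights at a given precision."""
    return NUM_PARAMS * bits_per_param / 8 / 1e9

print(f"fp16 weights : {weights_gb(16):7.2f} GB")  # ~140 GB
print(f"4-bit weights: {weights_gb(4):7.2f} GB")   # ~35 GB
print(f"3-bit weights: {weights_gb(3):7.2f} GB")   # ~26.25 GB

# Full fine-tuning needs far more than the weights alone. With bf16 weights,
# bf16 gradients, and fp32 Adam state (master weights + two moments), a common
# estimate is ~16 bytes per parameter, before activations are even counted.
print(f"full training: {NUM_PARAMS * 16 / 1e9:7.2f} GB")  # ~1120 GB aggregate
```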

Quantization and Mixed-Precision Techniques

To fit the model into consumer-grade GPUs, quantization techniques can be applied to reduce the precision of the model's parameters, thus reducing the overall memory footprint. For instance, quantizing the model to 4-bit precision reduces the memory requirement to about 35 GB (70 billion * 0.5 bytes)[2], which is closer to, but still above, the capacity of high-end consumer GPUs like the NVIDIA RTX 3090 or 4090, each with 24 GB of VRAM.

Further reduction to 3-bit precision brings the requirement down to approximately 26.25 GB[2], which still doesn't fit into a single 24 GB consumer GPU but is more manageable across multiple GPUs or with certain memory offloading techniques.
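
For reference, here is a minimal sketch of loading the model in 4-bit across whatever GPUs are visible, assuming the Hugging Face transformers + bitsandbytes stack (this is not part of this repo's API, and the model name is the standard Hub checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # ~0.5 bytes per parameter
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/stability
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=quant_config,
    device_map="auto",  # shard layers across available GPUs, offload the rest
)
```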

Practical GPU Configurations

For practical purposes, running the LLaMA 2 70B model efficiently would require roughly the following (approximate figures; see the FSDP sketch after this list for the full-training case):

- Inference in fp16: at least ~140 GB of aggregate VRAM, e.g. 2× 80 GB A100/H100 GPUs, plus headroom for the KV cache.
- Inference with 4-bit quantization: ~35 GB plus overhead, e.g. 2× 24 GB consumer GPUs (RTX 3090/4090) or a single 48 GB card.
- Full fine-tuning (the original question): on the order of 1 TB of aggregate GPU memory for weights, gradients, and Adam optimizer state in mixed precision, which in practice means nodes of 8× A100/H100 80 GB with FSDP or ZeRO-3 sharding, often across multiple nodes[4][8], or fewer GPUs with aggressive CPU offloading at a large speed cost.
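
As a hypothetical illustration of the full fine-tuning case (not this repo's training code), a ZeRO-3-style setup with PyTorch FSDP, in the spirit of the Hugging Face FSDP write-up[4], might look roughly like this, launched with `torchrun --nproc_per_node 8` on each node:

```python
# Sketch of sharded full fine-tuning of Llama 2 70B with PyTorch FSDP.
# Assumes torch >= 2.0 and transformers; model name and hyperparameters are illustrative.
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Note: loading the full bf16 checkpoint on every rank needs a lot of host RAM;
# the cited FSDP blog post discusses loading on rank 0 only and broadcasting.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype=torch.bfloat16
)

# Shard parameters, gradients, and optimizer state across all ranks (ZeRO-3 style),
# wrapping each transformer block as its own FSDP unit.
model = FSDP(
    model,
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={LlamaDecoderLayer}
    ),
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# ... training loop with gradient checkpointing and a sharded data loader goes here ...
```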

Conclusion

Running the full LLaMA 2 70B model requires careful consideration of hardware capabilities and potentially significant investment in high-end GPUs or multiple GPU setups. Quantization and mixed-precision techniques are essential for reducing the model's memory footprint, but they also require expertise to implement effectively without significantly compromising model performance[2]. For those without access to such hardware, cloud-based solutions or model distillation techniques might offer more feasible alternatives.

Citations:
[1] https://news.ycombinator.com/item?id=37067933
[2] https://towardsdatascience.com/run-llama-2-70b-on-your-gpu-with-exllamav2-588141a88598
[3] https://www.reddit.com/r/LocalLLaMA/comments/184qfeg/chassis_only_has_space_for_1_gpu_llama_2_70b/
[4] https://huggingface.co/blog/ram-efficient-pytorch-fsdp
[5] https://www.hardware-corner.net/guides/computer-to-run-llama-ai-model/
[6] https://github.com/Lightning-AI/lit-gpt/issues/456
[7] https://www.interconnects.ai/p/llama-2-part-2
[8] https://hpc-ai.com/blog/70b-llama2-training