hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[FEATURE]: Better optimizer for LLM training #4964

Open fancyerii opened 10 months ago

fancyerii commented 10 months ago

Describe the feature

I want to continue pre-training Llama 2 70B on my own data, which is about 1B tokens. I have read Fine-tuning Llama 2 70B using PyTorch FSDP. In that blog, the authors say the minimum hardware requirement is a node with 8 A100 80GB GPUs, which is too costly for me. Is there any way to train with fewer resources? By resources I mainly mean GPU memory.

```
$ accelerate estimate-memory meta-llama/Llama-2-70b-hf

┌────────────────────────────────────────────────────────┐
│   Memory Usage for loading meta-llama/Llama-2-70b-hf   │
├───────┬─────────────┬──────────┬───────────────────────┤
│ dtype │Largest Layer│Total Size│  Training using Adam  │
├───────┼─────────────┼──────────┼───────────────────────┤
│float32│   3.19 GB   │256.29 GB │         1.0 TB        │
│float16│    1.6 GB   │128.15 GB │       512.59 GB       │
│  int8 │  817.02 MB  │ 64.07 GB │       256.29 GB       │
│  int4 │  408.51 MB  │ 32.04 GB │       128.15 GB       │
└───────┴─────────────┴──────────┴───────────────────────┘
```
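For context, here is a rough back-of-the-envelope calculation of where that training footprint comes from, in the spirit of the ZeRO paper's accounting for mixed-precision Adam (fp16 params and grads plus fp32 master weights and two fp32 moments, about 16 bytes per parameter). The only input is the 70B parameter count; it uses a slightly different accounting than accelerate's heuristic above, but the conclusion is the same: most of the memory is optimizer state.

```python
# Rough per-parameter memory accounting for mixed-precision Adam training,
# following the ZeRO paper's estimate: fp16 params + fp16 grads, plus fp32
# master weights and two fp32 Adam moments as optimizer state.
NUM_PARAMS = 70e9  # Llama 2 70B

bytes_fp16_params = 2 * NUM_PARAMS   # model weights in fp16
bytes_fp16_grads  = 2 * NUM_PARAMS   # gradients in fp16
bytes_fp32_master = 4 * NUM_PARAMS   # fp32 master copy of weights
bytes_adam_m      = 4 * NUM_PARAMS   # first moment (momentum)
bytes_adam_v      = 4 * NUM_PARAMS   # second moment (variance)

optimizer_state = bytes_fp32_master + bytes_adam_m + bytes_adam_v
total = bytes_fp16_params + bytes_fp16_grads + optimizer_state

GB = 1024 ** 3
print(f"fp16 params + grads:     {(bytes_fp16_params + bytes_fp16_grads) / GB:.0f} GB")
print(f"Adam optimizer state:    {optimizer_state / GB:.0f} GB")
print(f"total (no activations):  {total / GB:.0f} GB")
# ~261 GB for params+grads vs ~782 GB of Adam state, ~1 TB in total:
# the optimizer state, not the model itself, dominates.
```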

I have searched some papers and related methods.

  1. Better distributed algorithms. This includes 1) DeepSpeed ZeRO DP and PyTorch FSDP; 2) DeepSpeed ZeRO-Offload; 3) pipeline parallelism; 4) tensor parallelism. In terms of GPU memory, none of these can cut the requirement from 512 GB down to 256 GB except CPU offload, and offloading slows training down considerably (a plain-PyTorch FSDP offload sketch follows this list).

  2. Quantization. Many quantization methods have been proposed to speed up inference, but none can be used for pre-training or full fine-tuning. Some methods such as QLoRA can be used for fine-tuning with PEFT, but that's not my goal. It seems that training has to use at least 16-bit precision (mixed-precision training).

  3. Memory-efficient optimizers. Adam seems to be the standard optimizer for LLM training, for several reasons: 1) other people use it successfully, so I should do the same; 2) compared with SGD, it needs less hyperparameter tuning; 3) training an LLM is very time- (and money-) consuming, so nobody wants to risk trying other methods (which is essentially reason 1 again). But in the deep learning community (especially CV), many believe SGD may find better solutions. As the DeepSpeed ZeRO paper points out, most of the memory is used by optimizer states. Several memory-efficient adaptive optimizers have been proposed but have not gained much interest from the LLM community: the Lion optimizer, the LOMO optimizer, Adafactor (already implemented by PyTorch), and CAME: Confidence-guided Adaptive Memory Efficient Optimization (code here). An optimizer-swap sketch also follows this list.
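Regarding option 1, a minimal sketch of what sharding with CPU offload looks like in plain PyTorch FSDP. The auto-wrap policy and training loop are omitted, and a real run needs torchrun on a multi-GPU node; this only shows where the offload knob lives, not a recommended recipe.

```python
# Minimal FSDP sketch with parameter CPU offload (trades speed for GPU memory).
# Assumes launch via torchrun so the default process group can be initialized.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype=torch.bfloat16
)

fsdp_model = FSDP(
    model,
    cpu_offload=CPUOffload(offload_params=True),  # keep sharded params on CPU
    device_id=torch.cuda.current_device(),
    # auto_wrap_policy omitted for brevity; needed in practice for large models
)

optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-5)
```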
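Regarding option 3, swapping in a memory-efficient optimizer is mostly a one-line change at the single-GPU level. Below is a sketch using Adafactor from Hugging Face Transformers, which keeps factored second-moment statistics instead of full per-parameter tensors; the model name and hyperparameters are placeholders, not tuned settings. The open question is how well such optimizers compose with ZeRO/FSDP sharding.

```python
# Sketch: replacing AdamW with a memory-efficient optimizer.
# Adafactor stores factored row/column second-moment statistics, so its
# state is much smaller than AdamW's two full fp32 tensors per parameter.
import torch
from transformers import AutoModelForCausalLM, Adafactor

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # small model for illustration

# AdamW baseline (two fp32 states per parameter):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Adafactor with a fixed learning rate (i.e. without its built-in
# relative-step schedule):
optimizer = Adafactor(
    model.parameters(),
    lr=1e-5,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
```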

So in my opinion, replacing Adam/AdamW with a memory-efficient optimizer is a promising direction. Could ColossalAI integrate such an optimizer with FSDP/DeepSpeed? For a 70B model, even the parameters alone cannot fit on a single GPU, so we still need model parallelism or parameter sharding.
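For illustration only, a rough sketch of what the requested integration could look like with ColossalAI's Booster API: a memory-efficient optimizer (Adafactor here, purely as an example) passed into a ZeRO plugin. Whether the sharded optimizer wrapper handles factored states correctly is exactly the feature being requested, and the launch/plugin arguments may differ across ColossalAI versions.

```python
# Hypothetical sketch: a memory-efficient optimizer under ColossalAI's ZeRO
# plugin. Launch/plugin arguments may differ between ColossalAI versions,
# and the optimizer choice is only an example, not an endorsed setup.
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import LowLevelZeroPlugin
from transformers import AutoModelForCausalLM, Adafactor

colossalai.launch_from_torch(config={})

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype=torch.bfloat16
)
optimizer = Adafactor(model.parameters(), lr=1e-5,
                      scale_parameter=False, relative_step=False)

plugin = LowLevelZeroPlugin(stage=2, precision="bf16")  # ZeRO-2 style sharding
booster = Booster(plugin=plugin)
model, optimizer, _, _, _ = booster.boost(model, optimizer)
```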

flybird11111 commented 10 months ago

Thank you for your suggestion, it's beneficial for the development of ColossalAI.