Hi @Neo9061, thanks for reporting this issue and providing the detailed context. I'm a maintainer of bitsandbytes and have a few thoughts:
Regarding issue 1)
From the code you provided, you don't seem to be using any parallel training approach such as FSDP. Is that right? In that case you cannot use the combined memory of all GPUs, so running OOM would be expected.
You mentioned that it "should run". To qualify this with a concrete estimate, I did the following back-of-the-envelope math:
Memory consists of gradients, parameters, and optimizer state (all parameter-dependent) plus activation memory. With gradient checkpointing, activation memory peaks at the memory needed for one transformer block, since that is the granularity of checkpointing. With LoRA, gradients and optimizer state amount to roughly 1% of the parameter size, and parameters in 4-bit precision take just half a byte each.
For parameter-related memory:
(gradient+optimizer)*0.01 + (parameters) = (p*4 bytes + p*8 bytes)*0.01 + p*0.5 bytes
Activation memory is more complex, but an approximation with a good implementation is:
seqlen*hidden_size*2bytes*batch_size*8
With a suboptimal implementation, the upper bound is approximately:
seqlen*hidden_size*2bytes*batch_size*16
Calculating the parameter-related memory for a 300 billion parameter model:
(300 billion * 4 bytes + 300 billion * 8 bytes) * 0.01 + 300 billion * 0.5 bytes = 186 GB or 173.2 GiB
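To make this easier to play with, here is a small Python sketch of the same arithmetic (the 1% LoRA overhead and the activation factors are the rough assumptions from above, not measured values):

```python
# Back-of-the-envelope QLoRA memory estimate, mirroring the formulas above.
# Assumptions: fp32 gradients (4 bytes), Adam-style optimizer state (8 bytes),
# LoRA trains ~1% of the parameters, base weights stored in 4-bit (0.5 bytes).

def parameter_memory_gb(num_params: float) -> float:
    grads_and_optimizer = (num_params * 4 + num_params * 8) * 0.01
    quantized_weights = num_params * 0.5
    return (grads_and_optimizer + quantized_weights) / 1e9

def activation_memory_gb(seqlen: int, hidden_size: int, batch_size: int,
                         factor: int = 16) -> float:
    # factor ~8 for a good implementation, ~16 as a rough upper bound
    return seqlen * hidden_size * 2 * batch_size * factor / 1e9

print(parameter_memory_gb(300e9))           # ~186 GB for a ~300B model
print(activation_memory_gb(4096, 8192, 1))  # example values only
```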
However, 186 GB doesn't yet account for the activations: I'm not certain what values to plug into the activation calculation, but with 186 GB used out of the 192 GB combined capacity of your 8-GPU AWS instance (g5.48xlarge), this leaves very little wiggle room. Could you please find out the missing values in your training configuration so we can refine the estimate from there?
Optimizer states and gradients can also be kept in lower precision, so it would be important to take that into account (i.e. either confirm the precision assumed in our calculation or use lower precision to save memory in one of your tests).
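For instance, if you're going through the Hugging Face Trainer as in the blog post, switching to a paged 8-bit optimizer from bitsandbytes is one way to shrink the optimizer state (a sketch only; all other arguments are placeholders):

```python
from transformers import TrainingArguments

# Sketch: bitsandbytes' paged 8-bit AdamW keeps optimizer state in 8-bit,
# instead of the ~8 bytes per trainable parameter assumed in the estimate above.
training_args = TrainingArguments(
    output_dir="out",               # placeholder
    per_device_train_batch_size=1,  # placeholder
    gradient_checkpointing=True,
    bf16=True,
    optim="paged_adamw_8bit",
)
```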
On a side note, not necessarily related, but I want to mention it anyway: we have observed an issue with QLoRA and long seqlen leading to higher-than-expected memory usage when using BNB quantization, and we suspect a memory leak. However, I don't see the seqlen specified in your minimal reproducer code. Could you please confirm what sequence length you are using?
I investigated this briefly after we became aware, during the FSDP + QLoRA release, that there might be a memory leak leading to excessive memory consumption for high seqlen, but I couldn't easily reproduce the problem. Due to limited resources and other high-priority items, we temporarily deprioritized further investigation. Our assumption was that this affects only very few users, but we would like to know if it's perceived as a blocker so we can prioritize accordingly.
We have a new engineer, @Matthew Douglas, joining the BNB team in July. Once he's on board, we plan to reassess the importance and urgency of this issue. It would be helpful if you could look into this a bit and, if you think it's a blocker, ping us again.
Regarding issue 2)
I think this is more for @philschmid and the others to answer, as nothing immediately catches my eye.
Hi @Titus-von-Koeller, thanks for answering my questions!
To follow up, my first issue is an OOM during the model loading stage, not the model fine-tuning stage. I followed the blog post https://www.philschmid.de/sagemaker-train-deploy-llama3, which initializes FSDP, and here is the training script.
Q1. Based on this line https://github.com/huggingface/transformers/blob/dd4654eab7593be34294dc16279f52e4efa8869e/src/transformers/modeling_utils.py#L4217-L4232, I get the impression that if the model is quantized, it is loaded onto GPU rather than into CPU via low_cpu_mem_usage=True (is that true? It also seems to use a single GPU, rank 0, to load the entire quantized model).
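To check this empirically, one option is to inspect where the weights actually land after loading (a sketch with a placeholder model id, not my exact script):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "some-org/some-large-model"  # placeholder, not the real checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # try with and without device_map to compare placements
)

# Which device each module ended up on, and per-GPU memory usage:
print(model.hf_device_map)
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.memory_allocated(i) / 1e9, "GB")
```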
On the other hand, using @philschmid's blog post and code, I am able to load and train two models:
Neither g5.12xlarge (4 GPUs with 96 GB in total, and 192 GB CPU) nor g5.16xlarge (1 GPU with 24 GB, and 256 GB CPU) has enough GPU memory on a single GPU to load the model, so I suspect you are offloading to CPU memory rather than using a single GPU (rank 0) to load the quantized models. But then it does not explain why the Grok-1 model fails at the loading stage with 4-bit quantization.
Q2. Is the memory you computed for model loading or for model fine-tuning? My understanding is that the memory for activations, optimizer states, and gradients is not required at the model loading stage by the .from_pretrained method. Is my understanding correct?
Q3. For the activation formula seqlen*hidden_size*2bytes*batch_size*16: if I have a seqlen of 1024, a hidden_size of 8000, and a batch size of 1, then the total memory is 262144000 / 10^9 = 0.26 GB, which is negligible as long as we don't use a large batch size. Is my understanding correct?
cc @matthewdouglas who has taken over the lead on this task
Up! Are there any suggestions on this issue?
@Neo9061 @thepowerfuldeez See the PR on #32276. The observation here is that weights would be offloaded to CPU memory for all ranks instead of just one (e.g. 8x CPU memory requirement on the g5.48xlarge and p4d.24xlarge mentioned in the original issue). This usage goes back down after the model is loaded, so a temporary workaround could be to create additional swap space on local NVMe storage.
In addition to this, I'm testing out some further changes to enable the usage of prequantized checkpoints with FSDP+QLoRA.
Since the PR was merged but then reverted, @matthewdouglas, is there another PR we can follow for this feature?
New PR: #33154
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
The new PR was merged, closing.
System Info
My goal is to follow the distributed fine-tuning blog post with FSDP to test distributed fine-tuning on a larger model like the ~300B Grok-1.
Context: I have tried g5.48xlarge (8 GPUs with 192 GB GPU memory, and 768 GB CPU memory) and p4d.24xlarge (8 GPUs with 320 GB GPU memory, and 1152 GB CPU memory). The two issues are listed below.
Transformers version:
transformers==4.40.0
Issue 1
When I tried to load the model with 4-bit quantization using the code below (WITHOUT FSDP, purely on an EC2 instance of g5.48xlarge), the total GPU memory required should be around 150 GB (since the model is the ~300B Grok-1), which is smaller than the 192 GB of GPU memory on g5.48xlarge, but I hit OOM. If I turn on low_cpu_mem_usage=True, then the model can be successfully loaded into CPU memory on the g5.48xlarge instance. The same error happens on p4d.24xlarge, where loading with 4-bit quantization also fails.
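For reference, the CPU-loading path that works for me looks roughly like this (a sketch with a placeholder model id, not my exact script):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "some-org/grok-1-checkpoint"  # placeholder

# With low_cpu_mem_usage=True and no device_map / quantization_config,
# the weights are materialized in CPU RAM only; no GPU memory is touched yet.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
```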
Issue 2
Continuing from point 1, I think I have found a path forward by loading the model into CPU memory with low_cpu_mem_usage=True. Following the blog post above, I started a SageMaker training job and tried to load this model using the default qlora_fsdp script shown in the blog. Furthermore, I disabled quantization (since quantization loads the model onto GPUs, which failed in point 1). When FSDP is enabled, low_cpu_mem_usage=True is used by default according to this line. However, I hit a timeout even after I modified the training argument ddp_timeout to be 10800. The model checkpoints are loaded twice and fail at the second load.
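For completeness, this is roughly how I raised the timeout (a sketch with placeholder values; it still timed out):

```python
from transformers import TrainingArguments

# Sketch: ddp_timeout sets the torch.distributed process-group timeout in
# seconds (default 1800). Raising it to 10800 did not resolve the hang for me.
training_args = TrainingArguments(
    output_dir="out",               # placeholder
    per_device_train_batch_size=1,  # placeholder
    gradient_checkpointing=True,
    bf16=True,
    fsdp="full_shard auto_wrap offload",
    ddp_timeout=10800,
)
```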
Who can help?
@philschmid @SunMarc @lewtun @sgugger @ArthurZucker @pacman100
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Same as above
Expected behavior
Should be no OOM