hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Export process requires more VRAM than the actual finetuning #3131

Closed: hvico closed this issue 7 months ago

hvico commented 7 months ago


Reproduction

accelerate launch --config_file ./single_config_export.yaml --use_fsdp src/export_model.py \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --adapter_name_or_path mixtral \
    --template default \
    --finetuning_type lora \
    --export_dir mixtral \
    --export_size 2 \
    --export_legacy_format False \
    --quantization_bit 4

Expected behavior

Being able to export a model with the same VRAM I used to finetune it.

System Info

Linux

Others

Hi!

I finetuned Mixtral 8x7B with LoRA (4-bit quantization) on 3 x RTX 3090, using the accelerate CLI with --use_fsdp so the trainer distributes the shards across the GPUs.

But now I am having a similar problem to the one reported in this issue: it seems the export process requires more VRAM than the training. Is it really like that? It seems to load more data onto GPU 0 than onto the others, and I get an OOM at around shard 15 of 19. I also tried CPU offloading since I have 128 GB of RAM, but in that case I get this error at the end of the process (after the offload folder reaches the full model size, 96 GB):

https://stackoverflow.com/questions/77547377/notimplementederror-cannot-copy-out-of-meta-tensor-no-data

Is there any way to work around this, or is it expected that exporting the model requires more VRAM than the actual finetuning? Is there any other way to export it?

Thanks!

hiyouga commented 7 months ago

Do not use accelerate or FSDP when exporting the model; refer to this example: https://github.com/hiyouga/LLaMA-Factory/blob/main/examples/merge_lora/merge.sh
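For reference, a minimal sketch of that invocation, reusing the paths and flags from the reproduction command above (plain Python launcher, no accelerate/FSDP). The linked merge.sh is the authoritative template; CUDA_VISIBLE_DEVICES and the export directory name here are placeholders, not values taken from this thread:

CUDA_VISIBLE_DEVICES=0 python src/export_model.py \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --adapter_name_or_path mixtral \
    --template default \
    --finetuning_type lora \
    --export_dir mixtral_merged \
    --export_size 2 \
    --export_legacy_format False

--quantization_bit 4 from the original command is omitted here to mirror the linked merge.sh, which merges the adapter into an unquantized base model.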

hvico commented 7 months ago

Hi. Thanks for your response.

I tried running without accelerate on CPU, as in that example. I get the error from Stack Overflow that I was referring to, after it offloads the whole model to the offload folder:

File "/bigdata/LLaMA-Factory/src/llmtuner/model/loader.py", line 148, in load_model_and_tokenizer model = load_model(tokenizer, model_args, finetuning_args, is_trainable, add_valuehead) File "/bigdata/LLaMA-Factory/src/llmtuner/model/loader.py", line 93, in load_model model = init_adapter(model, model_args, finetuning_args, is_trainable) File "/bigdata/LLaMA-Factory/src/llmtuner/model/adapter.py", line 110, in init_adapter model: "LoraModel" = PeftModel.from_pretrained( File "/home/.virtualenvs/finetuning-mixtral/lib/python3.10/site-packages/peft/peft_model.py", line 356, in from_pretrained model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs) File "/home/.virtualenvs/finetuning-mixtral/lib/python3.10/site-packages/peft/peft_model.py", line 760, in load_adapter dispatch_model( File "/home/.virtualenvs/finetuning-mixtral/lib/python3.10/site-packages/accelerate/big_modeling.py", line 384, in dispatch_model offload_state_dict(offload_dir, disk_state_dict) File "/home/.virtualenvs/finetuning-mixtral/lib/python3.10/site-packages/accelerate/utils/offload.py", line 98, in offload_state_dict index = offload_weight(parameter, name, save_dir, index=index) File "/home/.virtualenvs/finetuning-mixtral/lib/python3.10/site-packages/accelerate/utils/offload.py", line 32, in offload_weight array = weight.cpu().numpy() NotImplementedError: Cannot copy out of meta tensor; no data!

If I use CUDA with the 3 x 3090 I get an OOM.

hvico commented 7 months ago

OK, I got the merge working by renting a dual-A6000 host and using CUDA.

CPU export is not working as expected, though, so I am not able to export my finetunes from my local setup.

hiyouga commented 7 months ago

Yes, the finetuned model can be exported using two A6000s with CUDA by directly using the Python launcher.

hvico commented 7 months ago

OK, thanks. It would be great to have CPU export fixed at some point, as the 3 x 3090 setup is more than sufficient for finetuning (I can even use large batches), but the error I reported prevents doing the export on the same machine using the CPU (and it seems CUDA export is not possible because of the OOM).

Thanks for your nice work on this tool and your quick responses.

Regards,

hiyouga commented 7 months ago

You can try the --low_cpu_mem_usage False option to make the model exportable on the CPU.
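For example, the CPU export attempt from earlier in this thread might then look roughly like this (a sketch only; forcing CPU by hiding the GPUs with an empty CUDA_VISIBLE_DEVICES is an assumption, not something stated in this thread, and the export directory name is a placeholder as in the earlier sketch):

CUDA_VISIBLE_DEVICES= python src/export_model.py \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --adapter_name_or_path mixtral \
    --template default \
    --finetuning_type lora \
    --export_dir mixtral_merged \
    --export_size 2 \
    --export_legacy_format False \
    --low_cpu_mem_usage False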