lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Finetuning of LLaMA does not work in any setting (mem, lora) #2117

Open sergsb opened 1 year ago

sergsb commented 1 year ago

I am trying to fine-tune the lmsys/vicuna-7b-v1.3 model. I have a server with 8 NVIDIA RTX A4500 GPUs (20 GB each), so about 160 GB of GPU memory in total.

When I try to train with mem, I get an OOM in the middle of training. I followed the steps described in the README, but it does not help much. That's strange, because 160 GB of memory should be enough.
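
For a rough sense of scale (my own back-of-the-envelope estimate, assuming the run is full-parameter fine-tuning with mixed-precision Adam, which keeps roughly 16 bytes of model state per parameter before activations):

```python
# Back-of-the-envelope memory estimate for full fine-tuning a 7B model with
# mixed-precision Adam: fp16 weights (2 B) + fp16 grads (2 B) + fp32 master
# weights (4 B) + fp32 momentum (4 B) + fp32 variance (4 B) = 16 B per parameter.
params = 7e9
bytes_per_param = 2 + 2 + 4 + 4 + 4
print(f"~{params * bytes_per_param / 1e9:.0f} GB for model states alone")  # ~112 GB
```

That would leave well under 50 GB across the 8 GPUs for activations, CUDA context, and communication buffers, which may be why the OOM appears mid-training rather than at startup.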

When I try to train LoRA with QLoRA and ZeRO-2, I get a different error: AssertionError: zero stage 2 requires an optimizer. Does anyone know how to fix it?

When I try to train LoRA with ZeRO-3, I get:


  File "/home/sergeys/miniconda3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2532, in all_gather_into_tensor
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: output tensor must have the same type as input tensor
    work = group._allgather_base(output_tensor, input_tensor)
           ^    ^    ^work = group._allgather_base(output_tensor, input_tensor)work = group._allgather_base(output_tensor, input_tensor)^

^^^^^^  ^  ^  ^   ^  ^  ^  ^  ^  ^  ^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^RuntimeError^^: ^^output tensor must have the same type as input tensor^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
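
If I read the error correctly, all_gather_into_tensor is complaining that the gather buffer and the local shard have different dtypes (e.g. bf16 shards being gathered into an fp16 buffer, which can happen when the training arguments and the DeepSpeed config disagree on mixed precision). A minimal illustration of the mismatch (the tensors below are made up, not taken from the actual run):

```python
import torch

# Illustrative only: a local shard and a gather buffer with mismatched dtypes.
# This is the combination that makes all_gather_into_tensor raise
# "output tensor must have the same type as input tensor".
input_tensor = torch.empty(4, dtype=torch.bfloat16)       # local shard (e.g. bf16)
output_tensor = torch.empty(4 * 8, dtype=torch.float16)   # gather buffer for 8 ranks (fp16)

print(input_tensor.dtype == output_tensor.dtype)  # False -> would trigger the RuntimeError
```
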
mvuthegoat commented 1 year ago

AssertionError: zero stage 2 requires an optimizer: you need transformers>=4.31.0 to fix this.
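
A quick way to check whether the installed version already meets that requirement (using packaging, which ships as a transformers dependency):

```python
import transformers
from packaging import version

# The ZeRO-2 "requires an optimizer" assertion is reported fixed in transformers >= 4.31.0.
ok = version.parse(transformers.__version__) >= version.parse("4.31.0")
print(transformers.__version__, "OK" if ok else "-> upgrade transformers")
```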

surak commented 9 months ago

What settings do you use for mem? I'm trying the same thing and hitting the same problem with 4x A100 40GB.