I am trying to fine-tune the lmsys/vicuna-7b-v1.3 model. I have a server with 8 NVIDIA RTX A4500 GPUs (20 GB each), so about 160 GB of GPU memory in total.
When I try to train with mem, I get an OOM error in the middle of training. I followed the steps described in the README, but it does not help much. That's strange, because 160 GB of memory should be enough.
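For context, here is my rough back-of-envelope memory arithmetic for full fine-tuning a 7B model with Adam in standard mixed precision (fp16 weights and grads, plus fp32 master weights and two fp32 Adam moments). It suggests that even 160 GB leaves little headroom once activations and fragmentation are counted, so the OOM may not be surprising:

```python
# Back-of-envelope: per-parameter bytes for mixed-precision Adam.
# fp16 weights (2) + fp16 grads (2) + fp32 master copy (4)
# + fp32 momentum (4) + fp32 variance (4) = 16 bytes/param.
params = 7e9
bytes_per_param = 2 + 2 + 4 + 4 + 4
total_gb = params * bytes_per_param / 1e9
print(f"{total_gb:.0f} GB for model + optimizer state, before activations")
```

That is 112 GB before any activations, which is why sharding the optimizer state (ZeRO) or using LoRA matters here.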
When I try to train LoRA with QLoRA and ZeRO-2, I get a different error: AssertionError: zero stage 2 requires an optimizer. Does anyone know how to fix it?
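As I understand the DeepSpeed docs, ZeRO stage 2 shards optimizer state, so an optimizer has to be declared either in the config or passed to deepspeed.initialize; the assertion fires when neither is present. A sketch of what I believe a valid ZeRO-2 config looks like (the values here are placeholders, not my actual config):

```python
import json

# Hypothetical minimal ZeRO-2 DeepSpeed config sketch.
# Stage 2 partitions optimizer state across ranks, so the "optimizer"
# section (or a client optimizer given to deepspeed.initialize) must exist.
ds_config = {
    "zero_optimization": {"stage": 2},
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 2e-5, "weight_decay": 0.0},
    },
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
}
print(json.dumps(ds_config, indent=2))
```

So my guess is that the assertion means my config reached DeepSpeed without any optimizer section at all.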
When I try to train LoRA with ZeRO-3, I get:
File "/home/sergeys/miniconda3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2532, in all_gather_into_tensor
    work = group._allgather_base(output_tensor, input_tensor)
RuntimeError: output tensor must have the same type as input tensor
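I suspect the ZeRO-3 error means some parameters end up in a different dtype than the all-gather buffer (e.g. a 4-bit/fp16 QLoRA model combined with a bf16 DeepSpeed config). A small diagnostic I wrote to check whether my model's parameter dtypes are uniform before handing it to DeepSpeed (the toy model below is deliberately mixed to show the failure mode, not my actual model):

```python
import torch

def param_dtypes(model: torch.nn.Module) -> set:
    """Return the set of parameter dtypes in a model. More than one
    entry is a red flag: ZeRO-3's gathered/flattened buffers assume a
    single dtype, which can surface as the 'same type' RuntimeError."""
    return {p.dtype for p in model.parameters()}

# Toy model with intentionally mixed precision: fp32 + fp16 layers.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 4),          # fp32 by default
    torch.nn.Linear(4, 4).half(),   # cast to fp16
)
print(param_dtypes(model))
```

If this prints more than one dtype for the real model, casting everything to a single dtype before deepspeed.initialize would be my first thing to try.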