huggingface / autotrain-advanced

🤗 AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0

[BUG] Remote computer (4 GPUs, 10 GiB of VRAM each) crashes the second I launch fine-tuning of Mistral 7B locally using AutoTrain #650

Closed hichambht32 closed 2 months ago

hichambht32 commented 4 months ago

Prerequisites

Backend

Local

Interface Used

CLI

CLI Command

autotrain --config config.yml

UI Screenshots & Parameters

This is my CLI output when the remote computer crashes: [screenshot: Capture d’écran (61)]

This is my config.yml file: [screenshot]
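Since the config.yml screenshot is not readable here, a hypothetical sketch of what an AutoTrain `llm-sft` config for this setup might look like (the model id, dataset path, and parameter values below are assumptions, not a copy of the user's file; check the autotrain-advanced docs for the current schema):

```yaml
# Hypothetical AutoTrain SFT config sketch for a Mistral 7B fine-tune.
# All values are assumptions for illustration only.
task: llm-sft
base_model: mistralai/Mistral-7B-v0.1   # assumed model id
project_name: mistral-7b-ft             # hypothetical project name
data:
  path: data/                           # assumed local dataset folder
  train_split: train
  column_mapping:
    text_column: text
params:
  block_size: 1024
  epochs: 1
  batch_size: 1                         # small batch, to fit 10 GiB GPUs
  lr: 2e-4
  peft: true                            # LoRA, as the user mentions
  quantization: int4                    # 4-bit loading, as the user mentions
  mixed_precision: fp16
```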

Error Logs

This is the error in Bitvise when the computer crashes: [screenshot: Capture d'écran 2024-05-10 100356]

Additional Information

Hello everyone, I'm a newbie to AutoTrain. I'm trying to fine-tune a Mistral 7B model locally using AutoTrain (since I have 4 GPUs with 10 GiB of VRAM each) on a remote computer, which I connect to via SSH using Bitvise. I was previously using Hugging Face's SFT Trainer and faced many CUDA memory issues; for that reason I decided to move to AutoTrain since it's easier.

N.B.: I have done everything I know to optimize (LoRA, 4-bit model loading, lower batch size, etc.), and I also tried DDP (distributed data parallel) before. I found a memory-usage estimate for fine-tuning this same model in BF16 precision which says the Adam peak is 56 GB of VRAM, which I don't have at the moment. Do you think renting some GPUs from RunPod or Hugging Face would fix the issue for me? Thank you for your guidance.
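The ~56 GB Adam figure cited above can be reproduced with back-of-envelope arithmetic (a sketch, assuming a round 7e9 parameter count for "Mistral 7B" and decimal gigabytes; activations and framework overhead are ignored):

```python
# Back-of-envelope VRAM estimate for FULL fine-tuning of a ~7B-parameter
# model in BF16 with Adam. The 7e9 parameter count is an assumption.
GB = 1e9          # decimal gigabytes
PARAMS = 7e9

weights_bf16 = PARAMS * 2 / GB      # BF16 weights: 2 bytes/param
grads_bf16 = PARAMS * 2 / GB        # BF16 gradients: 2 bytes/param
adam_states = PARAMS * 2 * 4 / GB   # Adam: two FP32 states (m, v) = 8 bytes/param

print(f"weights   ~{weights_bf16:.0f} GB")   # ~14 GB
print(f"gradients ~{grads_bf16:.0f} GB")     # ~14 GB
print(f"Adam      ~{adam_states:.0f} GB")    # ~56 GB, matching the cited peak
print(f"total     ~{weights_bf16 + grads_bf16 + adam_states:.0f} GB")  # ~84 GB
```

Against the roughly 40 GiB available across the four 10 GiB cards, full BF16 fine-tuning with Adam is clearly out of reach, which is why LoRA plus 4-bit loading (as the user already tried) or renting larger GPUs are the realistic options.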

abhishekkrthakur commented 4 months ago

are you running the autotrain --config command on the same computer or a remote pc?

hichambht32 commented 4 months ago

Hello sir, on a remote PC of course; I'm using the shell from Bitvise.

abhishekkrthakur commented 4 months ago

maybe your gpus are not supported. which gpu do you have? can you try with quantization: null?

hichambht32 commented 4 months ago

I couldn't run the nvidia-smi command since it's still crashing (I will need to restart it when I'm there), but I found a recent screenshot of the CPU and GPU on my remote computer: [screenshots: WhatsApp Image 2024-05-21 à 15 37 30_850f4462, Capture d'écran 2024-05-13 113751]

hichambht32 commented 4 months ago

I will try using quantization: null soon, although I doubt it will fix the error. Based on my experience, the minimum effective level of quantization is 4-bit. Setting it to null will not quantize the model at all.
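The point above can be made concrete with rough weight-memory arithmetic: with `quantization: null` the weights load in FP16/BF16, which alone already exceeds a single 10 GiB card (a sketch, assuming ~7e9 parameters and ignoring activations, KV cache, and overhead):

```python
# Rough weight-only memory for a ~7B-parameter model at different
# quantization levels. Parameter count is an assumption; activations
# and framework overhead are ignored.
PARAMS = 7e9
GIB = 1024**3  # binary gibibytes, matching "10 GiB of VRAM"

bytes_per_param = {
    "fp16/bf16 (quantization: null)": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

for name, b in bytes_per_param.items():
    gib = PARAMS * b / GIB
    fits = "fits" if gib < 10 else "does NOT fit"
    print(f"{name:32s} ~{gib:5.1f} GiB -> {fits} on one 10 GiB GPU")
```

So unquantized weights (~13 GiB) don't fit on one card, while int8 (~6.5 GiB) and int4 (~3.3 GiB) do, supporting the expectation that `quantization: null` would make the memory situation worse, not better.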

abhishekkrthakur commented 4 months ago

if you cant run nvidia-smi, then thats the possible cause. you will need to restart the remote machine :)

hichambht32 commented 4 months ago

No, you didn't get it. It's not that I can't run nvidia-smi; the whole SSH connection is broken. Once I restart the machine, everything works just fine and I can run whatever I wish (including nvidia-smi). The problem is that the second I launch the fine-tuning with the autotrain --config config.yml command, the machine crashes.

hichambht32 commented 4 months ago

Can you tell me whether my GPUs are supported? Maybe they are insufficient, right?

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 20 days since being marked as stale.