Closed hichambht32 closed 2 months ago
Are you running the `autotrain --config` command on the same computer or on a remote PC?
Hello sir, it's a remote PC of course; I'm using the SSH shell from Bitvise.
Maybe your GPUs are not supported. Which GPU do you have? Can you try with `quantization: null`?
I couldn't run the nvidia-smi command since the machine is still crashing (I will need to restart it when I'm there), but I found a recent screenshot of the CPU and GPU on my remote computer:
I will try using `quantization: null` soon, although I doubt it will fix the error. In my experience, 4-bit is the minimum effective level of quantization; setting it to null will not quantize the model at all.
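For context, this is roughly where the setting lives in an AutoTrain LLM config file (a sketch only; key names and values around `quantization` are illustrative and should be checked against the AutoTrain docs for your version):

```yaml
task: llm-sft
base_model: mistralai/Mistral-7B-v0.1  # illustrative model id
params:
  peft: true
  quantization: int4   # 4-bit loading; set to null to disable quantization entirely
  batch_size: 1
  gradient_accumulation: 4
```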
If you can't run nvidia-smi, then that's the likely cause. You will need to restart the remote machine :)
No, you didn't get it. It's not that I can't run nvidia-smi: the whole SSH connection is broken, but once I restart the machine everything works just fine and I can run whatever I wish (including nvidia-smi). The problem is that the second I launch the fine-tuning with the `autotrain --config config.yml` command, the machine crashes.
Can you tell me whether my GPUs are supported? Or maybe they are just insufficient?
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 20 days since being marked as stale.
Prerequisites
Backend
Local
Interface Used
CLI
CLI Command
autotrain --config config.yml
UI Screenshots & Parameters
This is my CLI when the remote computer crashes:

This is my config.yml file:
Error Logs
This is the error in BitVise when the computer crashes
Additional Information
Hello everyone, I'm a newbie to AutoTrain. I'm trying to fine-tune a 7B-parameter Mistral model using AutoTrain locally (since I have 4 GPUs with 10 GiB of VRAM each) on a remote computer, which I connect to over SSH using Bitvise. I was previously using the Hugging Face SFT Trainer and faced many CUDA memory issues, which is why I decided to move to AutoTrain since it's easier.

N.B.: I have done everything I know to optimize (LoRA, 4-bit model loading, lower batch size, etc.), and I also tried DDP (Distributed Data Parallel) before. I found a memory-usage estimate for fine-tuning this same model in BF16 precision that says the Adam peak is 56 GB of VRAM, which I don't have at the moment. Do you think renting some GPUs from RunPod or Hugging Face would fix the issue for me? Thank you for your guidance.
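For intuition on the numbers above, here is a back-of-the-envelope sketch (order-of-magnitude only; real peak usage also depends on activations, sequence length, gradient accumulation, and framework overhead) of why full BF16 fine-tuning of a 7B model far exceeds 4x10 GiB, while a 4-bit-quantized base model fits comfortably:

```python
# Rough VRAM estimate for fine-tuning a 7B-parameter model.
# These figures cover only weights, gradients, and optimizer states;
# activations and CUDA overhead come on top.

GIB = 1024**3
params = 7e9

# Full fine-tuning in BF16 with Adam:
weights_bf16 = params * 2          # 2 bytes per parameter
grads_bf16 = params * 2            # 2 bytes per parameter
adam_states_fp32 = params * 2 * 4  # two FP32 moments, 4 bytes each

print(f"Adam states alone: ~{adam_states_fp32 / GIB:.0f} GiB")  # ~52 GiB (~56 GB)
full = weights_bf16 + grads_bf16 + adam_states_fp32
print(f"Weights + grads + Adam: ~{full / GIB:.0f} GiB")         # ~78 GiB

# 4-bit loading (e.g. QLoRA-style): frozen base weights at ~0.5 byte/param,
# with only the small LoRA adapters actually trained.
print(f"4-bit base weights: ~{params * 0.5 / GIB:.0f} GiB")     # ~3 GiB
```

So the 56 GB "Adam peak" matches the optimizer states alone, which is why full fine-tuning is out of reach on 4x10 GiB, whereas a 4-bit LoRA setup only has to fit the ~3 GiB base model plus adapters and activations.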