karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.
MIT License
34.49k stars 5.31k forks

Running train.py on 2060 GPU #8

Open lzeladam opened 1 year ago

lzeladam commented 1 year ago

Hello! I've been trying to run train.py on a 2060 GPU, but this device does not support dtype=torch.bfloat16. What changes would I have to make to achieve my goal? Or can I only train on an Ampere-architecture GPU for now? Thank you very much for sharing this project!

karpathy commented 1 year ago

Two options:
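For reference, a minimal sketch of the fallback the rest of the thread converges on: use bfloat16 only when the GPU supports it, and drop to float16 otherwise. This assumes PyTorch is installed; the variable name matches the `dtype` setting in train.py:

```python
import torch

# bfloat16 needs Ampere or newer; pre-Ampere cards like the RTX 2060
# only support float16, so fall back when bf16 isn't available
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = 'bfloat16'
else:
    dtype = 'float16'  # pair this with torch.cuda.amp.GradScaler during training

print(dtype)
```

Note that float16 training, unlike bfloat16, generally needs a gradient scaler to avoid underflow.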

lzeladam commented 1 year ago

Hi @karpathy,

Thank you for your help. I made the change, and now I have a problem with CUDA detection in my WSL environment:

debug_wrapper raised RuntimeError: CUDA: Error- no device

I don't know why, because the GPU is detected by the nvidia-smi command:

(screenshot: nvidia-smi output showing the GPU)

So I'll try to solve it.
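When nvidia-smi sees the GPU but PyTorch doesn't, a first step is to check what PyTorch itself reports; a small diagnostic sketch (assumes PyTorch is installed):

```python
import torch

# If torch.version.cuda is None, a CPU-only build of PyTorch is installed;
# if it's set but is_available() is False, the CUDA driver isn't visible
# from inside WSL.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
```

A CPU-only wheel is a common cause of this error under WSL, since it fails only at runtime when a CUDA device is requested.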

jcherrera commented 1 year ago

What are the min requirements to run nanoGPT?

lzeladam commented 1 year ago

@jcherrera try changing these parameters: batch_size from 12 to 16, and block_size from 1024 to 512.

Note: this project doesn't work on Windows because PyTorch 2.0 currently only supports Linux. Another alternative is to pay for an A10 or A100 instance on lambdalabs.com ...maybe I could do a post 🤔
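To make those overrides concrete: nanoGPT reads plain-Python config files, so the suggested values could go in a small file (hypothetical name `config/train_2060.py`) passed as the first argument to train.py:

```python
# hypothetical config/train_2060.py -- overrides for a smaller GPU
# run as: python train.py config/train_2060.py
batch_size = 16   # suggested above (the default in train.py is 12)
block_size = 512  # halve the context length from the default 1024
```

The same keys can also be passed as `--batch_size=16 --block_size=512` flags on the command line.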

adammarples commented 1 year ago

@jcherrera set

compile = False # use PyTorch 2.0 to compile the model to be faster

in train.py
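Alternatively, the same override works without editing train.py, since the configurator accepts config files and key=value flags; a sketch (the variable name matches the one in train.py above):

```python
# hypothetical config/no_compile.py
# run as: python train.py config/no_compile.py
# (or pass directly on the command line: python train.py --compile=False)
compile = False  # skip torch.compile; sidesteps the PyTorch 2.0 requirement
```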

jorahn commented 1 year ago

To add one data point: I'm running an unmodified python train.py with --batch_size=8 on ~22 GB of VRAM.