Training multiple models on a single GPU

sheffier commented 2 years ago

Hi,

Can you give a code example of how to utilize FFCV for training multiple models on a single GPU?

GuillaumeLeclerc commented 2 years ago

Hello,

We don't have an open-source snippet at the moment, but it is really straightforward:

1. Start a single Python script.
2. In this script, create multiple threads (not processes!).
3. Each thread creates its own loader and trains its own model.
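
A minimal sketch of the steps above, using only the standard library. It is not taken from the FFCV codebase: `train_one_model`, `train_in_threads`, and the `make_loader` / `make_model` / `train_step` callables are hypothetical names chosen for illustration. In a real run, `make_loader` would presumably build an FFCV loader and `train_step` would run one optimization step on the GPU.

```python
import threading


def train_one_model(name, make_loader, make_model, train_step, epochs, results):
    """Body of one training thread (hypothetical helper, not an FFCV API)."""
    # Each thread builds its own loader and model; neither is shared.
    loader = make_loader()
    model = make_model()
    steps = 0
    for _ in range(epochs):
        for batch in loader:
            train_step(model, batch)
            steps += 1
    # Each thread writes to its own key, so no extra locking is needed here.
    results[name] = steps


def train_in_threads(jobs, epochs=1):
    """Run every job in its own thread inside a single Python process.

    `jobs` maps a job name to a (make_loader, make_model, train_step) tuple.
    Returns a dict mapping each job name to the number of steps it ran.
    """
    results = {}
    threads = [
        threading.Thread(
            target=train_one_model,
            args=(name, *job, epochs, results),
        )
        for name, job in jobs.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every model to finish training
    return results
```

Passing factories rather than ready-made loaders keeps the "one loader per thread" rule explicit: each loader is constructed inside the thread that consumes it.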

PS: There seems to be a bug in cuDNN: in some situations it will crash if two BatchNorm operations are issued at the same time. The workaround is to share a lock among all your threads so that no two threads run the model's forward pass at the same time (this should not slow down training, since a call to forward is non-blocking).
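
One way the shared lock could look, as a sketch under assumptions rather than a tested fix for the cuDNN issue. `forward_lock`, `locked_forward`, and `train_step` are hypothetical names; `train_step` assumes a PyTorch-style `model`, `optimizer`, and `loss_fn`, and it serializes only the forward call, since that is the only part the workaround mentions.

```python
import threading

# One lock object, created once and shared by every training thread.
forward_lock = threading.Lock()


def locked_forward(model, batch):
    """Call the model while holding the shared lock.

    Only the forward call is serialized, so two threads never issue
    a forward pass at the same moment.
    """
    with forward_lock:
        return model(batch)


def train_step(model, optimizer, loss_fn, inputs, targets):
    """One optimization step with a PyTorch-style interface (assumed)."""
    optimizer.zero_grad()
    outputs = locked_forward(model, inputs)  # serialized across threads
    loss = loss_fn(outputs, targets)
    loss.backward()   # outside the lock; whether this is always safe is untested here
    optimizer.step()
    return loss
```

The lock is held only for the duration of the call to `model(batch)`, which matches the description above of putting a lock around the forward pass and nothing else.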

sheffier commented 2 years ago

Thanks for the quick reply!

One question though. Won’t this kind of solution suffer from the GIL?

GuillaumeLeclerc commented 2 years ago

If you avoid non-FFCV augmentations, it is usually fine. FFCV runs outside of the GIL. The only problem could be the model, if you have a lot of fast layers that run faster than the time it takes to schedule them.

sheffier commented 2 years ago

Ok, I’ll give it a shot

libffcv / ffcv

Training multiple models on a single GPU #183