qubvel / segmentation_models

Segmentation models with pretrained backbones. Keras and TensorFlow Keras.
MIT License
4.77k stars 1.03k forks

Faster training? #305

Open ntelo007 opened 4 years ago

ntelo007 commented 4 years ago

Hi guys!

Is there a way to train the multi-label classification example faster? Does the os.environ['CUDA_VISIBLE_DEVICES'] = '0' expression on top make sure that the GPU will be used?

So far, only my CPU is utilized for the training procedure.

Thanks for your time!

JordanMakesMaps commented 4 years ago

That line restricts TensorFlow to the CUDA-enabled GPU with device number 0. If you have multiple CUDA-enabled GPU devices, use this:

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2'

Once you import tensorflow or keras, you should see an increase in the amount of memory being allocated/shared. If you're using a Windows machine, you can open up 'Task Manager', go to the 'Performance' tab, and on the left-side scroll down to see the GPUs.
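One detail worth making explicit: the environment variable only takes effect if it is set before TensorFlow is imported. A minimal sketch (the commented-out check assumes TF 1.14+ is installed):

```python
import os

# CUDA_VISIBLE_DEVICES must be set *before* TensorFlow is imported:
# TensorFlow enumerates GPUs once at import time, so setting the
# variable afterwards has no effect on device visibility.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# With TensorFlow installed, the devices it can see could then be
# listed (tf.config.experimental.list_physical_devices in TF 1.14+,
# tf.config.list_physical_devices in TF 2.x):
# import tensorflow as tf
# print(tf.config.experimental.list_physical_devices('GPU'))

print(os.environ['CUDA_VISIBLE_DEVICES'])
```

If the list of GPUs comes back empty even with the variable set correctly, the problem is usually the installed TensorFlow build or driver stack, not this line.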

Like I said, you should see an increase in the "Dedicated GPU memory usage" after creating/compiling the model (see image below). Also sometimes your IDE will make a note of it in the log, if you use Jupyter Notebook you should see a message like the other image below.

During training, the CPU is used for data augmentation (if you're doing any), and the GPU should handle the rest of the workload. If you train and your CUDA-enabled GPU isn't being used, check which versions of Keras and TensorFlow you're using (it should be the GPU build, 'tensorflow-gpu'). You should notice pretty quickly if the GPU isn't being used, because training any model on just the CPU takes FOREVER.
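A quick sketch of how one might check for a stray CPU-only install (in the TF 1.x era, 'tensorflow' and 'tensorflow-gpu' were separate pip packages; the helper name here is illustrative):

```python
# Sketch: check whether a given package/module is importable. Having the
# CPU-only 'tensorflow' package installed alongside 'tensorflow-gpu' can
# cause silent fallback to the CPU build.
import importlib.util

def installed(pkg_name):
    """Return True if a module by this name can be imported."""
    return importlib.util.find_spec(pkg_name) is not None

# Both pip packages expose the same 'tensorflow' module name, so the
# definitive runtime check is whether the build has CUDA support:
# import tensorflow as tf
# print(tf.test.is_built_with_cuda())

print(installed('os'), installed('definitely_not_a_real_package'))
```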

[screenshot: Task Manager showing "Dedicated GPU memory usage"]

[screenshot: notebook log showing the GPU being detected]

ntelo007 commented 4 years ago

That's exactly what is happening. I opened my Task Manager to see which resources are used, I also typed nvidia-smi in my cmd, and I verified that no process was running on my GPU. Apart from checking the compatibility of my installed libraries, is there anything else I need to do? Furthermore, can I use the keras.utils.multi_gpu_model functionality to use more than just one GPU?

JordanMakesMaps commented 4 years ago

No, I don't believe so. Having the right versions of Tensorflow-GPU and Keras-GPU is pretty important though, so be sure you have those versions and those alone. If you have the CPU versions installed as well, it might switch to those by default. You also need the proper versions of CUDA and cuDNN.

Sidenote: Here's a link to a Stack Exchange thread you can check.

ntelo007 commented 4 years ago

I am using os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'

and it seems that all GPUs are detected and created as TensorFlow devices. But unfortunately, the training time did not decrease as I expected.

[screenshot: log showing all four GPUs created as TensorFlow devices]

Could you please tell me what I should do to utilize all 4 GPUs and save a tremendous amount of training time?

Thanks for your time.

JordanMakesMaps commented 4 years ago

Here, this might help. I'm not familiar with using multiple GPUs as I only have one that is CUDA-enabled.
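For reference, the multi_gpu_model utility mentioned earlier was used roughly like this (a sketch, assuming an already-built Keras model and 4 visible GPUs; the API existed from Keras 2.0.9 and was later removed in favour of tf.distribute strategies):

```python
# Sketch of data-parallel training with keras.utils.multi_gpu_model.
# Assumes `model` is an already-built Keras model and 4 GPUs are visible:
#
# from keras.utils import multi_gpu_model
# parallel_model = multi_gpu_model(model, gpus=4)
# parallel_model.compile(optimizer='adam', loss='categorical_crossentropy')
# parallel_model.fit(x, y, batch_size=32)  # each batch is split across GPUs

# Because each batch is split evenly across replicas, the global batch
# size should be a multiple of the GPU count:
def per_gpu_batch(global_batch, gpus):
    if global_batch % gpus != 0:
        raise ValueError("global batch size must be divisible by gpu count")
    return global_batch // gpus

print(per_gpu_batch(32, 4))
```

Note that multi-GPU data parallelism only helps wall-clock time if the global batch size is also increased; with the same batch size, each GPU just does less work per step while per-step overhead stays the same.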

How many images are you working with, what's the number of epochs until the model finally converges, are you using pre-trained weights or training from scratch, and how large are your images?

ntelo007 commented 4 years ago

I am working with 150,000 images of size 256x256. I am not using pre-trained weights. I selected 50 epochs. By the second epoch the loss had mostly stopped dropping.
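To put those numbers in perspective, the per-epoch step count scales inversely with the effective batch size, which is one reason multi-GPU data parallelism can shorten epochs (the batch size of 32 here is an assumption for illustration):

```python
import math

# 150,000 images; with an assumed global batch size of 32,
# one epoch on a single GPU takes this many steps:
steps_single = math.ceil(150_000 / 32)

# With 4 GPUs and data parallelism, the global batch can often be
# scaled up 4x (memory permitting), quartering the steps per epoch:
steps_four = math.ceil(150_000 / (32 * 4))

print(steps_single, steps_four)
```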

I know about the Keras multi_gpu_model; I asked about it two comments above, but it didn't work. I don't know if it's because of how the models in this library are implemented.