kevinjohncutler / omnipose

Omnipose: a high-precision solution for morphology-independent cell segmentation
https://omnipose.readthedocs.io

Question about parallel computing on GPU #46

Open Johnxiaoming opened 1 year ago

Johnxiaoming commented 1 year ago

I have two GPUs on our server, but when I run the model, the second GPU is never used. I know my job is not very large, so the first GPU is not fully occupied, but supporting parallel computing would still save a lot of time. Even just a function that lets us select which GPU to use would help a lot, because then I could run two jobs, one on each GPU. Thanks.

If this is difficult, I will fork the repository and do it myself.
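In the meantime, a generic workaround (not Omnipose-specific, just the standard PyTorch/CUDA mechanism) is to pin each job to one GPU before anything touches CUDA:

```python
# Pin this process to the second physical GPU before importing anything that uses CUDA.
# Inside the process, that GPU then shows up as cuda:0.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
device = torch.device("cuda:0")  # maps to physical GPU 1 because of the mask above
print(torch.cuda.get_device_name(device))
```

Running one job with `CUDA_VISIBLE_DEVICES=0` and another with `CUDA_VISIBLE_DEVICES=1` gives the two-jobs-on-two-GPUs setup described above without any code changes.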

kevinjohncutler commented 1 year ago

@Johnxiaoming I have been using two GPUs very heavily recently and both are utilized. The latest GitHub version should work, and I have made enormous optimizations in the last two weeks that I will push to GitHub soon. I currently use DataParallel, which works fine for single servers and AWS instances; DistributedDataParallel might be better. To debug: what versions of Omnipose and PyTorch do you have, and what is your hardware?
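For reference, and purely as an illustration (not the actual Omnipose code), this is roughly what the DataParallel wrapping does, with a toy module standing in for the network:

```python
import torch
import torch.nn as nn

# Toy network standing in for the segmentation model; the wrapping is the point here.
net = nn.Conv2d(1, 1, 3, padding=1)

if torch.cuda.device_count() > 1:
    # DataParallel replicates the module on every visible GPU and
    # scatters the input batch (dim 0) across them on each forward pass.
    net = nn.DataParallel(net, device_ids=[0, 1])
net = net.cuda()

x = torch.randn(8, 1, 512, 512, device="cuda:0")  # batch of 8 is split 4/4 across the GPUs
y = net(x)
print(y.shape)
```

The key consequence: if the batch only ever contains one image, there is nothing to scatter and the second GPU stays idle.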

Johnxiaoming commented 1 year ago

A small Intel server with two NVIDIA RTX GPUs, running CUDA 11.8 and PyTorch 2.0 (cuda112py311h13fee9e_200, Python 3.11).

Thank you!

kevinjohncutler commented 1 year ago

@Johnxiaoming I just realized that you said running the model, not training (that's where my head has been at...). I know for sure that training uses both GPUs, but evaluation is another story. Some of my optimizations for training should apply to evaluation. The model itself is initialized with DataParallel in both training and evaluation, so my guess is that we simply are not saturating GPU 0, and so GPU 1 never gets called. Can you tell me what your typical image set looks like in terms of number and resolution? If you monitored GPU 0 during evaluation, I am curious whether its VRAM was completely used up or well under maximum capacity.
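To answer the saturation question, something like this during evaluation would tell us (a generic PyTorch snippet, not part of Omnipose):

```python
import torch

def report_vram(tag=""):
    # Allocated vs. total memory on each visible GPU; if GPU 0 sits well below its
    # capacity, the sequential evaluation loop is the bottleneck, not VRAM.
    for i in range(torch.cuda.device_count()):
        used = torch.cuda.memory_allocated(i) / 1e9
        peak = torch.cuda.max_memory_allocated(i) / 1e9
        total = torch.cuda.get_device_properties(i).total_memory / 1e9
        print(f"{tag} GPU {i}: {used:.2f} GB in use, {peak:.2f} GB peak, {total:.2f} GB total")

# Call report_vram() right after evaluating an image, or watch
# `nvidia-smi -l 1` in another terminal for the same information.
```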

Some more explanation and planning: the current behavior is to process images in sequence. The batch_size parameter only applies to tiled mode, where the image is split into 224x224 px patches and run in parallel. That should always be slower than running the whole image, so long as the full image fits on the GPU. I think the reason Cellpose did not build in the ability to run multiple images at once on the GPU is that each image in a batch must be the same size (this is guaranteed during training via cropping), and they were typically evaluating on very diverse datasets. However, in real applications we usually have same-size images from a given sensor, or even a cropped time lapse, so it makes a LOT of sense to run whole images in batches. A sketch of what that could look like is below.
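A rough sketch of batched whole-image evaluation, assuming all images share the same shape (`net` stands in for the DataParallel-wrapped model; the stacking, not the network, is the point):

```python
import torch

def eval_batched(net, images, batch_size=4):
    # images: list of same-size arrays/tensors with shape (C, H, W).
    # Stacking them lets DataParallel scatter each chunk across both GPUs,
    # instead of feeding one image at a time and leaving GPU 1 idle.
    outputs = []
    with torch.no_grad():
        for i in range(0, len(images), batch_size):
            batch = torch.stack([torch.as_tensor(im) for im in images[i:i + batch_size]])
            outputs.append(net(batch.float().cuda()).cpu())
    return torch.cat(outputs)
```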

Moreover, it makes sense to run the mask reconstruction in parallel as well, again VRAM permitting. Doing the Euler integration in one loop for all images simultaneously, instead of multiple loops in sequence, is virtually guaranteed to be faster (see the sketch below). I already figured out much of the code for this to parallelize training, and I can see exactly what we need to do for evaluation. I just need to find the time to implement it. I suspect I will do it by the end of the month, so stay tuned!
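The idea is that one Euler step can advance every pixel of every image in the batch at once. A minimal sketch, assuming a batch of predicted flow fields `flows` with shape (N, 2, H, W) and bilinear sampling via grid_sample (illustrative only, not the Omnipose implementation):

```python
import torch
import torch.nn.functional as F

def euler_integrate(flows, niter=200, step=1.0):
    # flows: (N, 2, H, W) flow fields, channel order (y, x).
    # Every pixel of every image starts at its own coordinate and the whole
    # batch is advanced together in each step, instead of one image at a time.
    N, _, H, W = flows.shape
    yy, xx = torch.meshgrid(
        torch.arange(H, device=flows.device, dtype=flows.dtype),
        torch.arange(W, device=flows.device, dtype=flows.dtype),
        indexing="ij",
    )
    pts = torch.stack([yy, xx]).expand(N, 2, H, W).clone()  # (N, 2, H, W)
    scale = torch.tensor([H - 1, W - 1], device=flows.device,
                         dtype=flows.dtype).view(1, 2, 1, 1)

    for _ in range(niter):
        # grid_sample expects (x, y) coordinates normalized to [-1, 1].
        grid = (2 * pts / scale - 1).flip(1).permute(0, 2, 3, 1)   # (N, H, W, 2)
        f = F.grid_sample(flows, grid, align_corners=True)         # flow at current positions
        pts = (pts + step * f).clamp_(min=0)                       # one step for all images
        pts = torch.minimum(pts, scale)                            # stay inside the image
    return pts  # final positions; clustering these yields the masks
```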