MultiGPU - Crash when training

sowson / darknet

Darknet on OpenCL Convolutional Neural Networks on OpenCL on Intel & NVidia & AMD & Mali GPUs for macOS & GNU/Linux & Windows & FreeBSD

http://pjreddie.com/darknet/

MultiGPU - Crash when training #13

Closed yaoanderson closed 5 years ago

yaoanderson commented 5 years ago

Hi @sowson, I have found that the program sometimes crashes while training my network, but I am not sure why. Do you have any idea what might cause this?

yaoanderson commented 5 years ago

Another question: I see you mentioned the "-gpus 0,1" parameter in another crash issue. What is the difference between "-gpus 0,1" and "-i 1", and how should I use them?

My case: when I train with "-i 1", my Mac responds so slowly that I can hardly do anything besides training. When I train with "-gpus 0,1", I can still do other things, and I do not know why. It seems that with "-gpus 0,1" training uses only GPU 0 (Intel) and not GPU 1 (AMD).

sowson commented 5 years ago

Hi, this is a known issue: you have to have identical GPUs, and then it works :D. The nature of the problem is that there is only one OpenCL queue, and it works with multiple GPUs only when the kernels for the devices are compiled in exactly the same way. Then and only then does it work! 🗡

See my DreamPC setup:

And how it works:

Thanks! You may close this, because what you request is not possible on your HW.
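To make the "identical GPUs" requirement concrete, here is a minimal sketch of a pre-flight check. It is not taken from the darknet source: the helper name `devices_look_identical` is hypothetical, and it assumes the device names have already been collected, for example by calling `clGetDeviceInfo` with `CL_DEVICE_NAME` for each device. Matching names are only a rough heuristic for "kernels compile in exactly the same way".

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of darknet.
 * Returns 1 when every device name equals the first one, 0 otherwise.
 * Zero or one device trivially counts as "identical". */
int devices_look_identical(const char *const *names, int count)
{
    assert(names != NULL && count >= 0);
    for (int i = 1; i < count; ++i) {
        /* Any mismatch means the GPUs are a mixed set. */
        if (strcmp(names[0], names[i]) != 0) {
            return 0;
        }
    }
    return 1;
}
```

Under this assumption, a laptop that pairs an Intel GPU with an AMD GPU would fail the check, while a machine with two or more cards of the same model would pass.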

yaoanderson commented 5 years ago

Sorry, I am confused and still do not understand why what I request is not possible on my HW.

sowson commented 5 years ago

Hi, it is possible when you have 2+ exactly identical GPUs. By HW I mean hardware. Thanks!

yaoanderson commented 5 years ago

I got it, thanks sowson.

yaoanderson commented 5 years ago

And what about my crash issue?

sowson commented 5 years ago

Before we close... there is one more thing... I was not 100% right. In the Makefile or CMakeLists.txt you may use a switch named MULTI_GPU, and then (after the fix I just made) you will be able to really run MultiGPU. But this disables the sgemm implementation from clBLAS, because that one is broken when accessed by many devices at the same time. Instead, I wrote my own sgemm (a trivial one, without optimization). It works, and it shows that once clBLAS is ready, true MultiGPU will be possible. For now the math will be correct, but the unoptimized sgemm will slow you down. The last thing is macOS: when you use -gpus 0,1 it slows down GPU 0, and you will suffer from a lack of responsiveness even when you use Console/Terminal. Thanks a lot, you have very nice issues so far!
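For readers wondering what a "trivial, unoptimized sgemm" means in practice, here is a minimal CPU-side sketch. It is an illustration only, not the project's actual OpenCL kernel: the function name `naive_sgemm`, the row-major layout, and the absence of transpose flags are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration, not the kernel used in darknet.
 * Computes C = alpha * A * B + beta * C for row-major matrices:
 * A is M x K, B is K x N, C is M x N. No tiling, no vectorization. */
void naive_sgemm(int M, int N, int K, float alpha,
                 const float *A, const float *B,
                 float beta, float *C)
{
    assert(M >= 0 && N >= 0 && K >= 0);
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            /* Dot product of row i of A with column j of B. */
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += A[(size_t)i * K + k] * B[(size_t)k * N + j];
            }
            C[(size_t)i * N + j] = alpha * acc + beta * C[(size_t)i * N + j];
        }
    }
}
```

The triple loop gives correct results but does the full M x N x K multiply-adds with no blocking or reuse of data in fast memory, which is why a version like this is expected to be much slower than a tuned BLAS routine.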

yaoanderson commented 5 years ago

Hi @sowson, I do not want to train my network on the default GPU 0 (Intel) because it is too slow, but training sometimes crashes when I use -i 1 (the AMD 4GB GPU). Please help me with this issue. Can I avoid the crash through some configuration, or is there any other workaround? My goal is to train fast on my AMD GPU without crashes. Please help, thanks.

sowson commented 5 years ago

I tried to address this with 2 things: first, some OpenCL setup in this project, and second, https://github.com/sowson/clBLAS, which is MultiGPU-ready for this project. I even put a pull request in place. You may update to the source code I just pushed and then compile the mentioned clBLAS on your macOS; it should improve stability not only for multi-GPU but overall. Thanks!

yaoanderson commented 5 years ago

OK, I will try. Thanks so much, sowson.

yaoanderson commented 5 years ago

Running sudo cmake and sudo make / install in clBLAS/src succeeded:

Then I rebuilt my darknet project successfully:

But unfortunately, it still failed. Please help check my steps and the failure, @sowson:

yaoanderson commented 5 years ago

This happens only sometimes, and it is a second-priority issue. Please help solve the other loss value issue first. Thanks, sowson.

sowson commented 5 years ago

If there are no more issues here, can we close this one and focus on #16 instead? Thanks!

yaoanderson commented 5 years ago

OK, let me focus on the other ticket.