mrc-ide / covid-sim

This is the COVID-19 CovidSim microsimulation model developed by the MRC Centre for Global Infectious Disease Analysis, hosted at Imperial College London.
GNU General Public License v3.0

Consider the use of GPGPU (like CUDA) for performance #203

Closed · Feynstein closed this issue 4 years ago

Feynstein commented 4 years ago

I still don't know how it could be brought into the code, but as the French Canadian expression goes: "Il y a toujours moyen de moyenner" (roughly, "there's always a way to work it out").

At my university there are a few grappes... hum... clusters... of GPUs that we can use in research, and I assume it's the same at your college. If it's not, these guys might be able to help (me) if it's for important epidemiology modelling.

https://www.calculquebec.ca/en/

or

https://www.computecanada.ca/

They also might be able to provide raw CPU power... I'm thinking of this one: https://en.m.wikipedia.org/wiki/CLUMEQ

I think I still have my account there; I'm going to check whether it still exists, and if you want this kind of help I'll see what I can do.

Annnnddd now I'm done... no more issues. I'll work on what I have and wait for an update on your part. Thanks, and let's try to make this better, shall we?

matt-gretton-dann commented 4 years ago

Thank you for raising this issue.

I have not investigated GPU offload in any depth. I believe that with a modern enough compiler you should be able to configure OpenMP to offload to the GPU, so the code as-is should be able to make use of one. However, I do not expect this to provide a significant performance gain: the model contains a large amount of control-dependent flow, which I would not expect to perform well on a GPU without significant reworking of the model.
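
For concreteness, here is a minimal sketch of what OpenMP target offload looks like. The loop and array names are illustrative only, not taken from CovidSim, and it assumes a compiler built with offload support (for example clang with an appropriate `--offload-arch`, or nvc++ with `-mp=gpu`):

```cpp
// Minimal OpenMP target-offload sketch (illustrative names, not CovidSim code).
// The map() clauses copy the inputs to the device and the result back.
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
    float* pa = a.data();
    float* pb = b.data();
    float* pc = c.data();

    // Run the loop on the device if one is available; otherwise OpenMP
    // falls back to executing it on the host.
    #pragma omp target teams distribute parallel for \
        map(to: pa[0:n], pb[0:n]) map(from: pc[0:n])
    for (int i = 0; i < n; ++i)
        pc[i] = pa[i] + pb[i];

    std::printf("c[0] = %f\n", pc[0]);  // expect 3.0
    return 0;
}
```

A uniform element-wise loop like this is the best case for offload; the control-dependent flow in the model is exactly what would not map this cleanly.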

Feynstein commented 4 years ago

@matt-gretton-dann I see. But I think something along these lines might be possible later on; coding directly in CUDA C is much more efficient. I wrote an x-ray image simulator that takes a few seconds to generate a 3104x3104 16-bit image from a model with 100k+ triangles. I'll try to point out the places where I think it could be implemented.
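
To illustrate the "GPU-first" style I mean, here is a minimal CUDA C++ sketch; everything in it (names, sizes) is hypothetical and unrelated to either codebase. The point is that each thread owns exactly one output element, so there is no shared mutable state and no CPU-style outer loop to untangle:

```cuda
// Hypothetical CUDA C++ sketch of a kernel designed GPU-first.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float alpha, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one element per thread
    if (i < n)                                      // guard the final partial block
        y[i] = alpha * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory, for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    const int block = 256;
    const int grid = (n + block - 1) / block;  // round up so every i is covered
    saxpy<<<grid, block>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();

    std::printf("y[0] = %f\n", y[0]);          // expect 5.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```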

I also wanted to know if you wanted the computing help I proposed. Thanks :-)

zebmason commented 4 years ago

"Premature optimization is the root of all evil" (Donald Knuth). I'd be tempted to take the code back to single-threaded, refactor it, then optimise based on profiling. Unless compilers have advanced enough since I last did it, you would otherwise be optimising for one machine.

Feynstein commented 4 years ago

@zebmason When you work with CUDA C directly, most of the time you need to start with it, because the mentality is very different and it's harder to port kernels that were already written for the CPU. It's funny, because my computer science boss used to say the exact same thing. But when I started writing the kernel for my simulator, he quickly realized I was not kidding. In fact, as of right now he still doesn't understand everything that's going on in it, because it's deeply intertwined with how x-rays interact with matter and are attenuated. That's why science-y CUDA code must be written with this language in mind from the start. And my boss now agrees with me, lol.

robscovell-ts commented 4 years ago

@zebmason I'd take it a step further than your suggestion: re-factor with thread safety as a top-level priority, to mitigate determinacy issues when running on multiple cores.
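
As a sketch of what keeping determinacy in mind can look like in practice (illustrative code only, not CovidSim's actual approach): give each thread its own accumulator over a statically scheduled loop, then combine the partials in a fixed order, so the floating-point result does not depend on thread scheduling:

```cpp
// Illustrative sketch: a parallel sum that is reproducible for a fixed
// thread count. An unordered reduction over floats can vary run to run
// because floating-point addition is not associative.
#include <cstdio>
#include <omp.h>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<double> data(n, 0.1);

    std::vector<double> partial(omp_get_max_threads(), 0.0);
    #pragma omp parallel
    {
        const int t = omp_get_thread_num();
        double local = 0.0;                 // thread-private, avoids false sharing
        #pragma omp for schedule(static)    // static: fixed work split per thread
        for (int i = 0; i < n; ++i)
            local += data[i];
        partial[t] = local;
    }

    double total = 0.0;
    for (double p : partial)                // fixed combination order
        total += p;

    std::printf("total = %f\n", total);
    return 0;
}
```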

weshinsley commented 4 years ago

Gradually cleaning up our old issues. I don't think we'll implement any code against this issue because...

1) We have other models in the group that use GPUs; it's not an easy task, and not all algorithms lend themselves to GPU execution. I agree with the comment above that porting to GPU is a very different thing from targeting it from the outset. My feeling is that getting covid-sim to map well onto GPU execution would be enormous effort for little performance gain, considering the granularity of the OpenMP sections and the large memory layout - it doesn't feel intuitively like a GPU recipe to me. In any case, the code is not particularly slow to run, and increasing performance is not a current priority or need.

2) The parallel sections in the main loop were written starting from an optimal single-threaded position, and the parallelism always had thread-safety and determinism in mind. I wouldn't recommend anyone de-parallelise the code back to single-threaded and try to repeat all that engineering. I suspect that would be tricky work, potentially looking for problems that aren't there, since we discussed the true determinism of the code amply in previous months.