gschramm / parallelproj

code for parallel TOF and NONTOF projections

Separate malloc, compute and free #14

Closed AnderBiguri closed 2 years ago

AnderBiguri commented 3 years ago

parallelproj is now in STIR (yay!).

However, due to STIR's design, it is still quite slow within it. This is because STIR breaks the problem up a lot before it gets to calling the GPU projectors, and by that point it only requests small chunks, e.g. a few forward-projection views.

At the current stage, this means that on each of these calls we copy the entire image to the GPU, forward project the entire detector (this is a STIR issue: it asks too much of the GPU code), and finally, once the result is returned to the CPU, copy out just the chunks that are needed.

Of course, this is super wasteful: most of the time we make copies that are not needed and ask for computation that is thrown away. We have ideas on how to improve this. The part where we ask for more computation than required is more of a STIR problem, but the unnecessary copies need to be handled in this code. Ideally, we'd like STIR to be able to ask for a few forward projections from an image that is already stored on the GPU. An easy solution would be for STIR to store the pointer to the GPU image, and then call the GPU code with an already allocated and copied GPU image.

All this to say: it would be great to have the forward and back projection codes separated into 3 functions (allocate+copy, compute, free), and to have all three functions exposed.

All this also applies to the backprojection.
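
To make it concrete, a minimal sketch of what such a split could look like (all names are made up, not the actual parallelproj API, and a placeholder kernel stands in for the real TOF/NONTOF projection kernels):

#include <cuda_runtime.h>

// placeholder kernel: the real forward-projection kernel would go here
__global__ void fwd_kernel(const float *img, const float *lors, float *out, long long n_lors)
{
  long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
  if (i < n_lors) out[i] = img[0] * lors[i];
}

// (1) allocate + copy: the image becomes resident on the device
float *copy_img_to_device(const float *h_img, long long nvox)
{
  float *d_img;
  cudaMalloc(&d_img, nvox * sizeof(float));
  cudaMemcpy(d_img, h_img, nvox * sizeof(float), cudaMemcpyHostToDevice);
  return d_img;
}

// (2) compute: per-chunk LOR transfer, kernel launch, per-chunk cleanup;
//     the image is NOT copied again
void fwd_project_chunk(const float *d_img, const float *h_lors, long long n_lors, float *h_out)
{
  float *d_lors, *d_out;
  cudaMalloc(&d_lors, n_lors * sizeof(float));
  cudaMalloc(&d_out, n_lors * sizeof(float));
  cudaMemcpy(d_lors, h_lors, n_lors * sizeof(float), cudaMemcpyHostToDevice);

  int threads = 256;
  int blocks = (int)((n_lors + threads - 1) / threads);
  fwd_kernel<<<blocks, threads>>>(d_img, d_lors, d_out, n_lors);

  cudaMemcpy(h_out, d_out, n_lors * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_lors);
  cudaFree(d_out);
}

// (3) free: release the image only once all chunks are projected
void free_img_on_device(float *d_img) { cudaFree(d_img); }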

gschramm commented 3 years ago

Hi Ander,

I completely agree. The design is suboptimal for many toolboxes (I was just lazy in my initial implementation). I will think about + work on this in the coming weeks.

Georg

AnderBiguri commented 3 years ago

Hi Georg,

Ah, no worries (TIGRE is designed like this too, so I can't really criticize you :) ). I was adding this also to keep track of it, as the current times within STIR are quite bad, and we would definitely want STIR to improve! Hopefully it's an easy fix; let me know if you need a hand with some bits, albeit I am quite swamped with my own things, so I can't promise I can deliver a lot! :)

Ander

gschramm commented 3 years ago

Hi Ander,

so from the conceptual point of view you would have:

(1) malloc function: mallocs + copies img, voxsize, img_dim, img_offset to the GPU, and somehow returns the 4 device pointers
(2) compute function: mallocs + copies the LOR coordinates and the output array, invokes the kernel (using also the device pointers from (1)), and frees the LOR coordinates and the output array
(3) free function: frees the image device pointers

Should (1) return a structure containing all the device pointers or should (1) use the output pointers as input arguments as well?
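
In code, the two options for (1) would look something like this (hypothetical names, not the current API):

#include <cuda_runtime.h>

// option A: (1) returns a struct bundling the four device pointers
typedef struct {
  float *d_img;
  float *d_voxsize;
  int   *d_img_dim;
  float *d_img_offset;
} dev_img_t;

// option B: (1) writes the device pointers through caller-provided arguments
void copy_img_to_device(const float *h_img, long long nvox,
                        const float *h_voxsize, const int *h_img_dim,
                        const float *h_img_offset,
                        float **d_img, float **d_voxsize,
                        int **d_img_dim, float **d_img_offset)
{
  cudaMalloc(d_img, nvox * sizeof(float));
  cudaMemcpy(*d_img, h_img, nvox * sizeof(float), cudaMemcpyHostToDevice);

  cudaMalloc(d_voxsize, 3 * sizeof(float));
  cudaMemcpy(*d_voxsize, h_voxsize, 3 * sizeof(float), cudaMemcpyHostToDevice);

  cudaMalloc(d_img_dim, 3 * sizeof(int));
  cudaMemcpy(*d_img_dim, h_img_dim, 3 * sizeof(int), cudaMemcpyHostToDevice);

  cudaMalloc(d_img_offset, 3 * sizeof(float));
  cudaMemcpy(*d_img_offset, h_img_offset, 3 * sizeof(float), cudaMemcpyHostToDevice);
}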

AnderBiguri commented 3 years ago

I think it would be useful to fragment this as much as possible, but this may need a conversation between us and @KrisThielemans. It depends on how STIR handles the data; maybe the LOR malloc and/or memcopy should be in its own function.

About your question: I think this is a preference thing. I find it easier when functions don't make their own structures, as that makes them easier to integrate, so I'd go with the second option; but this tends to be much harder to read as code. I am a bit of a reductionist for these things, to be honest; maybe @KrisThielemans would prefer a struct.

To be fair, if we go with the struct, maybe it's worth having a small class that handles per-GPU data and has malloc/copy/free methods. But that would suddenly make this codebase OOP when it was not. This is one of the reasons I prefer just raw pointers too: it reduces repo complexity.

gschramm commented 3 years ago

I would also be in favor of not defining custom structs/classes, to keep it basic (at the cost of a bit longer code). But I am not sure we should split off the malloc/sending of the LOR coordinates as well, since they will for sure change for every projection, unless you want to project different images along the same LORs.

AnderBiguri commented 3 years ago

I was thinking more along the lines of:

malloc(image)               // allocate + copy the image once
malloc(LORs)                // allocate + copy all LOR coordinates once
fwd(image, half_LORs)       // compute only; everything is already on the GPU
somethingelseonGPU()
fwd(image, other_half_LORs) // takes LOR memory pointers, no need to reallocate anything, just compute
free_stuff()

This is very far from the current goal for STIR, and likely from your code, but if you were to run OSEM fully on the GPU, it would make sense to have only one malloc/memcopy of the LORs.

gschramm commented 3 years ago

Hi Ander, I made a first implementation of 2 functions to allocate and free the memory of a float array on all visible CUDA devices, here. (The exposed compute functions are not adapted yet.)

Does that make sense for you?

The workflow I have in mind is something like:

d_imgs = copy_float_array_to_all_devices(const float *h_array, long long n)
fwd_project(d_imgs, LORs_host_arrays ...)
free_float_array_on_all_devices(float **d_imgs) 

I would only split off the memory transfer of the image, since all the other inputs (LOR coordinates ...) change if we break the projections up into small chunks.

AnderBiguri commented 3 years ago

I think it sounds good to me! I need to discuss a bit more with @KrisThielemans to understand/remember a bit better how the STIR side works, but I believe that what you describe is what we need to make STIR really fast.

gschramm commented 3 years ago

Great. I think the main overhead right now is that the 3D image volume gets sent to/from the device for every projection, which creates huge overhead when splitting the projections into small profiles. The new approach will avoid that. I am also looking into using several streams and multiple CPU threads (when using multiple GPUs) to reduce the overhead caused by the memory transfer of the LOR parameters.

AnderBiguri commented 3 years ago

@gschramm Yes, you are correct. I think the current STIR also sends all LORs to the GPU, even for subsets, but that is an issue on the STIR side.

I am not sure you can gain much speed with streams/threads though. In theory (I have not profiled it, so I am just guessing from the code) you already have simultaneous copies to each GPU, because you are using the Async version of the memcopy. The advantage of streams etc. is being able to do 2 things at the same time, but you can't really do that in this code, as each part requires the previous one. Note that there is only 1 copy engine in each GPU (some GPUs have 2 copy engines, one for H2D and another for D2H), so you cannot copy 2 things to a single GPU at the same time.

What is your idea here?

Depending on the GPU and the image/LOR size, you may gain some copy speed by page-locking the memory (cudaHostRegister). The registration itself takes a significant amount of time, but afterwards you can memcpy at ~12 GB/s instead of ~2 GB/s. For very small data it's not worth it, but it becomes a time saver at some point.
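
Something along these lines (made-up helper, just to illustrate the idea):

#include <cuda_runtime.h>

// page-lock an existing host buffer once, reuse it for many fast H2D copies
void repeated_pinned_copies(float *h_img, long long nvox, float *d_img, int n_copies)
{
  // one-time (expensive) cost: register, i.e. page-lock, the host memory
  cudaHostRegister(h_img, nvox * sizeof(float), cudaHostRegisterDefault);

  for (int i = 0; i < n_copies; i++) {
    // pinned transfers run much faster and can be truly asynchronous
    cudaMemcpy(d_img, h_img, nvox * sizeof(float), cudaMemcpyHostToDevice);
  }

  // undo the page-locking when done
  cudaHostUnregister(h_img);
}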

gschramm commented 3 years ago

I started using Nsight Systems to profile the projectors, and I saw that the copying to multiple GPUs is serial, which can be fixed using multiple CPU threads (work in progress). With the streams you are right: it depends on the time needed for memcopy and kernel execution. I will do some more profiling to see where we are and what can eventually be gained by using more streams.

AnderBiguri commented 3 years ago

Hum, yes, you are right. Now that you say it (and I double checked), I remember that "Async" is a bit of a bad name, as it's only sometimes async, and I believe it only behaves like that with page-locked memory.

gschramm commented 3 years ago

memcpyAsync is only async within one stream. If you change the device, you go to a new stream (e.g. the default stream), so in my poor implementation with a normal host for loop, it was actually not async. But simply turning the for loop into an OpenMP parallel for solves this issue. Fortunately, I followed a video tutorial on Nsight Systems, which easily shows you all those problems :)
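
Roughly like this (illustrative names, assuming one pre-allocated device buffer per GPU and a page-locked host array):

#include <cuda_runtime.h>
#include <omp.h>

// copy the same host image to every visible device in parallel:
// one CPU thread per GPU, so the host-to-device copies overlap
void copy_to_all_devices(const float *h_img, long long nvox, float **d_imgs, int n_devices)
{
  #pragma omp parallel for
  for (int dev = 0; dev < n_devices; dev++) {
    cudaSetDevice(dev);
    cudaMemcpyAsync(d_imgs[dev], h_img, nvox * sizeof(float),
                    cudaMemcpyHostToDevice, 0);
  }

  // wait for all devices before using the copies
  for (int dev = 0; dev < n_devices; dev++) {
    cudaSetDevice(dev);
    cudaDeviceSynchronize();
  }
}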

AnderBiguri commented 3 years ago

With streams, when you change the stream from your host code, you only change which stream you are referring to and queueing operations on; if you have queued some operations on a stream asynchronously, they will keep being executed, even if your host is now on a different stream/GPU.

I believe the reason why they are not async is really the page-locking: https://stackoverflow.com/a/46964873/1485872 . These are of course just technicalities: if you indeed use CPU threads, then it does not matter that each CPU thread is blocked by that function in a synchronous manner, as you have still called them in parallel, so your issue is solved :)

gschramm commented 3 years ago

Hi Ander, I finished implementing the separation of malloc, compute and free, now also for the TOF projectors. I also did some timing tests to see how the performance decreases when we split a projection into smaller chunks (e.g. radial profiles). One thing I noticed is that when the number of LORs per chunk is "small", we get an overhead because of the additional calls to cudaFree and cudaMalloc.
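
One idea (just a sketch, not implemented) would be to allocate the per-chunk LOR/output buffers once at the maximum chunk size and reuse them across chunks:

#include <cuda_runtime.h>

// reusable device buffers, sized for the largest chunk
typedef struct {
  float *d_lors;       // LOR coordinates of the current chunk
  float *d_out;        // projection values of the current chunk
  long long capacity;  // maximum number of LORs the buffers can hold
} chunk_buffers_t;

chunk_buffers_t alloc_chunk_buffers(long long max_lors)
{
  chunk_buffers_t b;
  b.capacity = max_lors;
  cudaMalloc(&b.d_lors, max_lors * sizeof(float));
  cudaMalloc(&b.d_out, max_lors * sizeof(float));
  return b;
}

// per chunk: only memcopies into the reused buffers, no cudaMalloc/cudaFree

void free_chunk_buffers(chunk_buffers_t *b)
{
  cudaFree(b->d_lors);
  cudaFree(b->d_out);
  b->capacity = 0;
}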

How many LORs does STIR typically project in one go?

Georg

AnderBiguri commented 3 years ago

Great news! Ah, "typically" is hard to answer, as in STIR you can make your own scanners and it has support for many. The reason for this request is that your GPU code is hidden somewhere inside STIR's projector classes, and the way STIR projection works is that it first computes all LORs, and then asks each time for small subsets of the required projection ("related viewgrams", they are called).

I admit this is a vague description of the problem, but the very short conclusion is that the only way to use your projectors now is to ask for the entire forward projection and then slice the result to get the subsets. You can imagine this is very wasteful: most of the time we are computing stuff that gets dumped. You can see in the conference paper we submitted to MIC that, for a big machine with 24 cores, the GPU code is much slower than the CPU code, which should not happen.

Splitting the code into 3 functions will allow us to modify STIR such that it holds the pointers to GPU memory and only calls the compute part, instead of the malloc part. This will already be a huge improvement, even if the sum total of the functions is slower.

gschramm commented 3 years ago

The changes are now merged into the master branch (but I haven't made an official release yet).

And here is a high-level example showing all the new steps for fwd and back projection (in my Python wrapper): https://github.com/gschramm/parallelproj/blob/d6f1012df00657afec5bd78976e4d87cfb887cc2/pyparallelproj/wrapper.py#L26

What is the definition of a viewgram? Is that one "view" (angle) in a 3D sinogram?

AnderBiguri commented 3 years ago

Fantastic.

STIR has its own way of defining things; best if you have a look at the glossary, as I'd likely give you a worse answer than what's there! http://stir.sourceforge.net/documentation/STIR-glossary.pdf