simeks / deform

Efficient dense deformable image registration
MIT License

Interpolation on GPU #23

Closed: simeks closed this issue 6 years ago

simeks commented 6 years ago

There are quite large precision errors when doing linear interpolation with CUDA textures, since the hardware computes the interpolation weights with only 8 bits of fractional precision, i.e. quantized to steps of 1/256 (see here).

I noticed this when comparing the results between the GPU version and the old version of the cost functions. It probably won't have any effect on the submodularity, as it only affects the unary term, but it could impact the registration results. The error is around 0.8%.

We could do the interpolation ourselves, but that will most likely have a significant impact on performance. I will write a benchmark today to see how big the performance hit is. If it's not too big we could probably go for that; otherwise we should discuss it, since it's a trade-off between quality and performance.

m-pilia commented 6 years ago

Interesting, thanks. Well, the built-in texture interpolator is designed for graphics, not HPC, but the performance gain is likely to be significant, so it may be worth checking the actual impact on the registration quality. When you say the error is around 0.8%, which error do you refer to exactly? The resampled values, or the final registration energy?

simeks commented 6 years ago

Sorry, I meant the error in the sampled value.

m-pilia commented 6 years ago

No problem, thanks. Maybe we can check how much that actually affects the registration. I guess the effect will be roughly equivalent to a very slight low-pass filtering of the moving image, but it is hard to tell in advance how much it will affect the quality.

I was thinking of setting up some ROI-based evaluation for the registration, since the energy is totally unreliable as a measure of alignment. I guess you already have something like that with the brain images; I was thinking of preparing a similar benchmark based on the POEM data.

simeks commented 6 years ago

Yeah, that would be great. I have a small setup for brain images, but POEM would definitely be a lot better. If the networking at ing24 wasn't such a mess I would probably be able to put some automated evaluation on Kerstin. I could maybe make a bot that pulls every new commit and posts the result to Twitter, haha.

I did a naive GPU version of linear_at now. It's about 30% slower, which isn't too bad. I could probably speed it up even more.
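Roughly what I mean by doing the interpolation ourselves: a minimal sketch with full fp32 weights instead of the texture unit's 8-bit fixed-point fraction (names and the flat indexing are just for illustration, not the actual linear_at):

```cpp
__device__ float linear_at_sw(const float* vol, int w, int h, int d,
                              float x, float y, float z)
{
    int x0 = (int)floorf(x), y0 = (int)floorf(y), z0 = (int)floorf(z);
    float fx = x - x0, fy = y - y0, fz = z - z0;

    // Clamp addressing at the borders.
    x0 = min(max(x0, 0), w - 1);
    y0 = min(max(y0, 0), h - 1);
    z0 = min(max(z0, 0), d - 1);
    int x1 = min(x0 + 1, w - 1);
    int y1 = min(y0 + 1, h - 1);
    int z1 = min(z0 + 1, d - 1);

    #define V(xi, yi, zi) vol[((zi) * h + (yi)) * w + (xi)]
    // Interpolate along x, then y, then z.
    float c00 = V(x0, y0, z0) * (1 - fx) + V(x1, y0, z0) * fx;
    float c10 = V(x0, y1, z0) * (1 - fx) + V(x1, y1, z0) * fx;
    float c01 = V(x0, y0, z1) * (1 - fx) + V(x1, y0, z1) * fx;
    float c11 = V(x0, y1, z1) * (1 - fx) + V(x1, y1, z1) * fx;
    #undef V

    float c0 = c00 * (1 - fy) + c10 * fy;
    float c1 = c01 * (1 - fy) + c11 * fy;
    return c0 * (1 - fz) + c1 * fz;
}
```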

simeks commented 6 years ago

I did some minor optimizations and now it seems to perform the same as the CUDA interpolation on a GTX 1080 Ti. It is still slower by ~25% on my laptop, but that's not really our target hardware anyway.

This makes me happy, as I've been thinking a lot about texture vs. pitched memory for our volumes. Since we seem to get the same performance either way, I think we should probably avoid textures. They're really fast and cache-efficient, but handling them outside the kernel is a bit of a hassle.

Will need to do some further benchmarking though, as we use textures in other areas as well.
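To illustrate the hassle, this is roughly the host-side ceremony a 3D texture needs (a simplified sketch, no error checking; w, h, d and host_data stand in for a real volume, and this isn't necessarily how we'd structure it):

```cpp
cudaExtent extent = make_cudaExtent(w, h, d);
cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
cudaArray_t arr;
cudaMalloc3DArray(&arr, &fmt, extent);

// Copy the volume into the CUDA array.
cudaMemcpy3DParms copy = {};
copy.srcPtr = make_cudaPitchedPtr(host_data, w * sizeof(float), w, h);
copy.dstArray = arr;
copy.extent = extent;
copy.kind = cudaMemcpyHostToDevice;
cudaMemcpy3D(&copy);

cudaResourceDesc res = {};
res.resType = cudaResourceTypeArray;
res.res.array.array = arr;

cudaTextureDesc desc = {};
desc.addressMode[0] = desc.addressMode[1] = desc.addressMode[2] = cudaAddressModeClamp;
desc.filterMode = cudaFilterModeLinear;  // the hardware interpolation in question
desc.readMode = cudaReadModeElementType;

cudaTextureObject_t tex;
cudaCreateTextureObject(&tex, &res, &desc, nullptr);
// In the kernel: float v = tex3D<float>(tex, x + 0.5f, y + 0.5f, z + 0.5f);
```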

m-pilia commented 6 years ago

That's very good news! I cannot run the benchmarks on Linux now because these days my 1080 Ti is busy training a DNN, and I have no CUDA device on the laptop. Maybe I can run them on the old GTX 660, for what it is worth... I am keeping an eye on the code though; over the weekend I was mostly busy porting disptools to CUDA. I should be done in a couple of days, so I will be able to have a closer look at stk.

For the evaluation, so far I have mostly used MSE and MI as alignment measures, plus AICE for deformation quality, but it turns out all these measures are almost useless as quality predictors. I have a bunch of organ segmentations for the POEM data, so I think I will move to Jaccard of the warped ROIs and the like, and see what happens. Registering many subjects can be quite expensive, hence I was thinking of cropping the volumes and registering only the torso (which is the most challenging part, plus we have no ground truth in the limbs anyway).

I will put together an evaluation script in the next few days. The problem with the POEM data is the quite low resolution of the images, so I am not sure how sensitive such an approach will be...

simeks commented 6 years ago

Reminds me that I probably need to ask Robin for a GPU at work, because I won't be able to do large-scale registrations on my own. Maybe an RTX 2080, haha. I think the largest contributor to the CUDA aspect of stk is currently gpu_volume.cpp. I will probably look into adding some abstraction for streams as well this week.

This is a hard problem, especially for whole-body. It will be difficult to find a one-size-fits-all solution, so we should probably use several metrics. We could add something like in Priscilla's manuscript, where she evaluated the method using fat and tissue volumes. What's the runtime of one POEM whole-body (wb) registration?

m-pilia commented 6 years ago

Yeah, you should probably get one. BTW, the 2080 Ti has a lot more CUDA cores, so it should be worth it for the registration. I am a bit disappointed by the 11 GB of memory; I was really hoping for an increase, since 11 GB is too little for some DNNs I am using. But that should not matter for the registration...

With the SSD metric, one POEM wb volume takes something around:

With NCC it is something like 15-20 times slower than that.

For the quality metrics, I haven't seen her manuscript, but that seems like a reasonable measure to add. Yesterday I also got the idea to compute a multiscale soft probabilistic Dice (or Jaccard) for the organ segmentations, instead of the classic Dice/Jaccard; that should give a smoother measure, I guess.
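Just to pin down what I mean by soft Dice, a sketch: p and g would be probabilistic (e.g. smoothed) segmentations in [0, 1], evaluated per scale. With binary masks this reduces to the classic Dice score.

```cpp
#include <cstddef>

float soft_dice(const float* p, const float* g, std::size_t n)
{
    float inter = 0.f, total = 0.f;
    for (std::size_t i = 0; i < n; ++i) {
        inter += p[i] * g[i];   // soft overlap
        total += p[i] + g[i];   // soft volumes
    }
    return 2.f * inter / (total + 1e-8f);  // epsilon guards empty masks
}
```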

simeks commented 6 years ago

I only hope Robin isn't too cheap, haha.

Ah, and we probably want to run a large batch of subjects to get reliable results. The drawback of partial-body is that we won't detect if a change causes those ugly leg bugs.

I have a hard time trusting the typical Dice score for whole-body, so that could be interesting. Dice may work for brains, but I think the large variability in region sizes for whole-body makes it hard to interpret.

simeks commented 6 years ago

Crap... The use of textures doesn't really matter for SSD, but for NCC there's something like a 3x difference in favor of textures. I think it's mostly because NCC really benefits from the texture cache. I will have to look into this a bit further.

Another annoying thing I've spent too many hours on today: on Windows there's a watchdog timer that kills the kernel after something like 5 seconds if you use the GPU as the main graphics output. All I got was an error telling me "unspecified launch failure", so it took quite a while to figure out...

m-pilia commented 6 years ago

Yeah, that sucks. The lack of an automatic cache mechanism is what I find the most annoying part of GPGPU; it makes memory access patterns much more influential. And with NCC, the memory access is a total mess on its own. Yesterday Filip was suggesting a possible way to approximate NCC with a Taylor expansion that should make the algorithm as fast as SSD. I am really curious to try that approach and see how it performs.

Lol, good to know, thanks. I had no idea there was such a thing. It sounds pretty dumb, though I gave up trying to understand Microsoft's engineering choices a long time ago. However, I get their point: when doing intensive GPGPU on the graphics output, the machine can become quite unresponsive; it seems GPU schedulers are not as smart as CPU ones. What I usually do under Linux is shut down Xorg and run GPU-intensive stuff in a console session (which also saves quite some GPU memory), but that's not an option on Windows, I think...

simeks commented 6 years ago

We could quite easily cache the fixed volume using shared memory, but I don't think we'll get any real benefit. The moving volume is a whole different thing: I think a caching mechanism there would result in a lot of flow control, which would slow things down.

He has mentioned the Taylor expansion previously; it's an interesting idea, but I'm skeptical. I hope I'm wrong.

You could put it in compute mode, but then you'd need to do all display output through another card. I managed to increase the timeout, so I should be fine as long as the computations don't exceed 10 s. I'm thinking that if I get a new workstation at work I'll probably make it a pure workhorse and use my laptop for the usual stuff. I've grown quite fond of the job queuing system they use over here.
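For the record, what I changed was the timeout in the registry (assuming what we're hitting is Windows' Timeout Detection and Recovery, TDR; takes a reboot to apply):

```
Windows Registry Editor Version 5.00

; Raise the GPU watchdog timeout to 10 seconds
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000000a
```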

m-pilia commented 6 years ago

Well, when using shared memory for manual caching, "easily" is relative; as you said, it can take quite some additional code (and computing work)...

For the Taylor expansion, the idea I got is basically to register the "normalised" gradients of the images instead of the images themselves. It has already been done and it seems to work nicely; probably not as good as NCC, but if it is as fast as SSD it could be a useful compromise in practice. I was thinking of also considering higher-order derivatives (maybe adding a Laplacian term to the registration) and seeing what comes out. The drawback of this approach is that it totally ignores intensities, but we can add a weighted SSD term if we want to account for them too.
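Something in this spirit, similar to normalized gradient fields (a sketch; gf/gm would be precomputed gradients of fixed and moving image, and eps is a tuning parameter deciding what counts as "flat"):

```cpp
// Pointwise cost on normalised gradients: ~0 when the two gradients are
// parallel (or antiparallel), ~1 when they are orthogonal or both flat.
__device__ float ngf_cost(float3 gf, float3 gm, float eps)
{
    float dot = gf.x * gm.x + gf.y * gm.y + gf.z * gm.z;
    float nf  = gf.x * gf.x + gf.y * gf.y + gf.z * gf.z + eps * eps;
    float nm  = gm.x * gm.x + gm.y * gm.y + gm.z * gm.z + eps * eps;
    return 1.0f - (dot * dot) / (nf * nm);
}
```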

Yeah, that's what I am also doing now: I have one workstation for ML, one for CPU-intensive stuff, and the laptop for the rest, haha.

I think that's one good reason to consider a good Core i7/i9 over a Xeon when configuring a small ML-capable desktop workstation: you get an integrated GPU that takes care of the video output. Yes, it has no ECC and half the memory bandwidth, but unless you have one of those super expensive 8/12/24/36-core parts, the difference in computing power is often absent or not worth the price difference (plus you possibly don't need it if you rely on GPU power). Some of the Xeons we have bought in recent installations at the hospital are quite crap IMHO; I explicitly asked for a specific CPU upgrade when ordering my new workstation. For my personal workstation I went with a Ryzen 7 2700X instead; it is a very nice toy, but it also lacks integrated graphics, and that's what I miss most.

simeks commented 6 years ago

For the fixed volume, shared memory is really simple since you have a 1:1 mapping: you simply copy everything at the beginning of the kernel and sync. For the moving volume you don't know in advance; two neighbouring displacements could end up really far from each other.
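Roughly what I mean for the fixed volume (a sketch; BLOCK and the kernel shape are hypothetical, assuming an 8x8x8 launch block):

```cpp
#define BLOCK 8

__global__ void unary_cost_kernel(const float* fixed, int w, int h, int d)
{
    __shared__ float tile[BLOCK][BLOCK][BLOCK];

    int x = blockIdx.x * BLOCK + threadIdx.x;
    int y = blockIdx.y * BLOCK + threadIdx.y;
    int z = blockIdx.z * BLOCK + threadIdx.z;

    // 1:1 mapping: each thread stages exactly the fixed voxel it owns.
    if (x < w && y < h && z < d)
        tile[threadIdx.z][threadIdx.y][threadIdx.x] =
            fixed[((size_t)z * h + y) * w + x];
    __syncthreads();

    // ...compute the unary term here, reading the fixed value from tile.
    // The moving volume can't be staged like this: neighbouring threads
    // may be displaced to arbitrary, far-apart sample positions.
}
```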

Ahh, yeah, that's the same thing I've tried previously. It worked very well for large gradient steps in the images, like contours, but more homogeneous areas were troublesome. I was thinking about weighting it with SSD to get a good middle ground, but then I got distracted by other things.

Yeah, there are a lot of things to consider. I probably need to discuss it with Robin when I'm back from the course. There has been talk about a dedicated Imiomics machine, and then they would probably be willing to pay more. But at the same time, I'd then be responsible for a lot of Imiomics analyses and would probably need to share it with others, haha.

simeks commented 6 years ago

What the... For the benchmarks I just pushed, there's no real difference in runtime between texture and pitched memory. There has to be a bug somewhere.

simeks commented 6 years ago

Nvm, there was a bug in the benchmark so textures are still faster. Doh...

m-pilia commented 6 years ago

> I was thinking about weighting it with SSD to get a good middle ground

Same idea, maybe with an adaptive weight that is high when the gradient magnitude is low, and vice versa. This way the gradient information dominates on contours without SSD messing things up, but we still get some SSD information where the gradient is flat.
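For instance (a sketch; sigma is a hypothetical tuning parameter and grad_mag the local gradient magnitude of the fixed image):

```cpp
__device__ float blended_cost(float ssd, float grad_cost,
                              float grad_mag, float sigma)
{
    // w -> 1 in flat regions (SSD dominates),
    // w -> 0 on strong contours (the gradient term dominates).
    float w = expf(-(grad_mag * grad_mag) / (2.0f * sigma * sigma));
    return w * ssd + (1.0f - w) * grad_cost;
}
```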

> Nvm, there was a bug in the benchmark so textures are still faster. Doh...

Too bad... Anyway, I am done training the DNN, so maybe later I will try the CUDA benchmarks on my workstation (if I manage; I am already working on something like four tasks in parallel...).

simeks commented 6 years ago

That sounds interesting. It should be quite easy to set up; I guess the time-consuming part is finding good parameters.

simeks commented 6 years ago

I'll close this for now. The interpolation will probably be a topic for discussion when we actually get something running on the GPU though.