sn4k3 / UVtools

MSLA/DLP, file analysis, calibration, repair, conversion and manipulation
GNU Affero General Public License v3.0

[FEATURE REQUEST] CUDA acceleration? #281

Open apullin opened 2 years ago

apullin commented 2 years ago

Is there any possibility of getting CUDA acceleration enabled for the computationally heavy operations?

I recall from reading some documentation that this is not as simple as just enabling CUDA and rebuilding, and would require explicit re-implementation in places to move data to/from GPU buffers.

So this might be a significant long-term project.

sn4k3 commented 2 years ago

Yes, it requires the code to be rewritten.

I haven't done any tests, but my guess is that performance will be worse than a decent CPU. The internet is flooded with this same question and with reports of poor CUDA results compared to the CPU.

In UVtools, most operations are performed once per image, on each layer.

So we would waste time transferring the image back and forth just to perform one operation, which is a performance killer.
Where CUDA wins is when you have a ton of work to do on the same mat, not just one operation.
CUDA also benefits from larger images (the bigger the image, the better it does versus the CPU) and from video tasks.

UVtools has a few areas where it performs multiple operations on the same mat, but even there the GPU would be a bottleneck, because most of those operations run and then need some checks on the mat/pixels, and for that you have to transfer the image back and use the CPU after all.
Another downside is that not every CPU operation has a GPU counterpart.

I can run a test later, but I have no hope of better performance.
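The transfer-overhead argument above can be put into a rough break-even model: offloading pays a fixed upload/download cost per layer, so it only wins when enough operations are chained on the GPU between transfers. A minimal sketch in Python, where all timings (round-trip cost, per-op costs) are made-up illustrative figures, not measurements from UVtools:

```python
# Rough break-even model for GPU offload of per-layer image operations.
# All timings below are illustrative assumptions, not measurements.

def break_even_ops(round_trip_ms: float, cpu_op_ms: float, gpu_op_ms: float) -> float:
    """Number of chained operations needed before the GPU path wins.

    GPU cost for k ops: round_trip_ms + k * gpu_op_ms
    CPU cost for k ops: k * cpu_op_ms
    The GPU wins when k > round_trip_ms / (cpu_op_ms - gpu_op_ms).
    """
    saving_per_op = cpu_op_ms - gpu_op_ms
    if saving_per_op <= 0:
        return float("inf")  # GPU op is not faster: offloading never pays off
    return round_trip_ms / saving_per_op

# Hypothetical figures: ~4 ms upload+download+sync per layer,
# 5 ms per operation on the CPU, 2 ms per operation on the GPU.
k = break_even_ops(round_trip_ms=4.0, cpu_op_ms=5.0, gpu_op_ms=2.0)
print(f"GPU pays off only when chaining more than {k:.2f} ops per layer")
```

Under these assumed numbers, a single operation per layer (UVtools' common case) stays cheaper on the CPU; only chains of two or more GPU operations between transfers would break even.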

apullin commented 2 years ago

Interesting.

Along the same lines: is OpenCL acceleration built into the OpenCV lib? Again, I know little about it, but this SO question suggests that it needs to be explicitly enabled: https://stackoverflow.com/questions/28505252/how-do-i-take-the-advantage-of-opencl-in-an-emgu-cv-project

Beyond that, there seems to be a fair amount of controversy around OS-level support for these various accelerator solutions, so some of them might be total dead ends.

wrt CUDA, it seems like some operations could be swapped out directly (with supporting code): https://docs.opencv.org/3.4.0/d5/dc3/group__cudalegacy.html#ga92b4e167cd92db8a9e62e7c2450e4363

e.g. for CvInvoke.ConnectedComponentsWithStats: https://github.com/sn4k3/UVtools/blob/08a5797f36867c9b4c9b58da94e61bbea6258ae1/UVtools.Core/Layer/LayerManager.cs#L1157

fwiw, back of the envelope for 16-bit mono 4K matrices: a 4 GB GPU could nominally hold ~250 layers at once.

An interesting first step might be to profile the current implementation to see whether certain OpenCV-wrapped functions take the bulk of the time. (I would do this, but I haven't gotten local building & running working yet.)

sn4k3 commented 2 years ago

Along the same lines: is OpenCL accel built into the OpenCV lib?

It is, and it's already in use in some functions by default; you don't need to enable it. You can see your GPU being used when opening or saving a file (encoding/decoding is processed via OpenCL). E.g., see this 3D graph while a search for issues is running:

[Screenshot: Task Manager 3D GPU usage graph (Taskmgr_2021-09-08_18-17-35)]

Again, I know little about it, but this SO suggests that it needs to be explicitly enabled:

No, the default is to use OpenCL when supported; we don't need to enable it:

[Screenshot: Visual Studio (devenv_2021-09-08_18-20-37)]

fwiw, back of the envelope for 16bit mono 4k matrices, a 4GB GPU could nominally hold ~250 layers at once.

UVtools processes 8-bit bitmaps, so it's more like 500 layers. That also scares me: these days PCs have more RAM than GPUs, and while the CPU can fall back to disk when required (swap), I have no idea what happens on a GPU when it runs out of memory... Still, it will only load a number of images equal to your degree of parallelism.
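The arithmetic behind these layer-count estimates, sketched in Python for illustration (nominal sizes for a 3840x2160 mono layer, ignoring allocation overhead, alignment, and anything else resident on the GPU):

```python
# Back-of-the-envelope: how many uncompressed mono 4K layers fit in 4 GB?
WIDTH, HEIGHT = 3840, 2160
GPU_BYTES = 4 * 1024**3

layer_16bit = WIDTH * HEIGHT * 2  # 16-bit mono: ~16.6 MB per layer
layer_8bit = WIDTH * HEIGHT * 1   # 8-bit mono (what UVtools uses): ~8.3 MB

print(GPU_BYTES // layer_16bit)  # 258 -> the "~250 layers" estimate
print(GPU_BYTES // layer_8bit)   # 517 -> the "~500 layers" estimate
```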

I'm waiting on the CUDA package to update, because right now I can't use it: it's one version behind the main lib. But I have code ready to test.

My faith in this is really low, though... Maybe it will benefit 8K images. But a poor laptop with slow interfaces will kill all the performance.

Eg for morph:

if (CoreSettings.CanUseCuda)
{
    // GPU path: upload the layer to the GPU, run the morphology filter
    // there, then download the result back into the CPU-side mat.
    var gpuMat = target.ToGpuMat();
    using var morph = new CudaMorphologyFilter((MorphOp)MorphOperation, target.Depth, target.NumberOfChannels, Kernel.Matrix, Kernel.Anchor, iterations);
    morph.Apply(gpuMat, gpuMat);
    gpuMat.Download(target);
}
else
{
    // CPU path: operate on the mat in place.
    CvInvoke.MorphologyEx(target, target, (MorphOp)MorphOperation, Kernel.Matrix, Kernel.Anchor, iterations, BorderType.Reflect101, default);
}
sn4k3 commented 2 years ago

Further note: I tested UMat (OpenCL) on resin trap detection, which uses many OpenCV calls, and performance was worse by a huge amount. The CPU sat nearly idle and the GPU fired up, but performance was so bad that it doesn't deserve to be used. Again, the only benefit should be in video and AI workloads, where the stream is constant, or in pipelines that chain many operations on the same UMat without needing CPU-side steps.

I didn't test CUDA yet, but I don't see a future for it either. Will report later.

apullin commented 2 years ago

Interesting. As far as I know, the standard development cycle for CUDA is to implement, then use the NVIDIA profiler to check whether you are using all your bandwidth for copies to/from the GPU, or whether one side is waiting on the other, and then iterate.

It looks like there are a few of the OpenCV calls that have accelerated implementations in CUDA. It seems like an interesting project, and I might try to explore it if/when I get a local build working.

One first step (I could open a separate ticket): it looks like some of the code could use refactoring to put the "big" operations behind their own functions, rather than inlined in a bigger outer loop. Then, on top of that, if you could get a profiler running, you could measure the relative impact of each operation; if one is absolutely dominant, that might be the first target for a GPU acceleration feature.

sn4k3 commented 2 years ago

CUDA sample: morph erode with 10 px, CUDA on the left (32.42 s), CPU on the right (8.72 s). Again, no benefit on simple tasks, but a ~3.7x slowdown!

[Screenshot: UVtools benchmark comparison (UVtools_2021-10-14_16-27-19)]

Azmisov commented 2 years ago

If you use pinned GPU memory, memory transfers will usually not be a bottleneck. You can also use a double buffer: upload one layer while another is being processed. Even better, don't treat each mat individually; load N mats at a time and process them as a batch.

I'm surprised your custom CUDA code is ~4x slower, especially on an 8 s operation. Even with simple operations, I've found the GPU to be faster than manually vectorized OpenMP CPU code. Are you making use of shared memory, and have you made sure you have high kernel occupancy? Image editors often use the GPU for simple one-off filters, so I don't think GPUs being useful only for video and AI is justified.
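The double-buffering suggestion can be sketched as a generic two-stage pipeline. The snippet below simulates it in Python with a background thread that "uploads" the next layer while the current one is "processed"; both stage functions are hypothetical stand-ins that just tag their input, not real GPU calls:

```python
from concurrent.futures import ThreadPoolExecutor

def upload(layer):
    # Stand-in for a host-to-GPU copy (e.g. GpuMat upload); just tags the data.
    return ("on_gpu", layer)

def process(staged):
    # Stand-in for the GPU kernel; just re-tags the staged data.
    _tag, layer = staged
    return ("processed", layer)

def run_pipeline(layers):
    """Double buffer: upload layer i+1 while layer i is being processed."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as uploader:
        pending = uploader.submit(upload, layers[0])
        for nxt in layers[1:]:
            staged = pending.result()               # wait for the finished upload
            pending = uploader.submit(upload, nxt)  # start the next upload...
            results.append(process(staged))         # ...while processing this layer
        results.append(process(pending.result()))   # drain the last staged layer
    return results

print(run_pipeline([0, 1, 2]))
# [('processed', 0), ('processed', 1), ('processed', 2)]
```

With real GPU work, the upload of layer i+1 overlaps the kernel on layer i, hiding part of the transfer cost; the order of results is preserved because only one upload is in flight at a time.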

sn4k3 commented 2 years ago

I don't have experience with GPU calls, but I followed their samples and practices. I tested using both a plain for and a parallel for to load the mats and got the same worse results. The problem is that UVtools' heavy operations do 'if' logic on mat pixels, so why have CUDA at all if you have to convert back to a CPU mat all the time? It would also add GBs to the package size so that just a small share of users, those with strong and recent graphics cards, could use it.

Workflow of an operation: decode the layer image in RAM, upload the mat to the GPU, run the CUDA operation, download the mat back to the CPU and RAM.

So we have two transfers just to use one CUDA operation per layer. A cheap operation, say adding two mats, takes about 5 ms per core on my CPU, which is dirt cheap; that's why I can't see any benefit from using CUDA here. We are too dependent on the CPU, and the transfers just kill performance. But feel free to prove me wrong.

In my tests, the CPU version just kills both OpenCL (UMat) and CUDA (GpuMat); both get beaten easily. Maybe I'm doing something wrong, I don't know... If you have experience, fork UVtools and give it a try on a cheap operation like blur or resize, which are easy to convert, and report back with results. If it shows a large benefit I may consider it.

leefogg commented 1 year ago

I also agree that GPU acceleration would be a great addition. However, I'd recommend a larger framework such as OpenGL or DirectX, because UVtools could then better utilize GPU-side memory, ideally only ever reading back small buffers to show statistics in the UI. For instance, rendering of the matrices could be done GPU-side, so there would be no need to read back from the GPU after an operation just to render to the UI.

To add further reasons why GPU acceleration would be much faster:

Other benefits include:

I've been researching this for the last couple of days, as it's getting more and more important now that resin printers are getting higher-resolution screens. Mine is 8K; with a few thousand layers (for an overnight print) it can take nearly 30 minutes to run a few passes finding and fixing issues, most of that time spent waiting for processing.

sn4k3 commented 1 year ago

It's normal for performance to slow down at 8K resolutions; there are many more pixels to deal with. Even then, the performance is better than I could expect. You just need a proper desktop; laptops are a joke for serious computation. Also make sure you process only one copy of the model and pattern it afterwards in UVtools; the fewer pixels on the plate, the better for UVtools. To gain additional performance, select LZ4 in the settings.

DirectX is out, as it's Windows-only. OpenCL is one candidate, which OpenCV uses (UMat) but poorly, so the code would need to be written from scratch, and I don't have any experience nor know how to write the algorithms.

There is also GpuMat, which is CUDA-based. I don't know about its code efficiency, but the benchmark I did resulted in poor performance; even if I discard the CPU time in between the processing, CUDA is still slower than the CPU-based solution, so their implementation may not be optimal either.

Also note that you can't discard the CPU even for the library (not the UI): we need to store a cache somewhere, and no GPU will hold a high count of 8K images unless you compress them there; today's PCs have more RAM than GPUs. You also have many operations accessing pixels to perform 'if' logic, so the CPU requirement will always be a bottleneck unless someone manages to run that directly on the GPU.

If you want to start on this, there is a library that can help: Amplifier.NET

jremen commented 9 months ago

@sn4k3 What about Neural Engine support on Apple Silicon Macs? This would be a huge boost.

apullin commented 9 months ago

@jremen afaik, the Neural Engine is not a general-purpose accelerator; only workloads that fit into the Core ML framework can target it.

I believe the Apple "Accelerate" library can do general compute on the Apple Silicon GPU, though.

I don't know if there is some general accelerator framework that could be adopted to target both the Apple GPU and NVIDIA GPUs. OpenCL is deprecated on Apple silicon now, afaik.

That being said: I have run some tests for the "12K" printers out now, and it runs shockingly fast on an M2 Max. I was able to easily post-process a large model, and the compute was under 1 minute in all cases, I believe.

sn4k3 commented 9 months ago

Optimizations should come from the OpenCV library; a framework would not benefit UVtools, since a substantial portion of the code and all the algorithms come from an external source: OpenCV. Using any kind of accelerator would only help if the algorithms were written and implemented by ourselves, which is not the case; we depend on OpenCV for everything. I have no wish to rewrite algorithms that are already heavily optimized and matured by OpenCV, and I'm not into writing platform-specific code, targeting different frameworks, and building complex binaries.

The only possible accelerator to use is CUDA, because it is inside OpenCV, but my tests still show a huge performance loss compared to the CPU. This is because each object is stored in RAM and must be decoded, converted, and sent to the GPU and then back to the CPU and RAM. The CPU code alone is so well optimized that CUDA is defeated by this kind of usage. CUDA would win if everything could stay and be processed within the GPU without much need for the CPU; unfortunately, that's not the case. CUDA also lacks many functions that only run on the CPU. There are also multiple reports that the OpenCL and CUDA parts of OpenCV are not well optimized, because of the transfers.

As resolutions grow, it's normal that processing time increases; not long ago we were at 1080p, and now 12K, which is a huge jump in every respect (2,073,600 pixels vs 79,626,240 pixels). In every computational problem, apart from optimizations, if you want speed you need to boost your hardware. For UVtools, give it the best CPU and RAM you can, as it will utilize them fully.

I have run some tests for the "12k" printers out now, and it runs shockingly fast on an M2 Max. I was able to easily post-process a large model, and the compute was < 1 minute in all cases, I believe.

Most people don't believe how good OpenCV is at image processing, and UVtools is also well optimized; just look at the fact that you have 12K x n layers loaded into memory, with each one decoded/encoded on access/save. In addition, for detections and some operations, UVtools works only inside the usable area of the image: if you only use 1000x1000 pixels, it will crop to that area and work only inside it, which processes faster; however, the 12K image still needs to be loaded and saved back.
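The crop-to-usable-area idea can be illustrated with a tiny Python sketch: find the bounding box of the non-zero pixels and crop to it, so later passes touch far fewer pixels. Plain nested lists stand in for an OpenCV mat here; this is an illustration of the idea, not UVtools' actual implementation:

```python
# Find the smallest box holding all non-zero pixels, then crop to it.
def used_roi(image):
    """Return (top, left, bottom, right) of the non-zero bounding box, or None if blank."""
    rows = [y for y, row in enumerate(image) if any(row)]
    cols = [x for x in range(len(image[0])) if any(row[x] for row in image)]
    if not rows:
        return None  # blank layer: nothing to process
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1

def crop(image, roi):
    top, left, bottom, right = roi
    return [row[left:right] for row in image[top:bottom]]

layer = [
    [0, 0, 0,   0,   0],
    [0, 0, 255, 255, 0],
    [0, 0, 255, 0,   0],
    [0, 0, 0,   0,   0],
]
roi = used_roi(layer)
print(roi)               # (1, 2, 3, 4)
print(crop(layer, roi))  # [[255, 255], [255, 0]]
```

Detection passes then run on the small crop; only loading/saving still has to touch the full-resolution image, matching the limitation noted above.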

I hope that OpenCV 5 brings a substantial boost. .NET 8 will also help the internal parts of UVtools when we migrate to it in the future.