mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Excessive CUDA GPU memory usage for simple jobs #424

Closed drzraf closed 1 year ago

drzraf commented 1 year ago

I'd expect that using a GPU (even an older NVIDIA one, like a 2 GB GM107) is in all cases superior to plain-CPU usage. Sadly, it seems the current implementation can't delegate jobs to the GPU when less than 2 GB of memory is available.

I tried multiple settings for the (under-documented) PYTORCH_CUDA_ALLOC_CONF (like max_split_size_mb:64,roundup_bypass_threshold_mb:64), but something as simple as:

```
kraken -vrd cuda:0 -i foo.jpg foo.txt segment -bl ocr -m lectaurep_base.mlmodel
```

always leads to an OutOfMemoryError:

```
CUDA out of memory. Tried to allocate 1.15 GiB (GPU 0; 1.96 GiB total capacity; 964.64 MiB already allocated; 266.12 MiB free; 988.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

Memory usage increases following this pattern:

  1. At launch [...], 2 MiB
  2. Before `Loading model from [...]`, 373 MiB
  3. After `Loading model from [...]`, 664 MiB
  4. At `INFO Segmenting [blla.py:290]`, 1092 MiB
  5. At `INFO Segmenting [blla.py:75]`, 1802 MiB (error)

This seems excessive (and detrimental) for segmenting/OCR-ing a 1.7 MB JPEG with a 16 MB model. In the torch integration, isn't there room to tweak the job/batch/memory size so that (common) GPUs could be used?
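
For reference, here is a minimal sketch of how the per-stage numbers above could be reproduced from Python, assuming kraken's `blla`/`vgsl` API; the `blla.mlmodel` path stands in for the default baseline segmentation model, and the exact `segment()` signature may differ between kraken versions:

```python
import torch
from PIL import Image
from kraken import blla
from kraken.lib import vgsl

def log_mem(stage):
    # CUDA allocator statistics for GPU 0, in MiB
    alloc = torch.cuda.memory_allocated(0) / 2**20
    resv = torch.cuda.memory_reserved(0) / 2**20
    print(f'{stage}: allocated={alloc:.0f} MiB, reserved={resv:.0f} MiB')

log_mem('at launch')
# 'blla.mlmodel' is a placeholder for kraken's default baseline segmentation model
seg_model = vgsl.TorchVGSLModel.load_model('blla.mlmodel')
log_mem('after loading model')
im = Image.open('foo.jpg')
# on a 2 GiB card this is where the OutOfMemoryError above is raised
seg = blla.segment(im, model=seg_model, device='cuda:0')
log_mem('after segmentation')
```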

mittagessen commented 1 year ago

Claiming memory use is excessive based on the encoded JPEG and model weight sizes is ... weird ... to say the least. You're not going to get the segmenter to work on a GPU with much less than 10 GB of memory, no matter what kind of optimization (or framework) you're running. LSTMs use memory; that's just how it is.

Luckily, ~50% of the segmentation time is actually spent in post-processing, so the speedup provided by GPUs is rather limited in any case. The same is true for recognition inference. You might just be able to fit recognition training on your small GPU, but it will be a tight fit.
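
As a rough local check of that claim, here is a hedged timing sketch (the `blla.mlmodel` path is a placeholder and the `segment()` signature may vary between kraken versions):

```python
import time
from PIL import Image
from kraken import blla
from kraken.lib import vgsl

im = Image.open('foo.jpg')
# placeholder path for a baseline segmentation model
model = vgsl.TorchVGSLModel.load_model('blla.mlmodel')

for device in ('cpu', 'cuda:0'):
    start = time.perf_counter()
    blla.segment(im, model=model, device=device)
    elapsed = time.perf_counter() - start
    # wall-clock time includes the CPU-bound post-processing (line vectorisation),
    # so the GPU run is expected to be only moderately faster than the CPU one
    print(f'{device}: {elapsed:.1f} s')
```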

drzraf commented 1 year ago

10 GB is unrealistic for most people. At the very least, this limitation/prerequisite should be clearly mentioned in the README/docs. Let's hope PyTorch 2.0 will decrease this amount. On a common laptop, recognition takes up to 40 seconds per image, which is workable for a dozen images at most. In other words, batch OCR-ing thousands of images is still out of the question.
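
To put the throughput concern in numbers, here is a hedged sketch that times CPU-only segmentation plus recognition per page with kraken's Python API (the `pages/` directory is hypothetical; the recognition model is the one from the command above). At ~40 s/page, a thousand pages come to roughly 11 hours of wall-clock time:

```python
import glob
import time
from PIL import Image
from kraken import blla, rpred
from kraken.lib import models

# recognition model from the command above; 'pages/' is a hypothetical input directory
rec_model = models.load_any('lectaurep_base.mlmodel', device='cpu')
pages = glob.glob('pages/*.jpg')

total = 0.0
for path in pages:
    im = Image.open(path)
    start = time.perf_counter()
    seg = blla.segment(im)  # default baseline segmentation model, on CPU
    text = [rec.prediction for rec in rpred.rpred(rec_model, im, seg)]
    total += time.perf_counter() - start

print(f'{len(pages)} pages in {total:.0f} s '
      f'({total / max(len(pages), 1):.1f} s/page on average)')
```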