mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

Excessive CUDA GPU memory usage for simple jobs #424

Closed drzraf closed 1 year ago

drzraf commented 1 year ago

I'd expect that using a GPU (even an older NVIDIA one, like a 2 GB GM107) is in all cases superior to plain-CPU usage. Sadly, it seems the current implementation can't delegate jobs to the GPU when less than 2 GB of memory is available.

I tried multiple settings for the (under-documented) PYTORCH_CUDA_ALLOC_CONF (like max_split_size_mb:64,roundup_bypass_threshold_mb:64), but something as simple as:

```
kraken -vrd cuda:0 -i foo.jpg foo.txt segment -bl ocr -m lectaurep_base.mlmodel
```

always leads to an OutOfMemoryError:

```
CUDA out of memory. Tried to allocate 1.15 GiB (GPU 0; 1.96 GiB total capacity; 964.64 MiB already allocated; 266.12 MiB free; 988.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

Memory usage increases following this pattern:

  1. At launch [...], 2 MiB
  2. Before `Loading model from [...]`, 373 MiB
  3. After `Loading model from [...]`, 664 MiB
  4. At `INFO Segmenting [blla.py:290]`, 1092 MiB
  5. At `INFO Segmenting [blla.py:75]`, 1802 MiB (error)

This seems excessive (and detrimental) for segmenting/OCR-ing a 1.7 MB JPEG with a 16 MB model. In the torch integration, isn't there room to tweak the job/batch/memory size so that (common) GPUs could be used?
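
For reference, here is a minimal sketch of how the per-stage numbers above could be reproduced from Python, assuming kraken's `blla`/`vgsl` API; the `blla.mlmodel` path stands in for the default baseline segmentation model, and the exact `segment()` signature may differ between kraken versions:

```python
import torch
from PIL import Image
from kraken import blla
from kraken.lib import vgsl

def log_mem(stage):
    # CUDA allocator statistics for GPU 0, in MiB
    alloc = torch.cuda.memory_allocated(0) / 2**20
    resv = torch.cuda.memory_reserved(0) / 2**20
    print(f'{stage}: allocated={alloc:.0f} MiB, reserved={resv:.0f} MiB')

log_mem('at launch')
# 'blla.mlmodel' is a placeholder for kraken's default baseline segmentation model
seg_model = vgsl.TorchVGSLModel.load_model('blla.mlmodel')
log_mem('after loading model')
im = Image.open('foo.jpg')
# on a 2 GiB card this is where the OutOfMemoryError above is raised
seg = blla.segment(im, model=seg_model, device='cuda:0')
log_mem('after segmentation')
```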

mittagessen commented 1 year ago

Claiming memory use is excessive based on the encoded JPEG and model weight sizes is ... weird ... to say the least. You're not going to get the segmenter to work on a GPU with much less than 10 GB of memory, no matter what kind of optimization (or framework) you're running. LSTMs use memory; that's just how it is.

Luckily, ~50% of the segmentation time is actually spent in post-processing, so the speedup provided by GPUs is rather limited in any case. The same is true for recognition inference. You might just be able to fit recognition training on your small GPU, but it will be a tight fit.
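
As a rough local check of that claim, here is a hedged timing sketch (the `blla.mlmodel` path is a placeholder and the `segment()` signature may vary between kraken versions):

```python
import time
from PIL import Image
from kraken import blla
from kraken.lib import vgsl

im = Image.open('foo.jpg')
# placeholder path for a baseline segmentation model
model = vgsl.TorchVGSLModel.load_model('blla.mlmodel')

for device in ('cpu', 'cuda:0'):
    start = time.perf_counter()
    blla.segment(im, model=model, device=device)
    elapsed = time.perf_counter() - start
    # wall-clock time includes the CPU-bound post-processing (line vectorisation),
    # so the GPU run is expected to be only moderately faster than the CPU one
    print(f'{device}: {elapsed:.1f} s')
```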

drzraf commented 1 year ago

10 GB is unrealistic for most people. At the very least, this limitation/prerequisite should be clearly mentioned in the README/docs. Let's hope PyTorch 2.0 will decrease this amount. On a common laptop, recognition takes up to 40 seconds per image, which is workable for a dozen images at most. In other words, batch OCR-ing thousands of images is still out of the question.
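
To put the throughput concern in numbers, here is a hedged sketch that times CPU-only segmentation plus recognition per page with kraken's Python API (the `pages/` directory is hypothetical; the recognition model is the one from the command above). At ~40 s/page, a thousand pages come to roughly 11 hours of wall-clock time:

```python
import glob
import time
from PIL import Image
from kraken import blla, rpred
from kraken.lib import models

# recognition model from the command above; 'pages/' is a hypothetical input directory
rec_model = models.load_any('lectaurep_base.mlmodel', device='cpu')
pages = glob.glob('pages/*.jpg')

total = 0.0
for path in pages:
    im = Image.open(path)
    start = time.perf_counter()
    seg = blla.segment(im)  # default baseline segmentation model, on CPU
    text = [rec.prediction for rec in rpred.rpred(rec_model, im, seg)]
    total += time.perf_counter() - start

print(f'{len(pages)} pages in {total:.0f} s '
      f'({total / max(len(pages), 1):.1f} s/page on average)')
```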