qurator-spk / sbb_binarization

Document Image Binarization
Apache License 2.0

OCR-D processor is leaky #66

Open bertsky opened 7 months ago

bertsky commented 7 months ago

When processing a document of 1.5k medium-sized pages (1-2 MP each), I observe a slow but steady increase in RSS from 4 GB to 14 GB over the first 1.2k pages, at which point the process gets killed by the OS (Killed).

I do not see any Python bindings accessible to the input file loop which could accumulate such data without ever being GCed.
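One way to back up this kind of observation is to compare Python-level allocations before and after the page loop with `tracemalloc` from the standard library. The following is only a minimal sketch: `process_page` and `leak_store` are hypothetical stand-ins and not part of the actual processor. Note that `tracemalloc` only sees memory allocated through Python's allocators, so a flat result here combined with a growing RSS would point to native memory (for example inside TF or CUDA) rather than to Python objects.

```python
import gc
import tracemalloc


def process_page(page_number, leak_store=None):
    """Hypothetical stand-in for the per-page work (not the real processor)."""
    buffer = bytearray(100_000)  # pretend this is the decoded page image
    if leak_store is not None:
        # Simulate a lingering reference that keeps every page alive.
        leak_store.append(buffer)
    return len(buffer)


def python_heap_growth(num_pages, leak_store=None):
    """Return growth (in bytes) of Python-tracked allocations over the loop."""
    gc.collect()
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    for page_number in range(num_pages):
        process_page(page_number, leak_store)
    gc.collect()  # make sure collectable garbage is not counted as a leak
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


if __name__ == "__main__":
    print("clean loop:", python_heap_growth(50), "bytes")
    print("leaky loop:", python_heap_growth(50, leak_store=[]), "bytes")
```

If the clean variant stays near zero while RSS still climbs in the real run, the accumulation is probably happening outside the Python heap.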

I am on CUDA 11.8

Has anybody seen this before?

mikegerber commented 5 months ago

I've seen these kinds of memory leaks happen with TF 1, but AFAICR not with TF 2. (See https://github.com/qurator-spk/sbb_column_classifier - I think just upgrading fixed it, but maybe the "TF best practices" were necessary too.)

bertsky commented 5 months ago

What I describe happens on TF 2.13.1, which should be fully supported.

This issue is a show-stopper for me, as with OCR-D, it's not even possible to keep the results already produced (since they are only persisted in the METS at the end of the loop).

@mikegerber what do you mean by TF Best Practices – some particular document perhaps?

mikegerber commented 4 months ago

> @mikegerber what do you mean by TF Best Practices – some particular document perhaps?

The things I did in sbb_column_classifier to make it process ~ 20 million pages:

1a. Updating to TF2

1b. IIRC using TF graph execution, TF functions (JIT?)

2. Dealing with flow problems due to the interleaved CPU processing (I would probably look into using some kind of bounded queues now, but solved it using semaphores at the time.)

I'm not sure whether I did 1b to fix any memory leaks; it may have just been for better performance.
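For illustration, the bounded-queue idea mentioned in the list above could look roughly like the following. This is a minimal sketch using only the standard library, not the code from sbb_column_classifier: `load_page`, `binarize`, and `run_pipeline` are hypothetical names, and the real CPU and model steps are replaced by trivial placeholders. The point is only that `queue.Queue(maxsize=...)` makes the producer block once enough pages are pending, so preprocessed pages cannot pile up in memory faster than they are consumed.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the page stream


def load_page(page_number):
    """Hypothetical CPU-side step (e.g. decoding and resizing an image)."""
    return [page_number] * 4


def binarize(page):
    """Hypothetical model-side step (placeholder for the prediction)."""
    return sum(page)


def run_pipeline(num_pages, max_pending=4):
    """Process pages with at most `max_pending` preprocessed pages in flight."""
    pending = queue.Queue(maxsize=max_pending)  # put() blocks when full
    results = []
    peak_pending = 0

    def producer():
        for page_number in range(num_pages):
            pending.put(load_page(page_number))
        pending.put(_SENTINEL)

    thread = threading.Thread(target=producer, daemon=True)
    thread.start()
    while True:
        # qsize() is approximate, but it never exceeds maxsize.
        peak_pending = max(peak_pending, pending.qsize())
        page = pending.get()
        if page is _SENTINEL:
            break
        results.append(binarize(page))
    thread.join()
    return results, peak_pending


if __name__ == "__main__":
    results, peak = run_pipeline(10, max_pending=3)
    print(len(results), "pages processed, peak queue size", peak)
```

Compared with hand-managed semaphores, the queue bundles the hand-off and the back-pressure in one object, which is presumably why it would be the first thing to try today.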