bertsky opened this issue 7 months ago
I've seen these kinds of memory leaks happen with TF 1, but AFAICR not with TF 2. (See https://github.com/qurator-spk/sbb_column_classifier - I think just upgrading fixed it, but maybe the "TF best practices" were necessary too.)
What I describe happens on TF 2.13.1, which should be fully supported.
This issue is a show-stopper for me: with OCR-D it's not even possible to keep the results already produced, since they are only persisted in the METS at the end of the loop.
@mikegerber what do you mean by TF Best Practices – some particular document perhaps?
The things I did in sbb_column_classifier to make it process ~ 20 million pages:
1a. Updating to TF 2
1b. IIRC, using TF graph execution / TF functions (JIT?)
I'm not sure if I did 1b to fix any memory leaks; it may have just been for better performance. A rough sketch of what I mean is below.
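To illustrate 1b: a minimal sketch of pinning the input signature of a `tf.function`, so TF does not retrace (and cache a new concrete function) for every new input shape, which is one common source of slowly growing memory in TF 2 inference loops. The model, `predict`, `pages`, and `load_page` here are all hypothetical placeholders, not the actual sbb_column_classifier code:

```python
import tensorflow as tf

# Hypothetical stand-in for the real network; layer and shapes are
# assumptions for illustration only.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])

# A fixed input_signature prevents retracing per input shape.
@tf.function(input_signature=[tf.TensorSpec([None, None, None, 3], tf.float32)])
def predict(batch):
    return model(batch, training=False)

# Usage inside the page loop (load_page is a placeholder):
# for page in pages:
#     result = predict(load_page(page)).numpy()
```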
When processing a document of 1.5k pages of medium size (1–2 MP each), I am observing a slow but steady increase in RSS, from 4 GB up to 14 GB after 1.2k pages, at which point the process gets killed by the OS (`Killed`). I do not see any Python references reachable from the input file loop that could accumulate such data without ever being GCed.
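For diagnosis, a minimal sketch (assuming `psutil` is available; `pages` and `run_inference` are placeholders for the actual loop and model call) that logs RSS per iteration. If RSS keeps climbing even right after `gc.collect()`, the growth is likely in native TF/CUDA allocations rather than cyclic Python garbage:

```python
import gc
import os

import psutil  # assumption: psutil is installed, purely for measurement

proc = psutil.Process(os.getpid())

for i, page in enumerate(pages):  # `pages` / `run_inference` are placeholders
    result = run_inference(page)
    if i % 100 == 0:
        gc.collect()  # rules out cycles still waiting for the collector
        print(f"page {i}: RSS = {proc.memory_info().rss / 2**20:.0f} MiB")
```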
I am on CUDA 11.8.
Has anybody seen this before?