qurator-spk / sbb_binarization

Document Image Binarization
Apache License 2.0

Fix space leak #17

Closed sulzbals closed 3 years ago

sulzbals commented 3 years ago

It looks like sbb-binarization uses a lot more memory than it needs, potentially leading the host machine to run out of memory depending on the available RAM and the size of the workflow.

Let k be the number of models used and n be the number of images in the workflow. By looking at the code we can see that:

1) The program instantiates n TensorFlow session objects despite needing only 1;
2) The program instantiates k·n model objects despite needing only k.

These issues are solved by moving some routines called directly or indirectly by the run method to the __init__ method, so that the session and the models are created once per instance rather than once per image. I profiled memory usage with valgrind massif, running sbb-binarization both with and without the changes proposed in this PR on the DIBCO11 assets provided by OCR-D, and plotted the data with matplotlib to demonstrate the difference (check massif.zip for the actual massif.out files).
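To illustrate the idea, here is a minimal sketch of the pattern, not the actual sbb-binarization code. FakeModel, LeakyBinarizer, FixedBinarizer and count_model_loads are hypothetical names made up for this example, and FakeModel stands in for a real model so that the sketch runs without TensorFlow. It only counts how many model objects get created for k models and n images.

```python
class FakeModel:
    """Hypothetical stand-in for an expensive-to-load model object."""

    instances = 0  # class-wide counter of how many models were created

    def __init__(self, path):
        FakeModel.instances += 1
        self.path = path

    def predict(self, image):
        # Placeholder: a real model would return a binarized image.
        return image


class LeakyBinarizer:
    """Creates the models inside run(), so k models are built per image."""

    def __init__(self, model_paths):
        self.model_paths = list(model_paths)

    def run(self, image):
        models = [FakeModel(path) for path in self.model_paths]
        for model in models:
            image = model.predict(image)
        return image


class FixedBinarizer:
    """Creates the models once in __init__ and reuses them in run()."""

    def __init__(self, model_paths):
        self.models = [FakeModel(path) for path in model_paths]

    def run(self, image):
        for model in self.models:
            image = model.predict(image)
        return image


def count_model_loads(binarizer_cls, model_paths, images):
    """Process all images and return how many FakeModel objects were created."""
    FakeModel.instances = 0
    binarizer = binarizer_cls(model_paths)
    for image in images:
        binarizer.run(image)
    return FakeModel.instances


if __name__ == "__main__":
    paths = ["model_a", "model_b", "model_c"]  # k = 3 (made-up names)
    images = list(range(5))                    # n = 5 dummy "images"
    print("leaky:", count_model_loads(LeakyBinarizer, paths, images))
    print("fixed:", count_model_loads(FixedBinarizer, paths, images))
```

With k = 3 and n = 5, the first variant creates k·n = 15 model objects and the second creates only k = 3. The same reasoning applies to the session object, which only needs to be created once.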

[Figure: msparse (plot of the massif memory profiles with and without this PR)]

The unnecessary memory allocations cause sbb-binarization's RAM usage to increase gradually over time, reaching over 7 GB by the end of the run. With this PR, memory consumption stabilizes around 1 GB. The process also takes only 84% of the original time to finish, since many instantiation routines are no longer repeated unnecessarily.