qurator-spk / sbb_binarization

Document Image Binarization
Apache License 2.0
67 stars 14 forks

Batch-prediction across multiple GPUs and more efficient patch-prediction #48

Status: Open · apacha opened 1 year ago

apacha commented 1 year ago

In order to batch-binarize thousands of images, I've rewritten the prediction script to allow us to predict around 1500-2000 images per hour on a decent machine with two GPUs.

The proposed changes include:

  • An efficient way to compute the image patches instead of a very inefficient loop
  • Complete removal of the prediction on the down-scaled image as the results are pretty much always worse
  • Batch-prediction code that can binarize an entire directory into a given output directory while preserving the folder structure and skipping images that have already been binarized, to allow stopping and continuing the conversion
  • Multiprocessing batch-prediction across multiple GPUs using the mpire library
  • A fix for the memory-leak that caused mass-binarization to very quickly crash because we were running out of memory on the GPU. With this fix, we are already running the conversion for 16 hours without any crash.
  • Simplified loading of the model removing obsolete session-handling code

Please note: I know that the code looks completely different now (hopefully more readable) and is probably not 1:1 compatible with the remaining code in your repository, but I tried to put all the relevant changes into this PR and make the code as self-contained as possible to allow you to update the solution as you see fit.

Thanks for sharing the code-base with us. I hope that this PR is of some help to you.
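The multi-GPU batch approach described above can be sketched with the standard library alone (the PR itself uses the mpire library; the function names `split_round_robin`, `binarize_on_gpu`, and `run_on_gpus` are hypothetical, chosen for illustration):

```python
import os
from multiprocessing import Process


def split_round_robin(items, n):
    """Deal items round-robin into n buckets, one per GPU worker."""
    return [items[i::n] for i in range(n)]


def binarize_on_gpu(gpu_id, image_paths):
    # Restrict this process to a single GPU; in real code this must be
    # set before TensorFlow is imported. Prediction itself is elided.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    for path in image_paths:
        pass  # load the model once per worker, then predict each image


def run_on_gpus(image_paths, gpu_ids):
    # One worker process per GPU, each with its own share of the images.
    chunks = split_round_robin(image_paths, len(gpu_ids))
    procs = [Process(target=binarize_on_gpu, args=(g, c))
             for g, c in zip(gpu_ids, chunks)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

mpire's `WorkerPool` adds niceties (progress bars, worker state for keeping the loaded model alive) on top of this same process-per-GPU idea.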

vahidrezanezhad commented 1 year ago

Dear @apacha ,

A few things, like possible OOM errors, binarization with a collection of models, and the exact processing-time improvement, will be tested before merging your PR. By the way, as the Qurator team we would like to thank you for your efforts to improve the sbb_binarization tool.

vahidrezanezhad commented 1 year ago

In order to batch-binarize thousands of images, I've rewritten the prediction script to allow us to predict around 1500-2000 images per hour on a decent machine with two GPUs.

The proposed changes include:

  • An efficient way to compute the image patches instead of a very inefficient loop
  • Complete removal of the prediction on the down-scaled image as the results are pretty much always worse
  • Batch-prediction code that can binarize an entire directory into a given output directory while preserving the folder structure and skipping images that have already been binarized, to allow stopping and continuing the conversion
  • Multiprocessing batch-prediction across multiple GPUs using the mpire library
  • A fix for the memory-leak that caused mass-binarization to very quickly crash because we were running out of memory on the GPU. With this fix, we are already running the conversion for 16 hours without any crash.
  • Simplified loading of the model removing obsolete session-handling code
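The vectorized patch computation in the first bullet can be illustrated with a minimal NumPy sketch (the PR uses tf.image.extract_patches; this stand-alone version is only an illustrative equivalent, assuming non-overlapping tiles and image dimensions divisible by the patch size):

```python
import numpy as np


def extract_patches(image, patch_h, patch_w):
    """Split an H x W x C image into non-overlapping patches.

    Assumes H % patch_h == 0 and W % patch_w == 0; real code would pad
    the image first. Returns an array of shape
    (num_patches, patch_h, patch_w, C) with a single reshape/transpose
    instead of a per-patch Python loop.
    """
    h, w, c = image.shape
    patches = image.reshape(h // patch_h, patch_h, w // patch_w, patch_w, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch_h, patch_w, c)


img = np.arange(4 * 6 * 1).reshape(4, 6, 1)
print(extract_patches(img, 2, 3).shape)  # (4, 2, 3, 1)
```

The whole batch of patches can then be fed to `model.predict` in one call instead of one forward pass per patch.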

Please note: I know that the code looks completely different now (hopefully more readable) and is probably not 1:1 compatible with the remaining code in your repository, but I tried to put all the relevant changes into this PR and make the code as self-contained as possible to allow you to update the solution as you see fit.

Thanks for sharing the code-base with us. I hope that this PR is of some help to you.

Dear @apacha ,

I got this error with your pull request

```
Traceback (most recent call last):
  File "/home/vahid/Documents/sbb_binarization/bin_envnew/bin/sbb_binarize", line 8, in <module>
    sys.exit(main())
  File "/home/vahid/Documents/sbb_binarization/bin_envnew/lib64/python3.6/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/vahid/Documents/sbb_binarization/bin_envnew/lib64/python3.6/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/vahid/Documents/sbb_binarization/bin_envnew/lib64/python3.6/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/vahid/Documents/sbb_binarization/bin_envnew/lib64/python3.6/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/vahid/Documents/sbb_binarization/bin_envnew/lib64/python3.6/site-packages/sbb_binarize/cli.py", line 15, in main
    SbbBinarizer(model_dir).run(image_path=input_image, use_patches=patches, save=output_image)
TypeError: __init__() takes 1 positional argument but 2 were given
```

apacha commented 1 year ago

You get this exception because I've rewritten the SbbBinarizer class to separate loading the model from initializing the object, see https://github.com/qurator-spk/sbb_binarization/pull/48/files#diff-3b757c0dba09b4add9cf8c0566db9834fdb19e6af8e5aaf538e8810b1ac613eeR164. I'll update cli.py as well.
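A self-contained sketch of that split (the class body and the method name `load_model` are assumptions made for illustration, not the PR's actual code):

```python
class SbbBinarizer:
    """Illustrative stand-in: construction no longer takes a model path."""

    def __init__(self):
        self.model = None  # loading is deferred to load_model()

    def load_model(self, model_dir):
        # placeholder for the actual Keras model loading
        self.model = f"model from {model_dir}"

    def run(self, image_path):
        if self.model is None:
            raise RuntimeError("call load_model() before run()")
        return f"binarized {image_path}"


binarizer = SbbBinarizer()          # old code: SbbBinarizer(model_dir) -> TypeError
binarizer.load_model("/path/to/models")
print(binarizer.run("page.png"))    # binarized page.png
```

Splitting the two steps lets a long-lived worker load the model once and then binarize many images, which is what the batch code needs.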

bertsky commented 1 year ago

See https://github.com/apacha/sbb_binarization/pull/1 for an update of the PR.

bertsky commented 1 year ago

Not sure if this is the right place for a discussion, but IMO this is not the right approach for efficient prediction yet. We should define a tf.data pipeline, allowing pipelining between our (intensive!) CPU pre- and postprocessing and the GPU side. On that occasion, multithreading on the CPU side (reshaping and contour finding with OpenCV) should be attempted, too...
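The pipelining idea can be sketched without TensorFlow using a bounded queue between a CPU-side producer and a consumer standing in for the GPU; in a real tf.data pipeline, `Dataset.map(..., num_parallel_calls=...)` and `Dataset.prefetch(...)` play these roles. `preprocess` and `predict` here are dummy stand-ins:

```python
import queue
import threading


def preprocess(item):
    # stand-in for CPU-heavy work (reshaping, contour finding, ...)
    return item * 2


def predict(batch):
    # stand-in for the GPU forward pass
    return [x + 1 for x in batch]


def pipelined(batches):
    """Overlap CPU preprocessing with prediction, like tf.data prefetch."""
    q = queue.Queue(maxsize=2)  # bounded buffer = prefetch depth

    def producer():
        for batch in batches:
            q.put([preprocess(x) for x in batch])
        q.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (batch := q.get()) is not None:
        results.extend(predict(batch))
    return results


print(pipelined([[1, 2], [3, 4]]))  # [3, 5, 7, 9]
```

The bounded queue is the key: the producer keeps preparing the next batch while the consumer is busy, so neither side idles as long as the buffer is non-empty and non-full.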

Also, perhaps at least some of the OpenCV stuff can be ported to cv2.cuda calls – I know it would still involve multiple additional CPU-GPU transfers (since cv2.cuda is not in the same graph as the tf/keras part), but at least the GPU would be utilised a bit more.

bertsky commented 1 year ago

Sorry, in my previous comment I was thinking more about Eynollah than the Binarizer (hence the heavy CPU part). And @apacha's PR already speeds things up by an order of magnitude. I can see minor differences between the results from tf.image.extract_patches and the old numpy tiling, but the quality is still very good. (I can provide before/after examples, both for the old ensemble model and the 2021 version – let me know if you are interested.) My argument was only meant about the direction of future work we should undertake.

bertsky commented 1 year ago

tf.data pipelining with heavy CPU processing itself seems to be hard to get right: to get true parallelisation, one probably needs tfaip...