Closed PascalSchumacher closed 7 years ago
Yes, CompareResultWithDiskStorage was removed. I updated the README. Use new CompareResultWithPageOverflow(1) to write every page immediately to a file. Writing images had the problem, that the page size is lost that way. And it was also less performant.
Thanks! :+1:
Outside of debugging CompareResultsWithPageOverflow(1)
actually uses more memory than CompareResultsWithPageOverflow()
:(
I'm not sure why, but it seems like not enough BufferedImage
objects are garbage collected.
CompareResultWithMemoryOverflow
also does not seem to work.
The only way to reliable reduce memory usage seems to be restoring CompareResultWithDiskStorage
.
CompareResultWithDiskStorage does not work correctly and never did. It has the conceptual problem, that the resulting page size is lost along the way. And there is no reason, why it should use considerably less memory than the other CompareResults, really. I will have to investigate, why it uses so much memory. Maybe it's because of caching in PdfBox. I had problems with PdfBox caching before. I added a part in the README to describe some memory settings. Maybe you can tweak those to lower values and see what happens. Do you get out of memory errors? Home much heap does your JVM have? Did you monitor heap usage?
Thanks for the reply. :+1:
I will give tweaking those values a try.
To replicated problems we have during build on ci I compared two pdfs six pages each.
CompareResultsWithPageOverflow()
required a heap of -Xmx512m
, or I would get OutOfMemoryExceptions
. CompareResultsWithPageOverflow(1)
required slightly more memory, something like -Xmx532m
.
With CompareResultWithDiskStorage
I was able to reduce heap memory to something like -Xmx400m
. (I expected this to be lower.)
I could not really figure out were all the memory went. A BufferedImage
of a pdfpage was around 25 MB of memory. The pdf are only around 100 KB on disk an only a few MB of memory when loaded with PDFBox.
I hacked together a single threaded version (no executors) of pdfcompare that runs the same comparison with -Xmx220m
using CompareResultWithPageOverflow(1)
. In this version CompareResultWithPageOverflow
works as expected, because CompareResultWithPageOverflow(1)
requires less memory that CompareResultWithPageOverflow()
.
I must admit, that very low memory consumption is not a main goal for me. I only want to keep memory consumption somewhat in check, which is difficult enough. :-) Creating multiple threads and multiple PdfBox instances with their own caches does use up some extra memory. But I rather give more memory and improve performance by parallelization. Processing new pages, while the previous pages are not yet written to disk also uses more memory, than doing it single threaded.
Hi Pascal,
did PdfCompare work for you now?
Hi Malte,
thanks for asking. :)
PdfCompare works great for us. :+1:
For reasons beyond our control our ci builds have only a ridiculous small amount of memory. Therefore we split the PDFs page by page using PDFBox and then compare these single page pdfs with PdfCompare.
Good to hear that it helps you! Did I understand correctly, that a single threaded version that splits after each page would work even better for you? But the muti threaded version uses too much memory? Maybe I can introduce an option that controlls the threading. I will think about that...
That is correct.
A BufferedImage
takes 35 MB of memory and PdfCompare can keep quite a few in memory (either currently processed by a thread or in a queue/map waiting for a free thread).
I think our memory constraints are extreme/unusual, so do not feel bad if can not cleanly add a single threaded option.
By the way: In my (very limited) testing (tested only a single small 6 page pdf with differences on every page on my quadcore laptop), only AbstractCompareResultWithSwap#swapExecutor
actually speed up the comparison (by 25%) compared to a completely single-threaded version of PdfCompare.
I created a config switch, that allows to disable all threading. See the README.md. It is in releas 1.1.19, but it may take some hours until it's available in maven central.
Thanks! :+1:
It is already available in maven central.
I was able to rerun my tests with just 180 MB of heap memory with 1.1.19 (test takes 60% longer that the multi-threaded version). We can now remove our workaround.
Thanks again!
as of version
1.1.17