red6 / pdfcompare

A simple Java library to compare two PDF files
Apache License 2.0
220 stars 66 forks source link

Readme mentions CompareResultWithDiskStorage but class is (no longer?) in jar #5

Closed PascalSchumacher closed 7 years ago

PascalSchumacher commented 7 years ago

as of version 1.1.17

finsterwalder commented 7 years ago

Yes, CompareResultWithDiskStorage was removed. I updated the README. Use new CompareResultWithPageOverflow(1) to write every page immediately to a file. Writing images had the problem, that the page size is lost that way. And it was also less performant.

PascalSchumacher commented 7 years ago

Thanks! :+1:

PascalSchumacher commented 7 years ago

Outside of debugging CompareResultsWithPageOverflow(1) actually uses more memory than CompareResultsWithPageOverflow() :(

I'm not sure why, but it seems like not enough BufferedImage objects are garbage collected.

CompareResultWithMemoryOverflow also does not seem to work.

The only way to reliable reduce memory usage seems to be restoring CompareResultWithDiskStorage.

finsterwalder commented 7 years ago

CompareResultWithDiskStorage does not work correctly and never did. It has the conceptual problem, that the resulting page size is lost along the way. And there is no reason, why it should use considerably less memory than the other CompareResults, really. I will have to investigate, why it uses so much memory. Maybe it's because of caching in PdfBox. I had problems with PdfBox caching before. I added a part in the README to describe some memory settings. Maybe you can tweak those to lower values and see what happens. Do you get out of memory errors? Home much heap does your JVM have? Did you monitor heap usage?

PascalSchumacher commented 7 years ago

Thanks for the reply. :+1:

I will give tweaking those values a try.

To replicated problems we have during build on ci I compared two pdfs six pages each.

CompareResultsWithPageOverflow() required a heap of -Xmx512m, or I would get OutOfMemoryExceptions. CompareResultsWithPageOverflow(1) required slightly more memory, something like -Xmx532m.

With CompareResultWithDiskStorage I was able to reduce heap memory to something like -Xmx400m. (I expected this to be lower.)

I could not really figure out were all the memory went. A BufferedImage of a pdfpage was around 25 MB of memory. The pdf are only around 100 KB on disk an only a few MB of memory when loaded with PDFBox.

I hacked together a single threaded version (no executors) of pdfcompare that runs the same comparison with -Xmx220m using CompareResultWithPageOverflow(1). In this version CompareResultWithPageOverflow works as expected, because CompareResultWithPageOverflow(1) requires less memory that CompareResultWithPageOverflow().

finsterwalder commented 7 years ago

I must admit, that very low memory consumption is not a main goal for me. I only want to keep memory consumption somewhat in check, which is difficult enough. :-) Creating multiple threads and multiple PdfBox instances with their own caches does use up some extra memory. But I rather give more memory and improve performance by parallelization. Processing new pages, while the previous pages are not yet written to disk also uses more memory, than doing it single threaded.

finsterwalder commented 7 years ago

Hi Pascal,

did PdfCompare work for you now?

PascalSchumacher commented 7 years ago

Hi Malte,

thanks for asking. :)

PdfCompare works great for us. :+1:

For reasons beyond our control our ci builds have only a ridiculous small amount of memory. Therefore we split the PDFs page by page using PDFBox and then compare these single page pdfs with PdfCompare.

finsterwalder commented 7 years ago

Good to hear that it helps you! Did I understand correctly, that a single threaded version that splits after each page would work even better for you? But the muti threaded version uses too much memory? Maybe I can introduce an option that controlls the threading. I will think about that...

PascalSchumacher commented 7 years ago

That is correct.

A BufferedImage takes 35 MB of memory and PdfCompare can keep quite a few in memory (either currently processed by a thread or in a queue/map waiting for a free thread).

I think our memory constraints are extreme/unusual, so do not feel bad if can not cleanly add a single threaded option.

By the way: In my (very limited) testing (tested only a single small 6 page pdf with differences on every page on my quadcore laptop), only AbstractCompareResultWithSwap#swapExecutor actually speed up the comparison (by 25%) compared to a completely single-threaded version of PdfCompare.

finsterwalder commented 7 years ago

I created a config switch, that allows to disable all threading. See the README.md. It is in releas 1.1.19, but it may take some hours until it's available in maven central.

PascalSchumacher commented 7 years ago

Thanks! :+1:

It is already available in maven central.

I was able to rerun my tests with just 180 MB of heap memory with 1.1.19 (test takes 60% longer that the multi-threaded version). We can now remove our workaround.

Thanks again!