uds-datalab / PDBF

PDBF - A Toolkit for Creating Janiform Data Documents

Memory issues #41

Open NeutralKaon opened 7 years ago

NeutralKaon commented 7 years ago

Hi there,

I noticed that the code helpfully includes a TODO: //TODO: make option to change this [buffer size]. Add F.A.Q entry for java.lang.OutOfMemoryError: Java heap space.

I've been trying to use PDBF, perhaps inappropriately, to make a nice, shiny HTML5 version of my [completed] PhD thesis. The thesis is nothing unusual: it contains about 1e6 characters of TeX and compiles to a 300 MiB PDF with pdflatex.

I can routinely exhaust the Java heap, and if I simply raise the heap with -Xmx16g -XX:+UseCompressedOops -XX:+DisableExplicitGC (or equivalent), I hit the 32-bit int limit on array size instead. The full trace is:

Compiling HTML...
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
    at java.lang.AbstractStringBuilder.replace(AbstractStringBuilder.java:834)
    at java.lang.StringBuilder.replace(StringBuilder.java:262)
    at pdbf.misc.Tools.fixXref(Tools.java:246)
    at pdbf.compilers.HTML_PDF_Compiler.main(HTML_PDF_Compiler.java:112)
    at pdbf.PDBF_Compiler.main(PDBF_Compiler.java:167)
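
The message in this trace, "Requested array size exceeds VM limit", is a different failure from plain heap exhaustion. Java arrays are indexed by int, so the char array behind a StringBuilder cannot hold more than Integer.MAX_VALUE elements, however large -Xmx is. The sketch below models how a doubling growth policy runs into that ceiling. It assumes the policy used by older JDK 8 builds (new capacity = old * 2 + 2, clamped to Integer.MAX_VALUE on overflow). The class and method names are hypothetical and are not part of PDBF or the JDK.

```java
// Simplified model of StringBuilder growth, assuming the older JDK 8 policy.
// BuilderLimit and nextCapacity are illustrative names, not real JDK or PDBF APIs.
final class BuilderLimit {

    // Returns the capacity a builder would request when it must grow
    // from `current` chars to at least `minimum` chars.
    static int nextCapacity(int current, int minimum) {
        int doubled = current * 2 + 2;      // may overflow to a negative int
        if (doubled - minimum < 0) {
            doubled = minimum;              // doubling was not enough
        }
        if (doubled < 0) {
            if (minimum < 0) {
                throw new OutOfMemoryError("requested capacity overflows int");
            }
            doubled = Integer.MAX_VALUE;    // clamp to the array index limit
        }
        return doubled;
    }

    public static void main(String[] args) {
        // Small buffers simply double.
        System.out.println(nextCapacity(16, 17));
        // A buffer already above about 1.07 billion chars asks for the maximum.
        System.out.println(nextCapacity(1_200_000_000, 1_200_000_001));
        System.out.println(Integer.MAX_VALUE);
    }
}
```

Under this model, once the buffer holds more than roughly 1.07 billion chars, the next growth step requests an array of about Integer.MAX_VALUE elements. HotSpot typically rejects a request that large with the error shown in the trace, regardless of heap size.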

It would be really nice to be able to get around this somehow, but I recognise that it would take a lot of work to rebuild the codebase so that it does not read the whole PDF into memory at once. Do you have any ideas?
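
For illustration, the general alternative to holding the whole file in memory is to move it through a fixed-size buffer. The sketch below shows that pattern with standard java.io and java.nio.file calls. It is not PDBF's code: StreamCopy, copy, and the 8192-byte buffer size are assumptions made for the example. It also does not cover the xref rewriting that Tools.fixXref presumably performs, which would need byte offsets tracked during the copy.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: copies a stream in fixed-size chunks so that memory use
// stays bounded by the buffer size rather than by the file size.
final class StreamCopy {

    // Copies everything from `in` to `out` and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;              // long counter, so files above 2 GiB are fine
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java StreamCopy.java <source> <target>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]));
             OutputStream out = Files.newOutputStream(Paths.get(args[1]))) {
            System.out.println(copy(in, out, 8192) + " bytes copied");
        }
    }
}
```

The byte counter is a long, so the total size is not tied to the int array limit. Only the buffer itself is an array, and its size is fixed.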

Thanks for a great project!

IchbinkeinReh commented 7 years ago

Hi NeutralKaon, could you please check whether the problem also occurs with version 1.2.5? Regards, Patrick

PS: The PDF you get out of the PDBF compiler is not really HTML5. There is no LaTeX to HTML5 converter in this project. Your PDF is only combined with the pdf.js library so that it can be viewed in the browser.

PPS: The current implementation will always output HTML that is at least double the size of your PDF file. That is because the PDF content is saved twice, once for PDF viewers and once for browsers.
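
As a rough worked example of the "at least double" rule applied to the 300 MiB thesis from the first comment: the sketch below computes only the lower bound. The factor of 2 comes from the comment above. Any extra overhead from how the browser copy is encoded is unknown here and is not modelled, and the class and method names are hypothetical.

```java
// Lower-bound estimate of the output size, assuming only the factor of 2
// stated in the thread. OutputSizeEstimate is an illustrative name.
final class OutputSizeEstimate {

    // The PDF content is stored twice, so the output is at least 2x the input.
    static long minOutputBytes(long pdfBytes) {
        return Math.multiplyExact(pdfBytes, 2L);
    }

    public static void main(String[] args) {
        long pdfBytes = 300L * 1024 * 1024;          // 300 MiB
        long lowerBound = minOutputBytes(pdfBytes);  // at least 600 MiB
        System.out.println("lower bound (bytes): " + lowerBound);
        System.out.println("int array limit:     " + Integer.MAX_VALUE);
    }
}
```

The lower bound of 629,145,600 bytes is below Integer.MAX_VALUE, so the doubling alone does not explain the array-size error. The real output may be larger than this bound.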