Large memory consumption

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?

1. Create a JUnit test with a simple document generation (any document will do)

The code used looks something like this:

    // Load the template from the classpath with the Velocity engine.
    String templateRelativeFilePath = "/templates/report.docx";
    InputStream in = OutOfMemoryTest.class
            .getResourceAsStream(templateRelativeFilePath);
    IXDocReport report = XDocReportRegistry.getRegistry().loadReport(
            in, TemplateEngineKind.Velocity);
    if (fieldsMetadata != null) {
        report.setFieldsMetadata(fieldsMetadata);
    }

    // Copy the Java model into the template context.
    IContext context = report.createContext();
    if (model != null) {
        for (String key : model.keySet()) {
            context.put(key, model.get(key));
        }
    }

    // Convert to PDF; try-with-resources closes the output file.
    Options options = Options.getTo(ConverterTypeTo.PDF);
    String filePath = "D:/" + templateName + ".pdf";
    try (FileOutputStream fos = new FileOutputStream(filePath)) {
        report.convert(context, options, fos);
    }

2. Start VisualVM to monitor memory consumption
3. Run the test

What is the expected output? What do you see instead?
The expected output is moderate memory consumption, but instead a clear burst 
appears the moment the 'report.convert(...)' instruction kicks in.

What version of the product are you using? On what operating system?
fr.opensagres.xdocreport.document.docx 1.0.0 
fr.opensagres.xdocreport.converter.docx.xwpf 1.0.0
fr.opensagres.xdocreport.template.velocity 1.0.0

Windows

Please provide any additional information below.

The memory burst at report conversion is roughly 30-50 MB of RAM, which under 
normal operating conditions isn't that big of a deal (all things considered). 
But when a web server application hits the same wall in production with 
limited memory capacity (-Xms128m -Xmx256m), this results in an 
OutOfMemoryError that kills the server.

Is this the expected memory consumption, or is there a way to fine-tune the 
tool in order to avoid any significant overhead (clear the caches, use 
external disk space for document creation and so on)?

Regards.

Original issue reported on code.google.com by hris...@gmail.com on 28 Oct 2013 at 9:09

GoogleCodeExporter commented 8 years ago

Hi hrisnew,

To be honest with you, we have never studied memory consumption (and I have 
never used tools like JMeter).

It seems that memory consumption grows with conversion. It would be 
interesting to see whether the problem is the same when the docx is converted 
directly to pdf (see 
https://code.google.com/p/xdocreport/wiki/XWPFConverterPDFViaIText)

When a report is converted to pdf through the converter, the process is: 

1) generate the docx report and store it in a byte array
2) pass the byte array to the converter.

At this step memory can grow, because the generated docx report is held in a 
byte array.

It seems that you use the XWPF converter. Its process is: 

1) load the docx stream into an XWPFDocument with POI
2) loop over each Apache POI structure (XWPFParagraph, etc.) to generate the 
iText structure.

So it would be interesting to know whether it is 1) or 2) that consumes the 
memory.

If you could help us with the "XDocReport memory consumption" topic, that 
would be very cool.

Many thanks

Regards Angelo

Original comment by angelo.z...@gmail.com on 28 Oct 2013 at 4:39

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

Hello Angelo,

[Translated from French.] I assume you are French-speaking, so I am writing 
this message in French for clarity. 

Indeed, the way we proceeded initially is as follows (as illustrated in the 
code excerpt in the opening comment):
1. We load the target template into the Registry and get an IXDocReport back 
(as an optimization, we also try not to reload the same template, and instead 
check whether the report for the path we are interested in has already been 
cached).
2. We set the necessary options and put all the Java model data on the 
context (this is obviously unavoidable, since we use XDocReport to generate 
reports and not to convert existing documents).
3. We convert the model to PDF and write the result to a file on disk.

Following your suggestion, I changed the sequence as follows: instead of 
converting at step 3, I first store the report as .docx, and only then 
perform a direct .docx to .pdf conversion as suggested.

The upshot: processing to .docx takes roughly half the memory (the bump in 
the VisualVM monitoring now spans only about 25 MB, which is already better 
but still rather high). The conversion that follows, however, makes memory 
spike again: another 25 to 30 MB. 

At first glance, splitting the process in two therefore seems to improve the 
situation, but the underlying problem remains: for files of fairly 
substantial size (the PDF in our case is 220 KB), the conversion takes far 
too much memory. 

Is there no way to convert the XML (.xdoc) to PDF without keeping everything 
it needs in memory?

Original comment by hris...@gmail.com on 29 Oct 2013 at 2:05

GoogleCodeExporter commented 8 years ago

Hi,

We are French, but we prefer to speak English so that more people can follow 
topics about XDocReport.

If I understand correctly, you have improved memory usage by generating the 
docx report into a temporary file (instead of a byte array) and converting it 
to pdf afterwards.

Perhaps it could be interesting to add this strategy in the Options converter?
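
As a rough illustration of that temporary-file strategy, here is a minimal sketch using only the JDK. The `Generator` and `Converter` interfaces are hypothetical stand-ins for "generate the docx report" and "convert docx to pdf"; they are not XDocReport types, and this is not how the library implements its converter.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempFileStaging {
    /** Hypothetical stand-in for the report generation step. */
    interface Generator {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Hypothetical stand-in for the docx to pdf conversion step. */
    interface Converter {
        void convert(InputStream in, OutputStream out) throws IOException;
    }

    /** Stages the intermediate document on disk instead of in a byte array. */
    static void run(Generator generator, Converter converter, OutputStream target)
            throws IOException {
        Path staged = Files.createTempFile("report-", ".docx");
        try {
            // Step 1: write the generated document straight to the temp file.
            try (OutputStream out = Files.newOutputStream(staged)) {
                generator.writeTo(out);
            }
            // Step 2: feed the converter from the file, not from memory.
            try (InputStream in = Files.newInputStream(staged)) {
                converter.convert(in, target);
            }
        } finally {
            Files.deleteIfExists(staged); // always clean up the intermediate file
        }
    }
}
```

This only avoids holding the intermediate docx bytes on the heap. A converter that builds a full in-memory model of the document will still need memory for that model.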

After that, it seems that you find that our docx->pdf converter based on 
POI+iText uses too much memory. 

Converting docx->pdf is a very hard task. I wrote an article about docx->pdf 
conversion at 
http://angelozerr.wordpress.com/2012/12/06/how-to-convert-docxodt-to-pdfhtml-with-java/ 
which covers other Java docx->pdf converters (docx4j and JODConverter).

You ask whether it is possible to convert the XML entries directly to pdf. I 
think the best performance would come from using a SAX parser and generating 
the pdf from it. But OOXML is a very complex format, so we decided to use a 
DOM-like model to load the docx (we use Apache POI). We could do the same 
thing with docx4j, but we have no time to develop that.

An interesting test would be to find where the memory is used (does just 
loading the docx with Apache POI XWPFDocument already take the memory?)
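
One way to run that kind of test without a profiler is to reset the JVM's per-pool peak counters before a stage and read them afterwards. The sketch below uses only the standard `java.lang.management` API; the class and method names are made up for illustration, and the figure it returns is an approximation (it includes whatever was already on the heap and depends on the garbage collector in use).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

final class HeapPeakProbe {
    /**
     * Runs the stage and returns the sum of the peak "used" bytes reported
     * by all heap memory pools since the peaks were reset.
     */
    static long peakHeapBytesDuring(Runnable stage) {
        // Reset peaks so they reflect this stage rather than earlier work.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                pool.resetPeakUsage();
            }
        }
        stage.run();
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                MemoryUsage peak = pool.getPeakUsage();
                if (peak != null) { // null if the pool is no longer valid
                    total += peak.getUsed();
                }
            }
        }
        return total;
    }
}
```

Calling it once around the "load the docx" stage and once around the "generate the iText structure" stage, each wrapped in a `Runnable`, would give two numbers to compare.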

Original comment by angelo.z...@gmail.com on 29 Oct 2013 at 10:37

GoogleCodeExporter commented 8 years ago

Unfortunately we cannot give you any further hints as to where the observed 
memory burst comes from, other than once again pointing at the convert() 
method. The only additional piece of information at my disposal is that the 
kind of document generated also seems to matter. We, for instance, face this 
problem for reports containing tables with hundreds (not thousands) of rows. 
The resulting pdfs aren't very large by the way, only a few hundred KB.

I do not have a clear idea of what converting docx to pdfs represents in terms 
of programming but now that you mention DOM parsers, it confirms my suspicions 
about the document to convert taking way more space in memory than reasonable. 
How hard is it to port your implementation to a SAX/StAX-based implementation 
anyway?
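
To make the DOM versus streaming difference concrete, here is a small StAX sketch that walks WordprocessingML and counts table rows (`w:tr`) without building a tree. It uses only the JDK's `javax.xml.stream` API. The class name is hypothetical, and counting rows is just a toy task: a real converter would have to handle styles, numbering, layout and much more.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxRowCounter {
    /** Namespace of the main WordprocessingML vocabulary (the usual "w:" prefix). */
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Counts w:tr start elements while streaming; only the current event is held. */
    static int countTableRows(Reader documentXml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(documentXml);
        int rows = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "tr".equals(reader.getLocalName())
                        && W_NS.equals(reader.getNamespaceURI())) {
                    rows++;
                }
            }
        } finally {
            reader.close();
        }
        return rows;
    }
}
```

With this style of parsing, memory use stays roughly flat as the number of rows grows, which is the property a DOM-like model cannot offer.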

Original comment by hris...@gmail.com on 30 Oct 2013 at 3:09

GoogleCodeExporter commented 8 years ago

When I say "DOM-like", I do not mean a real W3C DOM Document; it is the POI 
XWPFDocument that we use. Developing docx->pdf conversion is very, very hard 
and it has taken us a lot of time.

I do not have the courage to restart our converter from scratch with SAX, but 
if you wish to do that, I will be happy to help you.

Regards Angelo

Original comment by angelo.z...@gmail.com on 30 Oct 2013 at 3:17

GoogleCodeExporter commented 8 years ago

Hi,

For your information, I have started a new docx->pdf converter which uses 
only the OpenXMLFormats structures and not the XWPF POI structures (which 
themselves wrap the OpenXMLFormats structures).

I think memory usage will improve, because I can use the generated report 
directly (without creating a byte array), and because I no longer use the 
XWPF structures, which load all the XML entries of the docx. In my case I 
load just the needed XML entries.

As I have only just started it, the converter loses a lot of information when 
the document is converted to pdf. I must still manage tables; after that, it 
would be cool if you could check whether it improves the memory usage.

To test it, you must use 1.0.4-SNAPSHOT and pass 
ConverterTypeVia.OpenXMLFormats instead of ConverterTypeVia.XWPF to 
report.convert.

Original comment by angelo.z...@gmail.com on 1 Nov 2013 at 9:00

GoogleCodeExporter commented 8 years ago

Hi hrisnew,

Have you tested with the ConverterTypeVia.OpenXMLFormats converter option?

I have also seen that you do this every time: 

------------------------------------------------------------------------
IXDocReport report = XDocReportRegistry.getRegistry().loadReport(in, 
TemplateEngineKind.Velocity);
------------------------------------------------------------------------

You must load the report only once and afterwards retrieve it from the 
registry. See 
https://code.google.com/p/xdocreport/wiki/DocxReportingJavaMain#5._Test_Performance 
for a sample.

If you use the XDocReport servlet support, it manages that for you.
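
The load-once idea can be sketched generically as follows. This is a plain JDK cache, not the XDocReport registry: `LoadOnceCache` and its loader function are hypothetical names used only to show the pattern of loading a template the first time it is requested and reusing it afterwards.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

final class LoadOnceCache<K, V> {
    private final ConcurrentMap<K, V> loaded = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    /** The loader stands in for the expensive "load and preprocess the template" step. */
    LoadOnceCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Runs the loader only on the first request for a key, then reuses the result. */
    V get(K key) {
        return loaded.computeIfAbsent(key, loader);
    }
}
```

In the reporter's code the key would be something like the template path, so that repeated requests for the same template do not parse it again.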

Regards Angelo

Original comment by angelo.z...@gmail.com on 7 Nov 2013 at 1:15

sitimoen / xdocreport

Large memory consumption #317