rmraya / OpenXLIFF

An open source set of Java filters for creating, merging and validating XLIFF 1.2, 2.0 and 2.1 files.
https://www.maxprograms.com/products/openxliff.html
Eclipse Public License 1.0

Performance improvement for convert with -embed and -2.0 #2

Closed: foolo closed this issue 5 years ago

foolo commented 5 years ago

I noticed that combining the -2.0 option with -embed (which is my use case) makes the Convert step take a very long time. Maybe you are already aware of it, but here are some measurements and a small investigation anyway. Some examples, all run on the same test.docx (4.3 MB, 4400 words):

Without the -embed flag (13 seconds): ./convert.sh -file test.docx -srcLang da -tgtLang sv -2.0

Without the -2.0 flag (13 seconds): ./convert.sh -file test.docx -srcLang da -tgtLang sv -embed

With both the -2.0 and -embed flags (77 seconds): ./convert.sh -file test.docx -srcLang da -tgtLang sv -2.0 -embed

I did some debugging: a single call to com.maxprograms.xml.Element.mergeText() takes about 1 minute to complete, and the bottleneck seems to be this line: https://github.com/rmraya/OpenXLIFF/blob/master/src/com/maxprograms/xml/Element.java#L167

When mergeText() runs on the <internal-file> element (i.e., all the base64 skeleton data), the content member is a big vector (37000 lines in my case), which is then concatenated line by line into a new string: t.setText(t.getText() + ((TextNode) n).getText()); Each concatenation copies the whole string built so far, so the total work grows quadratically with the number of lines.

It can probably be improved fairly easily, so that it runs almost instantly, for example by using a StringBuilder for concatenation.
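
To illustrate the idea, here is a minimal sketch comparing the two approaches. It is not the actual OpenXLIFF code: the class MergeTextSketch, both method names, and the 76-character line length are assumptions made for the example; only the 37000-line count comes from the measurements above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the merge step; not the real Element.mergeText().
class MergeTextSketch {

    // Repeated String concatenation: every iteration copies the text
    // accumulated so far, so total work grows quadratically.
    static String mergeByConcatenation(List<String> chunks) {
        String merged = "";
        for (String chunk : chunks) {
            merged = merged + chunk;
        }
        return merged;
    }

    // StringBuilder appends into a growable buffer, so total work
    // grows roughly linearly with the combined length of the chunks.
    static String mergeWithBuilder(List<String> chunks) {
        StringBuilder merged = new StringBuilder();
        for (String chunk : chunks) {
            merged.append(chunk);
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        // Build one 76-character line (assumed length of a base64 line).
        StringBuilder lineBuilder = new StringBuilder();
        for (int i = 0; i < 76; i++) {
            lineBuilder.append('A');
        }
        String line = lineBuilder.toString();

        // 37000 lines, matching the size reported in this issue.
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < 37000; i++) {
            chunks.add(line);
        }

        long start = System.nanoTime();
        String slow = mergeByConcatenation(chunks);
        long concatMillis = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        String fast = mergeWithBuilder(chunks);
        long builderMillis = (System.nanoTime() - start) / 1_000_000;

        System.out.println("concatenation: " + concatMillis + " ms");
        System.out.println("StringBuilder: " + builderMillis + " ms");
        System.out.println("same result: " + slow.equals(fast));
    }
}
```

Both methods return the same string; only the cost differs. Actual timings depend on the JVM and the machine, so the numbers printed here will not necessarily match the 77 seconds seen in the real conversion.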

rmraya commented 5 years ago

Suggested change has been implemented.

Overall performance of the XML library is now slightly worse (in most cases, XML elements contain merged text and now there is an extra cost in creating a StringBuilder) but for merging skeletons and converting to XLIFF 2.0 the change may be relevant.

foolo commented 5 years ago

Nice! Now the calls take 15, 14, and 16 seconds instead of the previous 13, 13, and 77, so the result is as predicted. BTW, I needed to change the module name in convert.sh (it was still using the old xliffFilters name).