rmraya / OpenXLIFF

An open source set of Java filters for creating, merging and validating XLIFF 1.2, 2.0 and 2.1 files.
https://www.maxprograms.com/products/openxliff.html
Eclipse Public License 1.0

Possible performance improvement for Segmenter.segment #5

Closed: foolo closed 4 years ago

foolo commented 4 years ago

Hi. I noticed that for a particular docx file, the convert process takes a long time (about 20 minutes on my computer). The file is not huge (258 kB), but it probably has some unusually large section. I have attached the file. I have only seen the problem with this file, but I still thought it would be good to report it. The file is notice.docx. Example:

./convert.sh -file notice.docx -srcLang en -tgtLang sv -2.0

I have pinned down the bottleneck to Segmenter.segment, where the main time consumer seems to be the calls to hideTags(pureText.substring(...)). The length of pureText was 16681 in this case.

So there are three levels of nested loops: the for-loop over pureText, the while-loop in hideTags(), and String.substring() inside the while-loop. Each level does work proportional to the length of pureText (16681 here), which explains the high time consumption.

Perhaps it is possible to solve this with StringBuilder, or by changing the string handling in some other way.
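Roughly, I imagine the hot path looks like the sketch below. This is my own simplified model with a stubbed hideTags() and the SRX rule test elided, not the actual OpenXLIFF code:

```java
import java.nio.CharBuffer;

// Simplified sketch, not the real OpenXLIFF code: hideTags() is stubbed
// and the SRX rule test is elided.
public class SegmentSketch {

    // Stand-in for the real method that replaces inline tags with
    // placeholder characters.
    static String hideTags(CharSequence s) {
        return s.toString();
    }

    public static void main(String[] args) {
        String pureText = "Example sentence one. Example sentence two.";

        // Current pattern: substring() copies the prefix and hideTags()
        // rescans it, so candidate break point i costs O(i) work, and
        // the whole loop is O(n^2) for n = pureText.length().
        for (int i = 0; i < pureText.length(); i++) {
            String prefix = hideTags(pureText.substring(0, i));
            // ... test SRX break rules against prefix ...
        }

        // Possible restructuring: hide the tags once up front, then look
        // at prefixes through a zero-copy CharSequence view, so each
        // iteration does O(1) extra work.
        String hidden = hideTags(pureText);
        for (int i = 0; i < hidden.length(); i++) {
            CharSequence prefix = CharBuffer.wrap(hidden, 0, i);
            // ... test the same SRX break rules against prefix ...
        }
    }
}
```

The restructured loop is only a sketch of the direction I have in mind; the real segmenter may need more state than a single prefix view.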

rmraya commented 4 years ago

The file looks like an extract of a PDF generated with an OCR tool. If that's the case, it must be preprocessed with CodeZapper or a similar cleaning tool before converting to XLIFF.

The segmenter code must check character by character for breaking points. It is an SRX requirement that can't be changed.
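For context, an SRX rule pairs a before-break pattern with an after-break pattern, and exception rules (break="no") are checked first, so every candidate position in the text has to be tested. A rough illustration (a minimal sketch of the general SRX idea, not the OpenXLIFF segmenter):

```java
import java.util.regex.Pattern;

// Minimal SRX-style sketch: exception rules win over break rules, and
// every character position in the text is a candidate break point.
public class SrxSketch {
    public static void main(String[] args) {
        String text = "Dr. Smith arrived. He sat down.";
        Pattern noBreakBefore = Pattern.compile("\\b(Dr|Mr|Mrs)\\.\\s*$");
        Pattern breakBefore   = Pattern.compile("[.!?]\\s*$");
        Pattern breakAfter    = Pattern.compile("^\\s+\\p{Lu}");

        for (int i = 1; i < text.length(); i++) {
            String head = text.substring(0, i);
            String tail = text.substring(i);
            if (noBreakBefore.matcher(head).find()) {
                continue; // exception rule: no break after "Dr." etc.
            }
            if (breakBefore.matcher(head).find()
                    && breakAfter.matcher(tail).lookingAt()) {
                System.out.println("segment break before index " + i);
            }
        }
    }
}
```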

There is nothing to change in the code at this moment.

foolo commented 4 years ago

Thanks for your reply! I understand that this file is special, and you are probably right that it is a PDF extract. It comes from a real-world example: the translator gets the file from a client, doesn't know how the file was originally created, imports it, and the conversion takes a long time (they will probably just think that the program has hung). I understand that the SRX rules must be followed. I'm not suggesting changing the resulting function of the code, only the internal string handling, so that it does the same thing more efficiently. I could try to create a suggested solution myself and open a pull request, if you are willing to look at it.

foolo commented 4 years ago

Or, if it's not possible to improve the performance, perhaps add a sanity check and exit with a message when the input is too big.

rmraya commented 4 years ago

If you prepare a pull request that improves performance, I'll certainly review and accept it.

The size warning is not a good idea, as there may be times when it really is necessary to wait while a long text is split.

foolo commented 4 years ago

OK, I'll look at it. It seems tricky, as you say, because the quadratic complexity appears to be inherent to the problem itself, but it might still be possible to speed up the implementation. You are right that a simple size check is not the best idea. If anything, it should be more like a confirmation prompt.

foolo commented 4 years ago

When I let the above command (convert notice.docx) run to the end, without any changes in the code, it seems that an actual bug occurs. I have created a new issue for it: https://github.com/rmraya/OpenXLIFF/issues/6

I'm not sure whether it is related to another potential problem: the Unicode Private Use Area (which is used for the keys of the tags variable in Segmenter.java: https://github.com/rmraya/OpenXLIFF/blob/master/src/com/maxprograms/segmenter/Segmenter.java#L203) only has 6400 code points (U+E000 to U+F8FF), which is too few for the 16681 tags that are created in the example above. So the code will start using code points from the subsequent blocks, which are assigned Unicode characters, and that might cause problems if pureText contains some of those characters.
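To put numbers on it, here is a quick sketch. I'm assuming placeholders are handed out as consecutive code points starting at U+E000; this is not the actual allocation code in Segmenter.java:

```java
// Quick sketch of the concern, assuming tag placeholders are allocated
// as consecutive code points starting at U+E000 (an assumption, not the
// actual OpenXLIFF allocation code).
public class PuaSketch {
    public static void main(String[] args) {
        final int PUA_START = 0xE000;
        final int PUA_END = 0xF8FF;
        System.out.println("PUA capacity: " + (PUA_END - PUA_START + 1)); // 6400

        int tags = 16681; // tag count seen for notice.docx
        System.out.printf("last placeholder: U+%X%n", PUA_START + tags - 1);

        // Placeholder #6401 would already be U+F900, an assigned character
        // (CJK Compatibility Ideographs), so a document that genuinely
        // contains it would collide with a tag placeholder.
        System.out.println(Character.UnicodeBlock.of('\uF900'));
    }
}
```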

rmraya commented 4 years ago

Get latest release.