rmraya / OpenXLIFF

An open source set of Java filters for creating, merging and validating XLIFF 1.2, 2.0 and 2.1 files.
https://www.maxprograms.com/products/openxliff.html
Eclipse Public License 1.0

SAXException when converting docx file #6

Closed: foolo closed this issue 4 years ago

foolo commented 4 years ago

For a certain file with lots of tags, an exception occurs when converting:

Steps: ./convert.sh -file notice.docx -srcLang en -tgtLang sv -2.0 (file is attached)

Output:

Oct 24, 2019 11:45:17 AM com.maxprograms.xml.CustomErrorHandler fatalError
SEVERE: 1:250 Element type "p" must be followed by either attribute specifications, ">" or "/>".
Oct 24, 2019 11:45:17 AM com.maxprograms.converters.msoffice.MSOffice2Xliff run
SEVERE: Error converting MS Office file
org.xml.sax.SAXException: [Fatal Error] 1:250 Element type "p" must be followed by either attribute specifications, ">" or "/>".
    at openxliff/com.maxprograms.xml.CustomErrorHandler.fatalError(CustomErrorHandler.java:43)
    at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:181)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1471)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.seekCloseOfStartTag(XMLDocumentFragmentScannerImpl.java:1433)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:242)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2710)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1216)
    at java.xml/com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:635)
    at openxliff/com.maxprograms.xml.SAXBuilder.build(SAXBuilder.java:89)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.writeSegment(MSOffice2Xliff.java:141)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePara(MSOffice2Xliff.java:386)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:587)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePhrase(MSOffice2Xliff.java:589)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recursePara(MSOffice2Xliff.java:419)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recurse(MSOffice2Xliff.java:283)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.recurse(MSOffice2Xliff.java:285)
    at openxliff/com.maxprograms.converters.msoffice.MSOffice2Xliff.run(MSOffice2Xliff.java:97)
    at openxliff/com.maxprograms.converters.office.Office2Xliff.run(Office2Xliff.java:131)
    at openxliff/com.maxprograms.converters.Convert.run(Convert.java:366)
    at openxliff/com.maxprograms.converters.Convert.main(Convert.java:238)

notice.docx

rmraya commented 4 years ago

You should clean the file with CodeZapper or similar before attempting to convert. A file with too many tags is untranslatable.

foolo commented 4 years ago

Is there any way for the user to know whether the file needs to be cleaned? At least it would be good to have an error message, a warning, or some way to detect it programmatically, instead of just a crash. I don't want to put the technical responsibility on the end user.

Another note: it does not seem that all entries in the tags variable become actual tags in the XLIFF file. I tried with a few different docx files, and in one example the maximum size of tags was around 70, but no segment was created with that many tags. The segments still looked OK. So I don't think that too many tags necessarily makes the file "untranslatable". In the case of notice.docx, it's only junk data that does not contain any actual text, so it will all be hidden in the end.

rmraya commented 4 years ago

Professional translators know when a file needs to be cleaned before translation. It is their job.

You can detect the crash and let the user know that the file needs pre-processing.
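
Something like this on your side, as a rough sketch; convertDocx() and notifyUser() here are only placeholders for however your tool calls the filter and talks to the user, not OpenXLIFF API:

// Rough sketch only: wrap the conversion call in the CAT tool and turn a
// failed conversion into an actionable message instead of a bare stack trace.
boolean convertOrExplain(File docx) {
    try {
        boolean ok = convertDocx(docx);   // placeholder for the actual conversion call
        if (!ok) {
            notifyUser("Conversion failed. The file may contain too many inline tags; clean it (e.g. with CodeZapper) and try again.");
        }
        return ok;
    } catch (Exception e) {               // e.g. the SAXException shown above, if it reaches your code
        notifyUser("Conversion failed: " + e.getMessage() + ". The file likely needs pre-processing.");
        return false;
    }
}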

Segmentation generates tags. The filter removes unnecessary tags or merges adjacent tags later during processing. The Office filter does a good job post-processing tags.

foolo commented 4 years ago

Ok :) So it depends a bit on what we assume about the user.

But anyway, here's a concrete suggestion. Maybe you saw my comment in the other thread about the size of the Unicode private use area, U+E000 to U+F8FF, which tags uses for mapping the tags. Since the private use area is only 6400 code points (0xF8FF - 0xE000 + 1 = 6400), we cannot store more than 6400 tags in tags. So we could add a check in Segmenter.segment(String string), right after String pureText = prepareString(string);

If we add a check, something like this:

if (tags.size() > 6400) {
   // throw an exception "document has too many tags, needs pre-processing"
}

then we would get this error message instead of the SAXException above, and we would also not need to wait 20 minutes before the error appears.

rmraya commented 4 years ago

That extra check will slow processing. Not a good idea.

foolo commented 4 years ago

It is only one if-statement, which runs once per call to Segmenter.segment(String string). (I tried with a 4000-word document and it was called 320 times.) The if-statement should take less than a microsecond, so the total time penalty should be less than a millisecond even for a huge document.

I will paste the code in its context, for clarity:

public String[] segment(String string) {
    if (string == null || string.equals("")) {
        return new String[] {};
    }
    String pureText = prepareString(string);

    // here's my addition             
    if (tags.size() > 6400) {
        // throw an exception "document has too many tags, needs pre-processing"
    }
(...)
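
For concreteness, the placeholder above could be filled in roughly like this; I am using an unchecked exception only because I don't know what segment() is allowed to throw:

if (tags.size() > 6400) {
    // more tags than the private use area (U+E000..U+F8FF) can map to placeholders
    throw new IllegalStateException("Document has too many tags and needs pre-processing before conversion");
}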

rmraya commented 4 years ago

Still a bad idea. It would mask the real cause of the problem.

The stack trace you posted does not mention any issue with extended characters. The tags Hashtable<> uses String as its key, not a character.

foolo commented 4 years ago

Correct, it will mask the problem. But it would also avoid the problem of using actual Unicode characters (>= U+F900) as tag placeholders: if those characters appear in the actual text, they disappear from pureText. So the problem is not about extended characters; I think those are handled correctly (as you wrote, Hashtable<> uses String, so high-range surrogate pairs can be used). The problem is that we walk outside the Unicode private use area and into some Asian characters, etc. (still below U+FFFF, so we have not reached the extended range yet).
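
To make the overflow concrete, here is a toy illustration (not the actual Segmenter code) of mapping tags to placeholder characters starting at U+E000; once the index passes 6399 we leave the private use area:

// Toy illustration only: one placeholder character per inline tag, starting at U+E000.
char placeholderFor(int tagIndex) {
    return (char) (0xE000 + tagIndex);
}
// placeholderFor(0)    -> U+E000 (first private use code point)
// placeholderFor(6399) -> U+F8FF (last private use code point)
// placeholderFor(6400) -> U+F900 (a real CJK compatibility ideograph, so a collision risk)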

rmraya commented 4 years ago

Crossing character ranges does not happen in real life. This code has been in actual use for 10 years by thousands of translators, most of them working with CJK languages, and that kind of error was never reported.

foolo commented 4 years ago

Ok. That's good to hear. By the way, you mentioned CodeZapper or similar. CodeZapper seems to be a Windows-only application, right? Do you know of any similar software?

rmraya commented 4 years ago

Check http://translatortools.net/about.html

foolo commented 4 years ago

Ok, I was thinking of something with Linux and Mac support. There is no point in writing a platform-independent CAT tool if the translator must preprocess the docx file with a Windows tool :)

rmraya commented 4 years ago

Translators that work on macOS use Microsoft Word to clean their files.