Newline lost in conversion cycle for some cases

rmraya / OpenXLIFF

An open source set of Java filters for creating, merging and validating XLIFF 1.2, 2.0 and 2.1 files.

https://www.maxprograms.com/products/openxliff.html

Eclipse Public License 1.0


Newline lost in conversion cycle for some cases #3

Closed foolo closed 5 years ago

foolo commented 5 years ago

For certain cases, when running convert + merge, without doing any changes to the XLIFF file, the output document is different from the input, in that a newline is lost. I'm wondering whether it is expected behavior when using default.srx, or a bug.

See attached example files:

input: test3.docx

Test?
Example.

output: test3_sv.docx

Test?Example.

Steps:

./convert.sh -file test3.docx -srcLang en -tgtLang sv -2.0 -embed
./merge.sh -xliff test3.docx.xlf -target test3_sv.docx

Version used: latest master branch, bddd767.

The provided default.srx was used (though, as far as I understand, segmentation should not affect whether the output and input documents match).
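A round trip like the one reported above can be checked mechanically by comparing the visible text of the source and merged documents. The following is only a minimal sketch, not part of OpenXLIFF: the helper names `docx_lines` and `same_text` are hypothetical, and it assumes that the lost newline is a WordprocessingML `w:br` line break inside a single `w:p` paragraph, which the thread does not state explicitly at this point.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_lines(path):
    """Return the visible text of a .docx as a list of lines.

    Hypothetical helper: paragraph ends and w:br elements both start a
    new line. Simplified on purpose; paragraphs nested inside text boxes
    would be visited twice.
    """
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    lines = []
    for para in root.iter(W + "p"):
        parts = []
        for node in para.iter():
            if node.tag == W + "t":
                parts.append(node.text or "")
            elif node.tag == W + "br":
                parts.append("\n")
        lines.extend("".join(parts).split("\n"))
    return lines


def same_text(source_docx, merged_docx):
    """True if both documents expose the same sequence of text lines."""
    return docx_lines(source_docx) == docx_lines(merged_docx)
```

Calling `same_text("test3.docx", "test3_sv.docx")` after the convert and merge steps would flag the difference described in this report, provided the assumption about `w:br` holds for the file in question.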

rmraya commented 5 years ago

The sample file was not generated with Microsoft Word. It has markup that does not follow the paragraph structure expected in a Word file. The MS Office filter cleans the extra tags and normalizes content before extracting text.

Conversion works as designed.

foolo commented 5 years ago

OK, I understand. While the attached file is just an example, I actually found this in a real-life file that a client sent for translation, so this kind of format does seem to be in use. Perhaps the client, or the client's client, was using LibreOffice. So from OpenXLIFF's point of view, these non-Microsoft docx files are not supported, right? (Which means that in practice the CAT tool must check whether the docx was generated by MS Word before sending it to convert.)

rmraya commented 5 years ago

If the file is from LibreOffice, save in OpenOffice format (*.odt) and convert.

foolo commented 5 years ago

OK, thanks, so basically the CAT tool needs to detect whether the file is saved with LibreOffice (the translator probably doesn't know whether their client used LibreOffice or not) and then instruct the translator to re-save it with LibreOffice, which they might not have installed locally, so it's a bit of a hassle for the user. I hope I can figure out some clever solution to it :)

rmraya commented 5 years ago

Translators can tell when there is an issue and fix the source file accordingly. It is not a task of the CAT tool to check the format.

foolo commented 5 years ago

OK! BTW, how did you determine that the file was from LibreOffice? I should check the real-life file so that I'm sure that it's actually a LibreOffice-saved file.

rmraya commented 5 years ago

I unzipped the file and checked "document.xml". The file has embedded HTML markup for splitting paragraphs instead of regular Office XML paragraph tags. That's why I said it wasn't produced by Microsoft Word.

You said the file was from LibreOffice, something I later verified by looking at file properties on Windows.
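The producing application can often be read without opening the file in an editor, because a docx package usually carries an extended-properties part at `docProps/app.xml` with an `Application` element. The sketch below is an illustration only: the function name `producing_application` is hypothetical, the part is optional, and the value is free text written by whatever tool saved the file, so it should be treated as a hint rather than proof. Whether this is the same property that the Windows file properties dialog shows is an assumption.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the extended-properties part (docProps/app.xml)
EP = "{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"


def producing_application(path):
    """Return the Application string from docProps/app.xml, or None.

    Hypothetical helper: returns None when the part or the element is
    missing, since both are optional in a docx package.
    """
    with zipfile.ZipFile(path) as z:
        if "docProps/app.xml" not in z.namelist():
            return None
        root = ET.fromstring(z.read("docProps/app.xml"))
    node = root.find(EP + "Application")
    return node.text if node is not None else None
```

A call such as `producing_application("test3.docx")` returns the recorded application name, which a tool could log or show to the translator instead of guessing the origin of the file.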

foolo commented 5 years ago

I see. I have now created a docx file from office.com (which should produce a real docx file, I think) and compared its document.xml with the one from the LibreOffice file (test3.docx), but I could not see what the difference is :) Anyway, thanks for your answers! (I attached the office.com file here, if you are interested: Document1.docx)

rmraya commented 5 years ago

The new file produces two segments instead of one. It has a different structure.

foolo commented 5 years ago

Ah, maybe I understand you now. When you wrote "The file has embedded HTML markup for splitting paragraph instead of regular Office XML paragraph tags", were you referring to the fact that test3.docx separates "Test?" from "Example." with a line-break tag instead of starting a new paragraph with a paragraph tag?

rmraya commented 5 years ago

yes

foolo commented 5 years ago

OK, then it's clear!
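The structural difference discussed in this thread (one paragraph containing a line break versus two separate paragraphs) can be seen by counting elements in `word/document.xml`. This is a rough sketch under stated assumptions: the element names `w:p` and `w:br` are the standard WordprocessingML paragraph and break elements, but the thread does not show the exact tags found in test3.docx, and the function name `paragraph_and_break_counts` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_and_break_counts(path):
    """Return (paragraph count, line-break count) for a .docx file.

    Hypothetical helper: page and column breaks are skipped, so only
    w:br elements that act as in-paragraph line breaks are counted.
    """
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    paragraphs = sum(1 for _ in root.iter(W + "p"))
    breaks = sum(
        1
        for br in root.iter(W + "br")
        # A missing w:type means a text-wrapping (line) break
        if br.get(W + "type") in (None, "textWrapping")
    )
    return paragraphs, breaks
```

A file where the break count is high relative to the paragraph count is a candidate for the behavior reported here, so a check like this could be run before conversion to warn the translator.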