For Indian Languages: Source text gets stored in a weird fashion in XLIFF format

rmraya / OpenXLIFF

An open source set of Java filters for creating, merging and validating XLIFF 1.2, 2.0 and 2.1 files.

https://www.maxprograms.com/products/openxliff.html

Eclipse Public License 1.0


Issue #7 (closed by vinayaksharmagh 3 years ago)

vinayaksharmagh commented 3 years ago

For Indian languages such as Hindi and Sanskrit, the "source" field contains metadata after each word in addition to the original text. This is unusual and does not happen with Western languages such as English, French, or German. It is a problem because CAT tools display the source field as-is in the source-language column.

Screenshot (211)

I am also attaching the original text file and the converted XLIFF files: OpenXLIFF.zip

Update: This happens with OFF (Microsoft Office) files but not with TEXT files.

rmraya commented 3 years ago

Hi,

The problem you face is quite common with Word documents created from scanned PDFs; it is not specific to Hindi or Sanskrit. Word tends to introduce too many font changes in the text to optimize its appearance. The documents look good in print, but they are bad for translation.

You need to prepare your Word documents before converting them to XLIFF. There are several tools that can remove the excess tags in your document (search "CodeZapper tags").
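
To illustrate why font changes matter, here is a minimal Python sketch of the cleanup idea: adjacent runs of text that share the same formatting are merged, so fewer formatting boundaries (and therefore fewer inline tags) end up in a segment. The `Run` class, `merge_runs`, `inline_tag_count`, and the font names are all hypothetical and exist only for this illustration; this is not how OpenXLIFF or CodeZapper are implemented, and a real Word run carries many more properties than a font name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical stand-in for a Word text run: a string plus its font."""
    text: str
    font: str


def merge_runs(runs):
    """Merge adjacent runs that share the same font; drop empty runs."""
    merged = []
    for run in runs:
        if not run.text:
            continue
        if merged and merged[-1].font == run.font:
            # Same formatting as the previous run: extend it instead of
            # starting a new one.
            merged[-1] = Run(merged[-1].text + run.text, run.font)
        else:
            merged.append(run)
    return merged


def inline_tag_count(runs):
    """Assumed model: one inline tag per boundary between runs in a segment."""
    return max(len(runs) - 1, 0)


# A sentence split into four runs, one of them in a different font.
fragmented = [
    Run("यह ", "Mangal"),
    Run("एक ", "Mangal"),
    Run("नमूना ", "Arial Unicode MS"),
    Run("है", "Mangal"),
]

merged = merge_runs(fragmented)
print(inline_tag_count(fragmented), inline_tag_count(merged))

# Setting every run to the same font lets the whole sentence collapse
# into a single run with no boundaries left.
uniform = merge_runs([Run(r.text, "Mangal") for r in fragmented])
print(len(uniform), inline_tag_count(uniform))
```

Under this simplified model, merging only helps where neighbouring runs really are formatted identically; a run in a different font still forces a boundary, which is why unifying the font removes the remaining tags.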

Thank you for the sample files. I'll try to improve the code using them. Don't expect miracles, though.

Rodolfo M. Raya

vinayaksharmagh commented 3 years ago

Thanks for the explanation! I will try the technique that you have mentioned.

rmraya commented 3 years ago

The current code produces cleaner XLIFF from your Hindi file. For the Sanskrit sample, there are still many tags, but they are simpler ones. If you set all text to the same font, the output is clean.
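
One way to check how clean a converted file is would be to count the inline elements inside each `<source>`. The following is a rough Python sketch, assuming an XLIFF 1.2 file; `count_inline_tags` and the `SAMPLE` document are hypothetical, and the sample was written by hand for illustration rather than produced by OpenXLIFF.

```python
import xml.etree.ElementTree as ET

# Namespace and inline element names as defined by XLIFF 1.2.
XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
INLINE = {"bpt", "ept", "ph", "it", "g", "x", "bx", "ex"}


def count_inline_tags(xliff_text):
    """Return {trans-unit id: number of inline elements in its <source>}."""
    root = ET.fromstring(xliff_text)
    counts = {}
    for unit in root.iter(f"{{{XLIFF_NS}}}trans-unit"):
        source = unit.find(f"{{{XLIFF_NS}}}source")
        if source is None:
            continue
        counts[unit.get("id")] = sum(
            1
            for element in source.iter()
            # Strip the "{namespace}" prefix before comparing names.
            if element.tag.split("}")[-1] in INLINE
        )
    return counts


# Hand-written sample: unit 1 is cluttered with tags, unit 2 is clean.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="sample.docx" source-language="hi" datatype="x-office">
    <body>
      <trans-unit id="1">
        <source>यह <ph id="1">&lt;font&gt;</ph>एक <ph id="2">&lt;font&gt;</ph>नमूना है</source>
      </trans-unit>
      <trans-unit id="2">
        <source>यह एक नमूना है</source>
      </trans-unit>
    </body>
  </file>
</xliff>"""

for unit_id, tags in count_inline_tags(SAMPLE).items():
    print(unit_id, tags)
```

Running a count like this before and after cleaning the Word document gives a quick indication of whether the preparation step actually reduced the tags.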