matecat / MateCat-Filters

Convert any file to XLIFF and back with perfectly preserved formatting! Super easy API, plenty of supported formats and advanced segmentation.
http://filters.matecat.com
GNU Lesser General Public License v3.0

Bad segmentation in Arabic project #27

Open uhallac opened 6 years ago

uhallac commented 6 years ago

Can you please check the following Arabic project? https://www.matecat.com/translate/33409docx/ar-SA-en-GB/1116612-3ffd0e90c8f0

The segmentation seems to have failed. Do you think this is a Matecat-Filters issue?

Thank you.

giusilvano commented 6 years ago

Hi @uhallac!

Your document contains no punctuation, so the segmenter has no hints about where sentences begin and end. Moreover, all the text is in a single paragraph, so structurally it is correct not to split it into more segments.

Can you please explain in more detail the segmentation you were expecting?

uhallac commented 6 years ago

Hi @giusilvano,

Where are the tags coming from? I don't see any special characters between the words, only spaces. Thank you.

giusilvano commented 6 years ago

I checked the source file of your project, and each word seems to carry an ID related to a past revision check. The filters produce tags to preserve these IDs in the target file. We have to discuss internally whether this is useful or not. Can you confirm that you used Word's revisions feature on this text?
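For readers who want to inspect a file of their own, here is a minimal sketch. It assumes the per-word IDs are Word's `w:rsid*` attributes on run elements (`w:r`) in `word/document.xml`; the thread does not name them, so treat that as a guess. The function names `run_stats` and `run_stats_from_docx` are hypothetical. A high number of ID-carrying runs relative to the number of words would be consistent with the one-tag-per-word behaviour described above.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace used by .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def run_stats(document_xml):
    """Return (total runs, runs carrying at least one w:rsid* attribute).

    Assumption: the IDs mentioned in the thread are w:rsid* attributes.
    """
    root = ET.fromstring(document_xml)
    runs = root.findall(f".//{{{W_NS}}}r")
    prefix = f"{{{W_NS}}}rsid"
    tagged = sum(
        1 for run in runs
        if any(name.startswith(prefix) for name in run.attrib)
    )
    return len(runs), tagged


def run_stats_from_docx(path):
    """Read word/document.xml from a .docx archive and compute run_stats."""
    with zipfile.ZipFile(path) as archive:
        return run_stats(archive.read("word/document.xml"))
```

Calling `run_stats_from_docx("sample.docx")` on a local copy (the file name is a placeholder) gives a quick idea of how fragmented the paragraph is before it reaches the filters.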

uhallac commented 6 years ago

The file was created using only one paragraph from a larger client document that has the same issue. I'm not sure whether the revisions feature was used on it; I couldn't detect any revisions in the LibreOffice editor.

As far as I know, Matecat doesn't let such documents get analyzed at all, am I wrong? This restriction, by the way, is a huge obstacle when using the Matecat API to create projects automatically. In my opinion, the latest revision of a document should be used in such cases.

Thank you.

giusilvano commented 6 years ago

You are right, MateCat does not allow files with revisions. Our position is that a file with revisions contains a lot of comments and suggestions that must be accepted or rejected by a human in order to bring the document to a consistent state. Moreover, implementing automatic acceptance of revisions is really hard!
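When projects are created automatically through the API, a pre-upload check can at least flag such files early. The sketch below is not MateCat's actual check, which the thread does not describe. It assumes that "revisions" means Word tracked-change elements (`w:ins`, `w:del`, `w:moveFrom`, `w:moveTo`) in `word/document.xml`, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace used by .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Element names assumed to indicate unresolved tracked changes.
TRACKED_TAGS = ("ins", "del", "moveFrom", "moveTo")


def has_tracked_changes(document_xml):
    """Return True if the XML contains any tracked-change element."""
    root = ET.fromstring(document_xml)
    return any(
        root.find(f".//{{{W_NS}}}{tag}") is not None
        for tag in TRACKED_TAGS
    )


def docx_has_tracked_changes(path):
    """Check word/document.xml inside a .docx archive."""
    with zipfile.ZipFile(path) as archive:
        return has_tracked_changes(archive.read("word/document.xml"))
```

A script could call `docx_has_tracked_changes` before uploading and route flagged files to a human for review. It only detects revisions and does not try to accept them.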

Anyway, this issue requires a fix in the underlying framework, Okapi. I will report the problem to them, but I can't estimate how long it will take to be addressed.

uhallac commented 6 years ago

Thank you for the information.