keeleinstituut / tv-tolkevarav

Tõlkevärav (Translation Hub)

Matecat inline tag issue investigation and estimation (BE) #779

Open MariusJulius opened 4 weeks ago

MariusJulius commented 4 weeks ago

Problem: whenever the XML text contains an inline tag, such as superscript 1 or italic, the CAT tool splits it off onto a separate line. Is there any way to disable this?

The tags have to stay in the same segment as the surrounding text. Otherwise, the translation memory becomes completely unreliable. memoQ splits them, Trados doesn't.

Example 1, the XML has: <![CDATA[§ 61. ]]> In the CAT tool it is nicely on one line.

Example 2, the XML has: If the shares of a joint-stock company are admitted to trading on a regulated securities market (hereinafter stock-exchange company), participation in the general meeting of such a joint-stock company takes place by electronic means in § 331 of the General Part of the Civil Code Act according to the established procedure.</plain text>

When translating a Word file, this segmentation problem does not occur; italics etc. are recognized correctly. I suspect something in the XML is different, so that when the XLIFF is generated the CAT tool cannot keep the sentence together as one segment.

By the way, I tried this .NET framework in memoQ again. This time the file imported without errors, but it did not give the expected result: the segments are still scattered (pictures included). So I am not sure that .NET is the answer, but I could be wrong. Trados keeps segments together without any filters having to be selected. I did some research on regular expressions, the "regex tagger", etc., but the easiest and most practical way was to specify the tags in question as inline tags when processing XML files in memoQ. That settled the matter. Is it possible to do something similar in Matecat?

image image

We need an estimate of how this could be achieved. Otherwise users will face a lot of complications, as there is a large number of XML files with large volumes.

kadmit commented 3 weeks ago

@MariusJulius @NeleKo

I found out how we can implement a temporary solution:

It can be done by replacing the HTML tags with their HTML-encoded representation, for example: <i> tag can be replaced with &lt;i&gt;

Such replacements make segmentation work as expected:

Image

It can be done with a console tool such as sed. Otherwise we would need to load the whole file into memory, and there may not be enough of it, because this happens during project creation rather than in the background.

We can make these replacements before sending files to MateCat filters and after receiving the final files.

Before doing this, it would be good to get confirmation from @thenouan 🙂

Note: it can also be done manually before uploading files to the system.
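The replacement described above could be sketched roughly as follows in Python, processing the file line by line so that it never has to be held in memory as a whole. This is only an illustration, not the project's implementation: the tag set (`i`, `sup`) and the names `escape_inline_tags`, `restore_inline_tags` and `convert_file` are assumptions, and tags with attributes are not handled.

```python
import re

# Assumption: the inline tags to protect are <i> and <sup>. The thread only
# names <i> explicitly, so adjust this tuple to the tags in the real files.
INLINE_TAGS = ("i", "sup")

_names = "|".join(map(re.escape, INLINE_TAGS))
# Matches <i>, </i>, <sup>, </sup> (attribute-less tags only).
_RAW = re.compile(rf"<(/?)({_names})>")
# Matches the encoded forms &lt;i&gt;, &lt;/i&gt;, &lt;sup&gt;, &lt;/sup&gt;.
_ENCODED = re.compile(rf"&lt;(/?)({_names})&gt;")


def escape_inline_tags(text: str) -> str:
    """Replace the selected inline tags with their HTML-encoded form."""
    return _RAW.sub(r"&lt;\1\2&gt;", text)


def restore_inline_tags(text: str) -> str:
    """Turn the encoded inline tags back into real tags."""
    return _ENCODED.sub(r"<\1\2>", text)


def convert_file(src_path, dst_path, transform):
    """Apply `transform` line by line, so the whole file is never in memory."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(transform(line))
```

`convert_file(path_in, path_out, escape_inline_tags)` would run before the file goes to the MateCat filters, and the same call with `restore_inline_tags` after the final file comes back. One caveat of this sketch: restoring also converts any `&lt;i&gt;` that was already encoded in the original file, so the round trip is only exact if the source contains no such text.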

NeleKo commented 3 weeks ago

@kadmit Thanks. I think this might be a good temporary solution for now. A better or different solution can be built in dev stage II or III. How does this affect DOCX or other file types?

kadmit commented 3 weeks ago

@NeleKo We can apply these replacements only for .xml files, so other file types will not be affected.
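A minimal sketch of that restriction, assuming the decision is made on the uploaded file name alone (the function name `should_escape_inline_tags` is hypothetical):

```python
from pathlib import Path


def should_escape_inline_tags(filename: str) -> bool:
    """Return True only for files whose extension is .xml (case-insensitive)."""
    return Path(filename).suffix.lower() == ".xml"
```

With this check, DOCX and other formats skip the replacement step entirely and go to the filters unchanged.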

NeleKo commented 3 weeks ago

One more thing to check. Here is a screenshot of how the XML conversion works right now: XML convert

Text sample from the live RT website: screen

The new proposal in the CAT tool: 1. I think something is still not working with the XML conversion. The new change introduced issues that were not present before: there are unnecessary line breaks in the sentences.

NeleKo commented 2 weeks ago

One user shared an idea today about the XML-to-XLIFF conversion and the segmentation issue:

The CAT tool creates different tags and/or segments in the XLIFF where they are not necessary:

  • E.g. the tag text inside the XML: maybe this is the main problem with the segmentation when creating the XLIFF right now?
  • Or maybe and are the main reason the segmentation is not working and unnecessary line breaks cause segmentation issue with full sentences?
  1. See the sample from the user XML file:

Inside XML:

1 Sissesõidukeeldude riikliku registri asutab Vabariigi Valitsus ja registri põhimääruse kehtestab määrusega.

![1](https://github.com/keeleinstituut/tv-tolkevarav/assets/119607967/c91316dc-8e1c-4802-8a0d-3d6a3a360df9)

  2. Maybe we can read the superscript and italic tags better, when the format is different (see below sample):

![2](https://github.com/keeleinstituut/tv-tolkevarav/assets/119607967/f460454d-8788-46a7-a900-774a34a38d68)

2 Välismaalase Eestis viibimise seaduslikud alused (edaspidi viibimisalused) sätestab välismaalaste seadus.

2 Käesoleva seaduse § 154 lõikes 1 nimetatud hädaolukorras ja juhul, kui välismaalase asukoht Eestis ei ole Politsei- ja Piirivalveametile teada, võib haldusorgan jätta muud haldusakti kättetoimetamise viisid kohaldamata ning avaldada haldusakti adressaadi isikuandmed ja haldusakti resolutiivosa haldusorgani veebilehel. Haldusakti resolutiivosa haldusorgani veebilehel avaldamisega loetakse haldusakt välismaalasele kättetoimetatuks ja jõustunuks.

What would work better?

MariusJulius commented 6 days ago

1) We can implement some quick fix replacements (will see how much can be done)
2) Long term solution needs work with Matecat filters (Java), which takes more time (maintenance or phase 2)