[Apertium] Improve handling of reserved characters in inputs to and outputs from Apertium - Githubissues

rmlockwood / FLExTrans

Machine Translation using FLEx, Apertium, and STAMP

MIT License

10 stars 2 forks

[Apertium] Improve handling of reserved characters in inputs to and outputs from Apertium #673

Closed rmlockwood closed 3 weeks ago

rmlockwood commented 3 months ago

Currently, FLExTrans doesn't handle Apertium's reserved characters very well. Here are two examples:

If there is an asterisk in a lemma, FT converts it to an underscore. Users have to be aware of this conversion and refer to lemmas with underscores in place of asterisks. Incidentally, asterisks can be common in FLEx lemmas: if the user sets the morph type to bound root or bound stem, FLEx automatically prepends a * to the lexeme form to make the headword.
If a forward slash is part of a symbol string in the bilingual lexicon, FT converts it to 'SLASH', which again requires the user to use this string in rules.

Other characters may be problematic either in the data stream going into Apertium or in the bilingual lexicon. All such characters should be identified and quoted or converted appropriately. Ideally, users should not have to change how they reference lemmas or affixes in the rules from what they see in the FLEx lexicon. (Except, of course, for the dot to underscore conversion, which I don't think we can avoid.)

This work should be done off of a new branch from master.

mr-martian commented 2 months ago

apertium-transfer doesn't actually care about the presence of *, but lt-proc does, and simply escaping it doesn't currently work (though I could probably make it work).
If we put \/ in the stream, the rules can just refer to a/b without issue.
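
To illustrate the escaping idea, here is a minimal Python sketch of backslash-escaping reserved characters before text goes into the stream, and undoing it afterwards. The function names and the `RESERVED` set are assumptions made for illustration; they are not taken from the FLExTrans code, and the exact set of characters each Apertium tool treats as special should be checked against the Apertium documentation.

```python
# Assumption: an illustrative set of characters to escape. Adjust it to
# whatever the Apertium tools in your pipeline actually treat as reserved.
RESERVED = set('\\^$/<>{}[]*@#+~')


def escape_reserved(text, reserved=RESERVED):
    """Prefix every reserved character in text with a backslash."""
    return ''.join('\\' + ch if ch in reserved else ch for ch in text)


def unescape_reserved(text):
    """Drop one level of backslash escaping."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == '\\':
            # Take the escaped character literally; a lone trailing
            # backslash is kept as-is.
            ch = next(chars, '\\')
        out.append(ch)
    return ''.join(out)


print(escape_reserved('a/b'))
print(escape_reserved('*lobwana1.1'))
print(unescape_reserved(escape_reserved('1/2')))
```

With escaping of this kind applied only at the boundary, a rule author could keep writing a/b exactly as it appears in the FLEx lexicon.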

mr-martian commented 2 months ago

The annoying part is that they only need to be escaped inside <test>. In <def-cat> and <out> the unescaped versions are fine.

mr-martian commented 2 months ago

Since we're mangling the files anyway, I suppose I could add a step to escape all the symbols that need escaping in the transfer file before running it. Then the user probably wouldn't need to escape anything at all.
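
A rough sketch of what such a preprocessing step might look like, using Python's standard xml.etree.ElementTree. It rests on several assumptions: that the values needing escaping sit in the `v` attribute of `<lit>` elements, that only elements inside `<test>` are affected, and that escaping a slash is enough for the example. The helper names and the sample XML are hypothetical, and the sample is a simplified fragment, not a complete transfer file.

```python
import xml.etree.ElementTree as ET


def escape_slash(text):
    # Hypothetical minimal escaper: handles backslash and slash only.
    return text.replace('\\', '\\\\').replace('/', '\\/')


def escape_literals_in_tests(root, escape=escape_slash, tags=('lit',)):
    """Escape the v attribute of literal elements found inside <test>.

    Elements outside <test> (for example in <def-cat> or <out>) are left
    untouched. Run this once on a copy of the file: it is not idempotent.
    """
    for test in root.iter('test'):
        for elem in test.iter():
            if elem.tag in tags and elem.get('v') is not None:
                elem.set('v', escape(elem.get('v')))
    return root


# Simplified, hypothetical fragment for illustration only.
SAMPLE = """<transfer>
  <def-cat n="noun"><cat-item lemma="a/b" tags="n"/></def-cat>
  <rule>
    <test><equal><clip pos="1" side="sl" part="lem"/><lit v="a/b"/></equal></test>
    <out><lu><lit v="a/b"/></lu></out>
  </rule>
</transfer>"""

root = escape_literals_in_tests(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding='unicode'))
```

The `tags` parameter is there so that other literal-bearing elements could be included if they turn out to need the same treatment.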

rmlockwood commented 2 months ago

Slash in the biling. lex. is now working, but asterisk doesn't seem to be. I updated the code in three places in the reserved-characters branch so that it no longer changes * to _. Now if you test with the German-Swedish Reserved characters project, you should get lieb1.1 in the biling. lex., but Apertium isn't translating it to älska1.1.

mr-martian commented 2 months ago

The bilingual.dix file in that project hasn't been regenerated and still has _.

rmlockwood commented 2 months ago

I changed it to use * in the Utils code. It still doesn't work. Please try running the Build Bilingual Lexicon module yourself with the latest code on the reserved-characters branch.

mr-martian commented 2 months ago

After 43c6f71 it works for me.

rmlockwood commented 1 month ago

The Apertium tools don't work when another symbol follows a symbol with a slash.

I have the following in my biling. lex.:

<e><p><l>*lobwana1.1<s n="n" /><s n="1/2" /></l><r>*lopwana1.1<s n="n" /><s n="1/2" /></r></p></e>

I have this in my source text:

^\*lobwana1.1<n><1/2><x>$

I get this result (no rules applied):

^*lopwana1.1<n><1<x>$

If the source text doesn't have the <x>, it works fine.

mr-martian commented 1 month ago

The problem there is in lt-proc and there's a fix in https://github.com/apertium/lttoolbox/pull/185

rmlockwood commented 3 weeks ago

Fixed in PR #726