mod_translation_updater.py: Parses escape codes in .tr files incorrectly

rubenwardy commented 8 months ago

Minetest version

Minetest 5.9.0-dev

Summary

The script uses regex to parse translation lines, which is incorrect. Take the following example:

Something@@= Something@@

The script will ignore this line as it thinks the @ is escaping the =, but it is actually adding @. Another issue: the script ignores invalid lines without warning!

The script also assumes that a translation is only ever on one line. This is not the case, given @\n:

A @n newline = Une @
nouvelle ligne

Steps to reproduce

I've written my own Python .tr file parser with unit tests here. Unit tests should be added to mod_translation_updater.py, it's unacceptable to have a parser without tests.

My implemention throws a lot of information away (ie: comments) so couldn't be taken directly, but inspiration can be taken from the approach (which is based on the .cpp code)

appgurueu commented 8 months ago

The current behavior is definitely incorrect. Are you sure that regex is inadequate, as opposed to being inadequately used, though? From a quick glance and the issues you have presented, I currently see no reason why this would need "more than regex" (a context-free or even context-sensitive grammar).

rubenwardy commented 8 months ago

The regex used is naive and incorrect. I think this can be captured by a regular language, but it's better to match the engine parser and do it character by character

Wuzzy2 commented 8 months ago

In my defense: I never claimed this script was validating, so garbage in, garbage out. I've written my own *.tr validator separately, named mtt_check.py at https://codeberg.org/Wuzzy/Minetest_Translation_Tools (feel free to tear it a new one as well lol). Also, this script indeed does make some assumptions about the input, these are documented in the script's README.

That having said, yes, this @@ problem is 100% a bug. Fun fact: I actually did my own tests (not unit tests tho) before submitting the script, but I didn't include them. I thought I tested all escapes pretty extensively. *shrug*. Ah, maybe I didn't test the end-of-line stuff well enough.

I agree that writing the parser in code rather than regex sounds like a better and less error-prone alternative. Although I’m not entirely convinced a regex-solution would be impossible to solve this problem. Maybe a complex regex could do the trick? Python Regex is powerful. But even then, code is probably more robust.

The script also assumes that a translation is only ever on one line.

Yeah, I forgot about that. To be honest, I really hate that feature of the TR language when @n works just fine. We should nuke it, it makes the parser more complex. (I know we can't nuke it due to compat. :-( ) I never saw a single TR file using this.

But in any case, throwing away information is absolutely not an option; preserving both comment texts and their position is important.

y5nw commented 8 months ago

That having said, yes, this @@ problem is 100% a bug. Fun fact: I actually did my own tests (not unit tests tho) before submitting the script, but I didn't include them. I thought I tested all escapes pretty extensively. shrug. Ah, maybe I didn't test the end-of-line stuff well enough.

I agree that writing the parser in code rather than regex sounds like a better and less error-prone alternative. Although I’m not entirely convinced a regex-solution would be impossible to solve this problem. Maybe a complex regex could do the trick? Python Regex is powerful. But even then, code is probably more robust.

I had a similar situation last year when writing Lua code to parse quoted C strings (I forgot the exact purpose). The approach I recall coming up with was to match (\+)(.) (i.e. multiple backslashes followed by a character; IIRC there were cases guarding against backslashes at the end of the file so this was irrelevant) and then unescaping that string based on the number of backslashes matched.

I assume a similar approach can be taken here by matching multiple @s with a following character:

Even number of @ -> the sequence is unescaped to n/2 repetitions of @ and the following character is kept as-is (e.g. @@@@c is unescaped to @@c)
Odd number of @ -> the sequence is unescaped to (n-1)/2 repetitions of @ and the following character is considered part of an escape sequence (e.g. @@@= is unescaped to @=)

minetest / modtools