Open rubenwardy opened 9 months ago
The current behavior is definitely incorrect. Are you sure that regex is inadequate, as opposed to being inadequately used, though? From a quick glance and the issues you have presented, I currently see no reason why this would need "more than regex" (a context-free or even context-sensitive grammar).
The regex used is naive and incorrect. I think this can be captured by a regular language, but it's better to match the engine parser and do it character by character
In my defense: I never claimed this script was validating, so garbage in, garbage out. I've written my own *.tr validator separately, named mtt_check.py
at https://codeberg.org/Wuzzy/Minetest_Translation_Tools (feel free to tear it a new one as well lol). Also, this script indeed does make some assumptions about the input, these are documented in the script's README.
That having said, yes, this @@
problem is 100% a bug. Fun fact: I actually did my own tests (not unit tests tho) before submitting the script, but I didn't include them. I thought I tested all escapes pretty extensively. *shrug*. Ah, maybe I didn't test the end-of-line stuff well enough.
I agree that writing the parser in code rather than regex sounds like a better and less error-prone alternative. Although I’m not entirely convinced a regex-solution would be impossible to solve this problem. Maybe a complex regex could do the trick? Python Regex is powerful. But even then, code is probably more robust.
The script also assumes that a translation is only ever on one line.
Yeah, I forgot about that. To be honest, I really hate that feature of the TR language when @n
works just fine. We should nuke it, it makes the parser more complex. (I know we can't nuke it due to compat. :-( ) I never saw a single TR file using this.
But in any case, throwing away information is absolutely not an option; preserving both comment texts and their position is important.
That having said, yes, this
@@
problem is 100% a bug. Fun fact: I actually did my own tests (not unit tests tho) before submitting the script, but I didn't include them. I thought I tested all escapes pretty extensively. shrug. Ah, maybe I didn't test the end-of-line stuff well enough.I agree that writing the parser in code rather than regex sounds like a better and less error-prone alternative. Although I’m not entirely convinced a regex-solution would be impossible to solve this problem. Maybe a complex regex could do the trick? Python Regex is powerful. But even then, code is probably more robust.
I had a similar situation last year when writing Lua code to parse quoted C strings (I forgot the exact purpose). The approach I recall coming up with was to match (\+)(.)
(i.e. multiple backslashes followed by a character; IIRC there were cases guarding against backslashes at the end of the file so this was irrelevant) and then unescaping that string based on the number of backslashes matched.
I assume a similar approach can be taken here by matching multiple @
s with a following character:
@
-> the sequence is unescaped to n/2 repetitions of @
and the following character is kept as-is (e.g. @@@@c
is unescaped to @@c
)@
-> the sequence is unescaped to (n-1)/2 repetitions of @
and the following character is considered part of an escape sequence (e.g. @@@=
is unescaped to @=
)
Minetest version
Summary
The script uses regex to parse translation lines, which is incorrect. Take the following example:
The script will ignore this line as it thinks the
@
is escaping the=
, but it is actually adding@
. Another issue: the script ignores invalid lines without warning!The script also assumes that a translation is only ever on one line. This is not the case, given
@\n
:Steps to reproduce
I've written my own Python .tr file parser with unit tests here. Unit tests should be added to mod_translation_updater.py, it's unacceptable to have a parser without tests.
My implemention throws a lot of information away (ie: comments) so couldn't be taken directly, but inspiration can be taken from the approach (which is based on the .cpp code)