lojban / jbofihe

Parser & analyser for Lojban
GNU General Public License v2.0

vlatai is not UTF-8 safe under certain specific conditions #13

Closed by rlpowell 4 years ago

rlpowell commented 4 years ago

Behold the following bizarre mess:

rlpowell@stodi> echo "随分前に同じ問題点についてメーリスに投稿したのですが、そのころはまだ言語改造の動きがあ" | iconv -f UTF-8
随分前に同じ問題点についてメーリスに投稿したのですが、そのころはまだ言語改造の動きがあ
rlpowell@stodi> echo "随分前に同じ問題点についてメーリスに投稿したのですが、そのころはまだ言語改造の動きがあ" | vlatai | iconv -f UTF-8
随分前に同じ問題点についてメーリスに投稿したのですが、そのころはまだ言語改造の動きがiconv: illegal input sequence at position 126

However, if you trim the string by one character, in either direction, it's fine:

rlpowell@stodi> echo "随分前に同じ問題点についてメーリスに投稿したのですが、そのころはまだ言語改造の動きが" | vlatai | iconv -f UTF-8
随分前に同じ問題点についてメーリスに投稿したのですが、そのころはまだ言語改造の動きが : UNMATCHED : 随分前に同じ問題点についてメーリスに投稿したのですが、そのころはまだ言語改造の動きが
rlpowell@stodi> echo "分前に同じ問題点についてメーリスに投稿したのですが、そのころはまだ言語改造の動きがあ" | vlatai | iconv -f UTF-8
分前に同じ問題点についてメーリスに投稿したのですが、そのころはまだ言語改造の動きがあ : UNMATCHED : 分前に同じ問題点についてメーリスに投稿したのですが、そのころはまだ言語改造の動きがあ

"How the hell does that work?", I hear you cry.

Once I realized that it was tied to length, it occurred to me that vlatai probably doesn't output the entire string, and in morf.c we have:

printf("%-25s : UNMATCHED : %s\n", s, s);

, and another similar %-25s line just below it.

Unfortunately, I haven't the slightest idea how to make this safe in C, besides just not trimming the input at all.

johnwcowan commented 4 years ago

The fix_utf8 function at https://gist.github.com/w-vi/67fe49106c62421992a2, given a buffer and its length, returns the length of the longest prefix that consists solely of complete UTF-8 characters, excluding any partial character at the end of the buffer. That should solve the problem.
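
For readers who want to see the idea without opening the gist, here is a minimal sketch of that kind of trimming. It is not the gist's code: the name `utf8_complete_prefix_len` is hypothetical, and the function only inspects the tail of the buffer for a cut-off character; it does not validate the rest of the input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (an illustration, not the gist's fix_utf8):
 * returns the length of the longest prefix of buf[0..len) that does not
 * end in a partial UTF-8 character. Only the tail is examined; earlier
 * bytes are not validated. */
size_t utf8_complete_prefix_len(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t i = len;
    size_t back = 0;

    /* Step back over at most three continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return len;            /* no lead byte found; leave input alone */

    unsigned char lead = buf[i - 1];
    size_t need;               /* bytes the final character should have */

    if (lead < 0x80)
        need = 1;              /* ASCII */
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return len;            /* not a valid lead byte; leave input alone */

    size_t have = len - (i - 1);   /* bytes actually present from the lead */

    /* If the last character is incomplete, cut just before its lead byte. */
    return have < need ? i - 1 : len;
}
```

A caller reading fixed-size chunks could process only the first `utf8_complete_prefix_len(buf, n)` bytes and carry the remaining one to three bytes over to the front of the next chunk.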

rlpowell commented 4 years ago

Turns out not to be related to the %-25s issue at all; "%-25s" will only lengthen a string, not trim it. The actual issue is:

 int main (int argc, char **argv) {/*{{{*/
  char buffer[128];
  char *start[256], **pstart;

I do not have any interest in the effort required to fix this properly, so I'm just pushing a bunch of bigger char arrays.
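
On the "%-25s" point: in standard C a field width is a minimum, so it pads short strings and leaves long ones whole, whereas a precision ("%.25s") is a maximum byte count and can truncate. The sketch below shows the difference. The wrapper names `format_padded` and `format_truncated` are hypothetical and are not taken from morf.c.

```c
#include <assert.h>
#include <stdio.h>

/* "%-*s": the width is a MINIMUM. Short strings are padded on the right
 * with spaces; longer strings are written in full. */
int format_padded(char *out, size_t outsz, const char *s, int width)
{
    assert(out != NULL && s != NULL);
    return snprintf(out, outsz, "%-*s", width, s);
}

/* "%.*s": the precision is a MAXIMUM number of bytes. This is the form
 * that truncates, and since it counts bytes it could cut a multi-byte
 * UTF-8 character in half. */
int format_truncated(char *out, size_t outsz, const char *s, int max_bytes)
{
    assert(out != NULL && s != NULL);
    return snprintf(out, outsz, "%.*s", max_bytes, s);
}
```

So a format like `%-25s` on its own cannot produce a half-written character; only a precision, or a too-small buffer upstream, can.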

rlpowell commented 4 years ago

Specifically: the problem is that, for the UTF-8 string in question, breaking the input into 128-byte chunks puts a chunk boundary in the middle of a multi-byte character rather than on a UTF-8 character boundary.
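
A quick way to check whether a given cut point has this problem is to look at the byte just after the cut: if it is a UTF-8 continuation byte (10xxxxxx), the cut is inside a character. The sketch below assumes nothing about how vlatai actually reads its input; `cut_splits_utf8_char` is a hypothetical name and the offset is simply a parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if cutting buf[0..len) immediately
 * before byte `offset` would land inside a multi-byte UTF-8 character,
 * and 0 otherwise (including when the cut is at or past the end). */
int cut_splits_utf8_char(const unsigned char *buf, size_t len, size_t offset)
{
    assert(buf != NULL || len == 0);

    if (offset == 0 || offset >= len)
        return 0;              /* nothing is split at the edges */

    /* A continuation byte right after the cut means the character that
     * started earlier has been cut in two. */
    return (buf[offset] & 0xC0) == 0x80;
}
```

For text made up entirely of three-byte characters, as in the Japanese example above, any offset that is not a multiple of three is inside a character, which is why shortening the input by one character can make the symptom disappear.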

rlpowell commented 4 years ago

Cleaned up in https://github.com/lojban/jbofihe/commit/398bbc0edf8a1881eca97e591209096f43464a54