cicido closed this 1 year ago
Want to make it a PR?
Also, honestly, if you could explain where you found the bug, that would be great. Not many people here are that familiar with the codebase.
Sorry, there is no bug. The difference is that your code tests whether the byte starts with 1, 11, 111, 11110, while mine tests whether it starts with 10, 110, 1110, 11110.
Figured that out based on the bit masks. Is the update better in terms of handling utf-8, though?
My modification has bugs. Let's look at your code:
if ((word[i-1] & 0xC0) == 0xC0) {
This matches when word[i-1] starts with 110, 1110, or 11110, so your code works well. But in my code:
if ((word[i-1] & 0xE0) == 0xC0) {
It only matches bytes that start with 110, and will miss 1110 and 11110.
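To make the difference between the two masks concrete, here is a small sketch in C. The function names are hypothetical and exist only for this illustration; they are not part of common.c. The sample bytes are the lead bytes of a 2-byte, a 3-byte, and a 4-byte UTF-8 sequence.

```c
#include <assert.h>

/* Hypothetical name: the check used in the existing code.
   True for any byte of the form 11xx xxxx. */
static int matches_original(unsigned char b) {
    return (b & 0xC0) == 0xC0;
}

/* Hypothetical name: the check from the proposed modification.
   True only for bytes of the form 110x xxxx. */
static int matches_modified(unsigned char b) {
    return (b & 0xE0) == 0xC0;
}

/* Illustration only: compare both checks on three lead bytes. */
static void check_masks_example(void) {
    /* 0xC3 = 1100 0011, lead byte of a 2-byte sequence */
    assert(matches_original(0xC3) && matches_modified(0xC3));
    /* 0xE4 = 1110 0100, lead byte of a 3-byte sequence */
    assert(matches_original(0xE4) && !matches_modified(0xE4));
    /* 0xF0 = 1111 0000, lead byte of a 4-byte sequence */
    assert(matches_original(0xF0) && !matches_modified(0xF0));
}
```

So the 0xE0 mask only catches a trailing 2-byte lead byte, while the 0xC0 mask catches the lead byte of any multi-byte sequence.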
thanks!!!
In the common.c file:

    if (i == MAX_STRING_LENGTH - 1 && (word[i-1] & 0x80) == 0x80) {
        if ((word[i-1] & 0xC0) == 0xC0) {
            word[i-1] = '\0';
        } else if (i > 2 && (word[i-2] & 0xE0) == 0xE0) {
            word[i-2] = '\0';
        } else if (i > 3 && (word[i-3] & 0xF8) == 0xF0) {
            word[i-3] = '\0';
        }
    }
The modification is as follows:

    if (i == MAX_STRING_LENGTH - 1 && (word[i-1] & 0xC0) == 0x80) {  /* test 10xx xxxx */
        if ((word[i-1] & 0xE0) == 0xC0) {                            /* test 110x xxxx */
            word[i-1] = '\0';
        } else if (i > 2 && (word[i-2] & 0xF0) == 0xE0) {            /* test 1110 xxxx */
            word[i-2] = '\0';
        } else if (i > 3 && (word[i-3] & 0xF8) == 0xF0) {            /* test 1111 0xxx */
            word[i-3] = '\0';
        }
    }
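For comparison, here is a rough sketch of a different way to drop a partial UTF-8 sequence at the end of a buffer: step back over the trailing continuation bytes, read the expected length from the lead byte, and cut only if the sequence is incomplete. This is not the code in common.c, and both function names are hypothetical. It assumes buf holds len bytes followed by a terminating '\0'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: expected sequence length for a lead byte,
   or 0 if b is a continuation byte or not a valid lead byte. */
static size_t utf8_seq_len(unsigned char b) {
    if ((b & 0x80) == 0x00) return 1;  /* 0xxx xxxx */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110x xxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110 xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 1111 0xxx */
    return 0;
}

/* Hypothetical helper: if buf ends in an incomplete multi-byte
   sequence, cut the string at that sequence's lead byte.
   Returns the (possibly shorter) length. */
static size_t utf8_trim_partial(char *buf, size_t len) {
    size_t i = len;
    /* Step back over at most three trailing continuation bytes (10xx xxxx). */
    while (i > 0 && len - i < 3 && ((unsigned char)buf[i - 1] & 0xC0) == 0x80) {
        i--;
    }
    if (i == 0) return len;  /* nothing before the continuation bytes; leave as is */

    size_t need = utf8_seq_len((unsigned char)buf[i - 1]);
    size_t have = len - (i - 1);  /* bytes present from that byte to the end */
    if (need > have) {
        buf[i - 1] = '\0';
        return i - 1;
    }
    return len;
}

/* Illustration only: "a" followed by the first two bytes of a 3-byte character. */
static void trim_example(void) {
    char partial[] = "a\xE4\xB8";
    assert(utf8_trim_partial(partial, 3) == 1);
    assert(partial[1] == '\0');
}
```

Unlike the snippets above, this sketch leaves a complete trailing character in place, and it leaves invalid input unchanged rather than trying to repair it.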