stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

utf-8 bug #217

Closed cicido closed 1 year ago

cicido commented 1 year ago

In the common.c file:

    if (i == MAX_STRING_LENGTH - 1 && (word[i-1] & 0x80) == 0x80) {
        if ((word[i-1] & 0xC0) == 0xC0) {
            word[i-1] = '\0';
        } else if (i > 2 && (word[i-2] & 0xE0) == 0xE0) {
            word[i-2] = '\0';
        } else if (i > 3 && (word[i-3] & 0xF8) == 0xF0) {
            word[i-3] = '\0';
        }
    }

The modification is as follows:

    if (i == MAX_STRING_LENGTH - 1 && (word[i-1] & 0xC0) == 0x80) { /* test 10xx xxxx */
        if ((word[i-1] & 0xE0) == 0xC0) { /* test 110x xxxx */
            word[i-1] = '\0';
        } else if (i > 2 && (word[i-2] & 0xF0) == 0xE0) { /* test 1110 xxxx */
            word[i-2] = '\0';
        } else if (i > 3 && (word[i-3] & 0xF8) == 0xF0) { /* test 1111 0xxx */
            word[i-3] = '\0';
        }
    }

AngledLuffa commented 1 year ago

Want to make it a PR?

Also, honestly, if you could explain where you found the bug, that would be great. Not many people here are that familiar with the codebase.

cicido commented 1 year ago

Sorry, there is no bug. The difference between the two versions is that your checks test for bytes that start with 1, 11, 111, 11110, and mine test for bytes that start with 10, 110, 1110, 11110.
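For anyone following the bit masks, the byte prefixes discussed here can be written as small predicates. This is only an illustrative sketch: the function names are hypothetical and do not appear in GloVe's common.c. It shows how a mask-and-compare test selects a UTF-8 byte class by its leading bits.

```c
/* Hypothetical helpers for illustration only; not part of GloVe. */

/* 10xxxxxx: continuation byte inside a multi-byte sequence */
int is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

/* 11xxxxxx: any lead byte of a multi-byte sequence (2, 3, or 4 bytes) */
int is_any_lead(unsigned char b) { return (b & 0xC0) == 0xC0; }

/* 110xxxxx: lead byte of a 2-byte sequence only */
int is_lead2(unsigned char b) { return (b & 0xE0) == 0xC0; }

/* 1110xxxx: lead byte of a 3-byte sequence only */
int is_lead3(unsigned char b) { return (b & 0xF0) == 0xE0; }

/* 11110xxx: lead byte of a 4-byte sequence only */
int is_lead4(unsigned char b) { return (b & 0xF8) == 0xF0; }
```

Note that is_any_lead accepts every byte that is_lead2, is_lead3, or is_lead4 accepts, which is the practical difference between a "starts with 11" test and a "starts with 110" test.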

AngledLuffa commented 1 year ago

Figured that out based on the bit masks. Is the update better in terms of handling utf-8, though?

cicido commented 1 year ago

My modification has a bug. Look at your code: if ((word[i-1] & 0xC0) == 0xC0) {
This matches when word[i-1] starts with 110, 1110, or 11110, so your code works well. But in my code, if ((word[i-1] & 0xE0) == 0xC0) { only matches bytes that start with 110 and misses 1110 and 11110. Thanks!
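To make the goal of that check concrete, here is a minimal sketch of the general idea: if a buffer ends partway through a UTF-8 sequence, cut the string at the lead byte of that sequence. The function name trim_partial_utf8 is hypothetical, the logic is written differently from common.c, and it assumes the bytes before the tail are valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper, not GloVe code. If word[0..len-1] ends in the middle
   of a UTF-8 sequence, terminate the string at that sequence's lead byte.
   Returns the resulting length. */
size_t trim_partial_utf8(char *word, size_t len) {
    size_t back = 0; /* trailing continuation bytes (10xxxxxx) seen so far */
    while (back < 3 && back < len &&
           ((unsigned char)word[len - 1 - back] & 0xC0) == 0x80) {
        back++;
    }
    if (back == len) {
        return len; /* nothing but continuation bytes; leave unchanged */
    }

    unsigned char lead = (unsigned char)word[len - 1 - back];
    size_t need; /* total bytes the sequence starting at lead requires */
    if ((lead & 0x80) == 0x00) {
        need = 1; /* 0xxxxxxx: ASCII */
    } else if ((lead & 0xE0) == 0xC0) {
        need = 2; /* 110xxxxx */
    } else if ((lead & 0xF0) == 0xE0) {
        need = 3; /* 1110xxxx */
    } else if ((lead & 0xF8) == 0xF0) {
        need = 4; /* 11110xxx */
    } else {
        return len; /* not a valid lead byte; leave unchanged */
    }

    if (back + 1 < need) { /* sequence is incomplete: drop it */
        word[len - 1 - back] = '\0';
        return len - 1 - back;
    }
    return len;
}
```

A lone lead byte at the end is always cut, whichever of the three lead prefixes it carries. That is the case the 0xC0 == 0xC0 test covers in one comparison.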