Closed leezu closed 5 years ago
Thanks for the report @leezu, looks like a bug to me
If this is the case, how can Fasttext models loaded in Gensim even work? Wouldn't the similarity results be basically random for OOV words?
This sounds like a critical issue to me. @jayantj @manneshiva thoughts?
@piskvorky good question, probably nobody tested it with non-English models.
Anyway, we need to check it.
For your reference, this was also not implemented correctly in the fastText C++ implementation. It is fixed there now. More info https://github.com/facebookresearch/fastText/issues/553
Hi @leezu, I see the fix in FB repo: https://github.com/facebookresearch/fastText/commit/9c9b9b2eca4fdc0182182cb69234fffbb847d820
I don't quite understand what we should do (the input can be non-UTF-8 too), can you give me some advice?
UPD: A `std::string` is an instantiation of the `std::basic_string` template with a `char` element type, so in this case:

```cpp
std::string t = "привет";
std::cout << t.size(); // 12 (not 6): six Cyrillic characters, two UTF-8 bytes each
```
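The same character-vs-byte mismatch is easy to see from Python (a quick illustration of my own, not code from gensim):

```python
# Character count vs UTF-8 byte count for the same string
word = u'привет'
print(len(word))                  # 6 characters
print(len(word.encode('utf-8')))  # 12 bytes, matching std::string::size()
```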
You are right, our implementation is not compatible with FB's.
@menshikh-iv, essentially it is necessary to make sure something along the lines of the following Python code is used (equivalent to the C++ code you linked in the previous post). I haven't looked into the gensim Cython implementation, but at least the pure Python version is incorrect, as it considers the characters rather than the bytes composing an ngram.
```python
import sys
import numpy as np

def _byte_to_int(b):
    return b

if sys.version_info[0] == 2:
    # Python 2 iterates over bytes as 1-char strings, so convert explicitly
    def _byte_to_int(b):
        return ord(b)

# Hash the UTF-8 *bytes* of each ngram, not its characters.
# `ngrams()` stands for whatever function produces the ngrams of the word.
word_enc = bytearray((u'<' + word + u'>').encode('utf-8'))
hashes = []
for ngram in ngrams(memoryview(word_enc)):
    h = np.uint32(2166136261)  # FNV-1a offset basis
    for c in ngram:
        # fastText XORs the *signed* char value, hence the int8 round-trip
        h = np.uint32(h ^ np.uint32(np.int8(_byte_to_int(c))))
        h = np.uint32(h * np.uint32(16777619))  # FNV prime
    hashes.append(h)
```
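For reference, here is a self-contained Python 3 sketch of the same idea without the undefined `ngrams` helper. The names `ft_hash`/`ngram_hashes` and the `min_n`/`max_n` defaults are mine, not gensim's API; the ngrams range over characters but are hashed over their UTF-8 bytes, as in fastText's `computeSubwords`:

```python
def ft_hash(data):
    """FNV-1a over bytes, reproducing fastText's signed-char behaviour."""
    h = 2166136261  # FNV-1a offset basis
    for b in data:
        # fastText XORs the *signed* char value; sign-extend bytes >= 0x80
        sb = b if b < 128 else b | 0xFFFFFF00
        h = (h ^ sb) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime
    return h

def ngram_hashes(word, min_n=3, max_n=6):
    # n-grams range over *characters* of '<word>', but each ngram is
    # hashed via its UTF-8 byte representation
    chars = u'<' + word + u'>'
    hashes = []
    for n in range(min_n, max_n + 1):
        for i in range(len(chars) - n + 1):
            hashes.append(ft_hash(chars[i:i + n].encode('utf-8')))
    return hashes
```

For pure-ASCII input `ft_hash` coincides with plain FNV-1a; the two only diverge on bytes ≥ 0x80, which is exactly where the old character-based hashing went wrong.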
@leezu I am trying to fix these issues. What would you say is a good test condition to ensure the gensim implementation is compatible with Facebook's?
Does `fb_hash("é") == gensim_hash("é")` cover all the cases?
Also, I am not very experienced with handling unicode strings, so please tell me if I understand correctly: the current gensim hash function hashes based on the `ord` value (i.e. the Unicode code point) of each character in the string, whereas Facebook hashes the UTF-8 bytes. This leads to different hash values for the same strings.
Ideally, shouldn't changing the gensim hash function to operate on the encoded bytes work? Maybe not, because Python 2 and Python 3 don't iterate over bytes the same way, so in Python 2 we have to convert each byte to an int manually, as you have done in `_byte_to_int`.
Do I understand correctly?
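To make the `ord`-vs-bytes distinction concrete (my own illustration):

```python
# The code point of 'é' is one number; its UTF-8 encoding is two bytes
s = u'é'
print(ord(s))                              # 233: the Unicode code point
print(list(bytearray(s.encode('utf-8'))))  # [195, 169]: the UTF-8 bytes
```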
For now, I will try to incorporate your "pseudo" code into gensim.
@aneesh-joshi I suggest you train a fastText model using the C++ implementation on a small test corpus containing unicode symbols (i.e. a sentence or a few sentences). Then use the fastText C++ code again to print out the vectors for all words, and store these vectors. The test case should be: given the binary model, gensim must match the vectors computed by the C++ implementation exactly.
The hash function must operate on the bytes of the unicode text. One unicode character may consist of multiple bytes. This is trivial in C++ but a bit involved in Python, due to the differences between Py2 and Py3 and the fact that Python generally abstracts the bytes away and simply handles characters.
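The Py2/Py3 difference mentioned here boils down to how bytes iterate (a minimal check of my own):

```python
import sys

data = b'ab'
if sys.version_info[0] == 2:
    # Python 2: iterating over bytes yields 1-character strings
    assert list(data) == ['a', 'b']
else:
    # Python 3: iterating over bytes yields ints directly
    assert list(data) == [97, 98]
```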
You can also check the GluonNLP implementation: https://github.com/dmlc/gluon-nlp/blob/f9be6cd2c3780b3c7e11a1aca189bf8129bc0c0d/gluonnlp/vocab/subwords.py#L171-L275 The pseudo-code above is essentially a simplified version of the implementation in GluonNLP.
@leezu @menshikh-iv @piskvorky Update: I was able to reproduce the error and fix it.
I trained C++ fastText on a simple .txt file which had unicode and non-unicode characters. This is the file:

```
This file contains some unicode sentences like привет as provided by Ivan Menshikh. I will also include some others like : é é and ααααααα
αα α ααα α
Let's hope this works.
```
I ran:

```shell
./fasttext skipgram -minCount 0 -input my_test_file.txt -output my_model
# This created a `my_model.bin` and `my_model.vec` file
cat my_model.vec
```

which printed (truncated):

```
.
.
.
ααααααα -0.0030363 -0.0027762 . . . -0.0018107 0.001506
.
.
```
Then:

```python
from gensim.models import FastText

model = FastText.load_fasttext_format("my_model.bin")
print(model.wv["ααααααα"])
```

```
array([ 1.9879919e-03,  1.7319367e-03,  6.5513700e-04,  1.2212876e-03,
        .
        .
       -2.0760519e-03,  9.5942023e-04,  1.8221042e-04,  2.6060001e-04],
      dtype=float32)
```
This is clearly different from the original FT vector. (Gensim gives the same result as FT only for non-unicode strings.)
I then checked out my branch with @leezu's suggested changes and ran the same commands as above:

```
array([-3.0362648e-03, -2.7762062e-03, -1.8186163e-03,  8.1895100e-04,
        .
        .
        9.9862483e-04, -1.6497656e-03, -1.8107096e-03,  1.5059930e-03],
      dtype=float32)
```

This matches the vector printed by the C++ implementation.
My test case is hanging/going into an infinite loop in some cases. I will investigate further. I will also check whether this works on Py2.
Nice progress @aneesh-joshi :+1:
please test it as fully as possible.
Also, I'm worried that we possibly generate the ngrams themselves incorrectly (for the same reason), please check this too. And one last thing: have a look at how this affects performance (compared to the current, incorrect variant, just to know); maybe it's better to encode the full string once instead of encoding each ngram independently.
keep me updated!
Fasttext uses the hashing trick to map ngrams to an index in [0, N]. Gensim supports loading models trained with the original fastText implementation from Facebook Research. It is therefore important that both gensim and the original implementation use the same hash function, to make sure that ngrams are associated with the correct vectors.
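The hashing trick can be sketched as follows (an illustration, not the thread's code; `2000000` is fastText's default `-bucket` value, and the `nwords` offset mirrors the lookup logic in fastText's `Dictionary`):

```python
def ngram_bucket(ngram_hash, nwords, bucket=2000000):
    # ngram vectors live in a fixed-size table after the nwords word vectors;
    # two different ngrams may collide in the same bucket
    return nwords + ngram_hash % bucket
```

Because the table index is derived from the hash, any disagreement in the hash function silently points an ngram at a different (wrong) vector rather than raising an error.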
The original implementation is in C++ and considers ngrams as `std::basic_string<char>`, i.e. a sequence of bytes. However, the gensim Python implementation considers an ngram as a sequence of unicode characters. The two are not equivalent.
I am not really familiar with gensims code, so I may have overlooked something.
I assume that the Cython implementation also treats iteration over a `unicode` string as iteration over unicode characters, so the same consideration as above would apply, but I haven't verified this.

Edit: I checked the inputs to the hash function of the original fastText implementation. There, e.g. `<α>` is represented by `3c ffffffce ffffffb1 3e` (the `ffffff` prefixes come from sign-extending the `char` bytes of `α`).
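That byte pattern can be reproduced from Python 3 (illustration only):

```python
# Reproduce the signed-char view of '<α>' from the C++ debug output
for b in u'<α>'.encode('utf-8'):
    signed = b if b < 128 else b - 256   # value as a C signed char
    print('%x' % (signed & 0xFFFFFFFF))  # 3c, ffffffce, ffffffb1, 3e
```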