piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

fasttext ft_hash and unicode handling #2059

Closed leezu closed 5 years ago

leezu commented 6 years ago

fastText uses the hashing trick to map ngrams to an index in [0, N]. Gensim supports loading models trained with the original fastText implementation from Facebook Research. It is therefore important that both gensim and the original implementation use the same hash function, so that ngrams are associated with the correct vectors.

The original implementation is in C++ and treats an ngram as a std::basic_string<char>, i.e. a sequence of bytes (this is the 32-bit FNV-1a hash):

uint32_t Dictionary::hash(const std::string& str) const {
  uint32_t h = 2166136261;
  for (size_t i = 0; i < str.size(); i++) {
    h = h ^ uint32_t(str[i]);
    h = h * 16777619;
  }
  return h;
}

However, the gensim Python implementation treats an ngram as a sequence of Unicode characters:

        h = np.uint32(2166136261)
        for c in string:
            h = h ^ np.uint32(ord(c))
            h = h * np.uint32(16777619)
        return h

The two are not equivalent. Consider:

In [1]: import numpy as np

In [2]: def hash_unicode(s):
   ...:     h = np.uint32(2166136261)
   ...:     for c in s:
   ...:         h = h ^ np.uint32(ord(c))
   ...:         h = h * np.uint32(16777619)
   ...:     return h
   ...:
   ...:
   ...: def hash_bytes(s):
   ...:     h = np.uint32(2166136261)
   ...:     s = s.encode('utf-8')
   ...:     for c in s:
   ...:         h = h ^ np.uint32(c)
   ...:         h = h * np.uint32(16777619)
   ...:     return h
   ...:
   ...:

In [3]: hash_unicode("é")
Out[3]: 1812687940

In [4]: hash_bytes("é")
Out[4]: 513665217

I am not really familiar with gensim's code, so I may have overlooked something.

I assume the Cython implementation also iterates over a Unicode string as a sequence of Unicode characters, so the same consideration as above would apply, but I haven't verified this.

Edit: I checked the inputs to the hash function of the original fastText implementation. There, e.g., <α> is represented by 3c ffffffce ffffffb1 3e
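
That byte pattern is explained by sign extension: `char` is signed on most platforms, so the C++ cast `uint32_t(str[i])` sign-extends bytes >= 0x80. A small Python check (illustrative, not gensim code) reproduces the representation above:

```python
def as_signed_uint32(byte):
    """Reinterpret a byte as a signed char, then as an unsigned 32-bit int,
    mirroring C++ uint32_t(char) on platforms where char is signed."""
    signed = byte - 256 if byte >= 128 else byte
    return signed & 0xFFFFFFFF

ngram = "<α>".encode("utf-8")
print([format(as_signed_uint32(b), "x") for b in ngram])
# -> ['3c', 'ffffffce', 'ffffffb1', '3e']
```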

menshikh-iv commented 6 years ago

Thanks for the report @leezu, looks like a bug to me

piskvorky commented 6 years ago

If this is the case, how can Fasttext models loaded in Gensim even work? Wouldn't the similarity results be basically random for OOV words?

This sounds like a critical issue to me. @jayantj @manneshiva thoughts?

menshikh-iv commented 6 years ago

@piskvorky good question, probably nobody tests it with non-English models.

Anyway, we need to check it.

leezu commented 6 years ago

For your reference, this was also not implemented correctly in the fastText C++ implementation. It is fixed there now. More info https://github.com/facebookresearch/fastText/issues/553

menshikh-iv commented 6 years ago

Hi @leezu, I see the fix in FB repo: https://github.com/facebookresearch/fastText/commit/9c9b9b2eca4fdc0182182cb69234fffbb847d820

I don't quite get what we should do (the input can be non-UTF-8 too), can you give me some advice?

UPD: a std::string is an instantiation of the std::basic_string template with char, so in this case:

std::string t = "привет";
std::cout << t.size();   // 12 (not 6)

you are right, our implementation is not compatible with FB

leezu commented 6 years ago

@menshikh-iv, essentially it is necessary to make sure something along the lines of the following Python code is used (equivalent to the C++ code you linked in the previous post). I haven't looked into the gensim Cython implementation, but at least the pure Python version is incorrect (it wrongly iterates over the characters rather than the bytes composing an ngram).


import sys
import numpy as np

def _byte_to_int(b):
    return b

if sys.version_info[0] == 2:
    # on Py2, iterating a memoryview yields length-1 strings, not ints
    def _byte_to_int(b):
        return ord(b)

# `ngrams` is a placeholder for gensim's ngram generator
word_enc = bytearray((u'<' + word + u'>').encode('utf-8'))
hashes = []
for ngram in ngrams(memoryview(word_enc)):
    h = np.uint32(2166136261)
    for c in ngram:
        # np.int8 reproduces the C++ sign extension of a signed char
        h = np.uint32(h ^ np.uint32(np.int8(_byte_to_int(c))))
        h = np.uint32(h * np.uint32(16777619))
    hashes.append(h)

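
For Python 3 only, the same hash can be written without numpy. This is a sketch of the intended behavior, not gensim's actual code; the sign-extension step mirrors the C++ `uint32_t(char)` cast discussed above:

```python
def ft_hash_bytes(s):
    """FNV-1a over the UTF-8 bytes of s, with each byte reinterpreted as a
    signed char first, matching the original fastText C++ behavior."""
    h = 2166136261
    for b in s.encode('utf-8'):
        sb = b - 256 if b >= 128 else b   # sign extension of a signed char
        h ^= sb & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF   # FNV prime, modulo 2**32
    return h

print(ft_hash_bytes("é"))  # differs from both values computed earlier in the thread
```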
aneesh-joshi commented 6 years ago

@leezu I am trying to fix these issues. What would you say is a good test condition to ensure the gensim implementation is compatible with Facebook's?

Does: fb_hash("é") == gensim_hash("é") cover all the cases?

Also: I am not used to handling Unicode strings at an advanced level, so I need help understanding. Please tell me if I understand correctly:

The current gensim hash function hashes based on ord, i.e. the Unicode code point of each character in the string, whereas Facebook hashes the underlying bytes.

This is leading to different hash values for the same strings.

Ideally, shouldn't changing the gensim hash function to operate on the encoded bytes work? Maybe not directly, because Python 2 and Python 3 iterate over byte strings differently, so on Python 2 each byte has to be converted to an int manually, as you have done in the function _byte_to_int.

Do I understand correctly?

Currently, I will try to incorporate your pseudo-code into gensim.

leezu commented 6 years ago

@aneesh-joshi I suggest you train a fastText model using the C++ implementation on a small test corpus containing Unicode symbols (i.e. a sentence or a few sentences). Then use the fastText C++ code again to print out the vectors for all words, and store these vectors. The test case should be: given the binary model, gensim must match the vectors computed by the C++ implementation exactly.
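
One way to wire that up (the `load_vec` parser below is illustrative, not gensim code; the real test would compare against `model.wv` from gensim's load_fasttext_format):

```python
import io
import numpy as np

def load_vec(fileobj):
    """Parse fastText's text .vec format: a 'count dim' header line,
    then one 'word v1 ... v_dim' line per word."""
    count, dim = map(int, fileobj.readline().split())
    vectors = {}
    for line in fileobj:
        parts = line.split()
        vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

# Tiny fake .vec content standing in for the C++ fastText output:
reference = load_vec(io.StringIO("2 3\nhello 0.1 0.2 0.3\nαα -0.1 0.0 0.1\n"))
# The actual test would then assert, for every word in the corpus:
# assert np.allclose(reference[word], model.wv[word], atol=1e-5)
```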

The hash function must operate on the bytes of the Unicode text. One Unicode character may consist of multiple bytes. This is trivial in C++ but a bit involved in Python, due to the differences between Py2 and Py3 and because Python generally abstracts the bytes away and simply handles characters.
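
A quick check of that character/byte distinction in Python 3, matching the C++ str.size() example earlier in the thread:

```python
word = "привет"
print(len(word))                   # 6 characters
print(len(word.encode("utf-8")))   # 12 bytes, which is what the C++ code iterates over
```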

You can also check the GluonNLP implementation: https://github.com/dmlc/gluon-nlp/blob/f9be6cd2c3780b3c7e11a1aca189bf8129bc0c0d/gluonnlp/vocab/subwords.py#L171-L275 The pseudo-code above is essentially a simplified version of the implementation in GluonNLP.

aneesh-joshi commented 6 years ago

@leezu @menshikh-iv @piskvorky Update: I was able to reproduce the error and fix it:

To reproduce:

I trained the C++ fastText on a simple .txt file which had Unicode and plain ASCII characters. This is the file:

This file contains some unicode sentences like привет as provided by Ivan Menshikh. I will also include some others like : é é and ααααααα
αα α ααα α
Let's hope this works.

I ran:

./fasttext skipgram -minCount 0 -input my_test_file.txt -output my_model
# This created a `my_model.bin` and `my_model.vec` file
cat my_model.vec
.
.
.
ααααααα -0.0030363 -0.0027762 . . . -0.0018107 0.001506
.
.

Then:

from gensim.models import FastText
model = FastText.load_fasttext_format("my_model.bin")
print(model.wv["ααααααα"])
array([ 1.9879919e-03,  1.7319367e-03,  6.5513700e-04,  1.2212876e-03,
         .
         .
       -2.0760519e-03,  9.5942023e-04,  1.8221042e-04,  2.6060001e-04],
      dtype=float32)

This is clearly different from the original fastText vector (it only gives the same result for non-Unicode strings).

To fix:

I checked out my branch with @leezu's suggested changes and ran the same commands as above.

array([-3.0362648e-03, -2.7762062e-03, -1.8186163e-03,  8.1895100e-04,
       .
       .
        9.9862483e-04, -1.6497656e-03, -1.8107096e-03,  1.5059930e-03],
      dtype=float32)

This gives the same result.

Current problem:

My test case is hanging / going into an infinite loop in some cases. I will investigate more. I will also check whether this works on Py2.

menshikh-iv commented 6 years ago

Nice progress @aneesh-joshi :+1:

please test it as fully as possible

Also, I'm worried that we possibly generate the ngrams incorrectly too (for the same reason), please check this as well. And one last thing: have a look at how this affects performance (compared to the current, wrong variant, just to know); maybe it's better to encode the full string once instead of each ngram independently.

keep me updated!
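
Regarding the encode-once idea in the last comment: encoding the full string once and slicing it at character boundaries yields the same bytes as encoding each character-level ngram separately. A quick sanity check (simplified, not gensim code):

```python
def char_ngrams(word, n):
    """All character-level ngrams of length n (simplified; real fastText
    also adds the < and > boundary markers and a range of lengths)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

word = "<αβ>"
per_ngram = [g.encode("utf-8") for g in char_ngrams(word, 2)]

# Byte offset of each character inside the single UTF-8 encoding:
offsets, pos = [], 0
for ch in word:
    offsets.append(pos)
    pos += len(ch.encode("utf-8"))
offsets.append(pos)

full = word.encode("utf-8")
from_full = [full[offsets[i]:offsets[i + 2]] for i in range(len(word) - 1)]
assert per_ngram == from_full  # encode-once and per-ngram encoding agree
```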