stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

mistakes in ruby tweet tokenizer #124

Closed skondrashov closed 6 years ago

skondrashov commented 6 years ago

This isn't in the repo, but I'm not sure where else to put this. This tokenizer: https://nlp.stanford.edu/projects/glove/preprocess-twitter.rb has at least one clear mistake in it... if this generates any interest I can share a fully debugged version of the tokenizer, but I'll go over this one mistake for now:

.gsub(/#{eyes}#{nose}[)d]+|[)d]+#{nose}#{eyes}/i, "<SMILE>")

The second alternative in this line matches )-: rather than (-:. Because this substitution runs before the <SADFACE> section, every inverted sad face is tokenized as <SMILE>, while an inverted smile such as (-: is not matched by this line at all. I'm not sure whether the pretrained vectors were made with this mistake, and even if they were it probably doesn't affect much, but mistakes in the tokenizer can propagate through the whole algorithm, so it's worrying to see them.
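To make the effect concrete, here is a minimal sketch that isolates just the two face substitutions. The `eyes` and `nose` patterns and the <SADFACE> regex are my assumptions about what the linked script contains (they are not quoted in this issue), and `tag_faces_original` is a made-up helper name; the real tokenizer does many more substitutions around these two.

```ruby
# Hypothetical helper: applies only the <SMILE> and <SADFACE> steps,
# in the same order as the tokenizer. Not the full script.
def tag_faces_original(text)
  eyes = "[8:=;]"       # assumed definition
  nose = "['`\\-]?"     # assumed definition
  text
    # The line quoted above, unchanged.
    .gsub(/#{eyes}#{nose}[)d]+|[)d]+#{nose}#{eyes}/i, "<SMILE>")
    # Assumed form of the later <SADFACE> step.
    .gsub(/#{eyes}#{nose}\(+|\)+#{nose}#{eyes}/, "<SADFACE>")
end

puts tag_faces_original(":-)")  # <SMILE>
puts tag_faces_original(")-:")  # <SMILE>, although this is an inverted sad face
puts tag_faces_original("(-:")  # (-: is left untagged by these two steps
puts tag_faces_original(":-(")  # <SADFACE>
```

Because the <SMILE> step has already consumed )-: by the time the <SADFACE> step runs, the second alternative of the <SADFACE> pattern can never see it.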

For this line specifically, I recommend:

.gsub(/#{eyes}#{nose}[)D]+|\(+#{nose}#{eyes}/, "<SMILE>")

Getting rid of the case insensitivity and adding a capital D to only the left-to-right smiley makes sense here. :-d and D-: are hardly smiles!
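As a quick check of the recommended line, here is a small sketch that runs it ahead of a <SADFACE> step. As before, the `eyes` and `nose` patterns and the <SADFACE> regex are assumptions about the linked script rather than quotes from it, and `tag_faces_fixed` is a hypothetical helper name.

```ruby
# Hypothetical helper: the recommended <SMILE> line followed by an
# assumed <SADFACE> step. Only these two substitutions are applied.
def tag_faces_fixed(text)
  eyes = "[8:=;]"       # assumed definition
  nose = "['`\\-]?"     # assumed definition
  text
    # Recommended line: case-sensitive, capital D only on the left-to-right form,
    # and "(" instead of ")" for the inverted smile.
    .gsub(/#{eyes}#{nose}[)D]+|\(+#{nose}#{eyes}/, "<SMILE>")
    # Assumed form of the later <SADFACE> step.
    .gsub(/#{eyes}#{nose}\(+|\)+#{nose}#{eyes}/, "<SADFACE>")
end

puts tag_faces_fixed(":-)")  # <SMILE>
puts tag_faces_fixed(":-D")  # <SMILE>
puts tag_faces_fixed("(-:")  # <SMILE>
puts tag_faces_fixed(")-:")  # <SADFACE>
puts tag_faces_fixed(":-d")  # :-d is left untagged
puts tag_faces_fixed("D-:")  # D-: is left untagged
```

Note that this changes which tokens the tokenizer emits, so its output would no longer match whatever was used to train the pretrained vectors if they were built with the original line.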

I feel silly posting issues about smiley tokenization, but people who use the pretrained Twitter vectors have to run this tokenizer to get matching results, and it's not clear whether they should fix the tokenizer or accept that the pretrained vectors are (slightly) wrong and keep the broken tokenizer to match. As far as I understand, the pretrained vectors make the most sense with the linked tokenizer, because it was the one used during training (strong assumption, please correct me if I'm wrong). If this generates interest and this is the right place to open the issue, I can post a fully corrected regex with the list of issues I've found, as well as a Python version, which I'm sure would be helpful to someone.

skondrashov commented 6 years ago

I realized there are other open issues for this, will add this as a comment instead.