nathanshartmann / portuguese_word_embeddings

Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks
GNU General Public License v3.0

Invalid words in similarity results #7

Closed tramontini closed 1 year ago

tramontini commented 5 years ago

Hello! Some similarity queries return invalid tokens: model_glove.most_similar("folha") returns ['s.paulo', 'reportagem', 'jornal', 'colunista', 'fhc', 'coluna', 'folhas', "'", 'entrevista', 'apurou']. Here the result includes an invalid token, '.

The same occurs with model_glove.most_similar("folha"), which returns [ "pré-seminário", "slamdance", "gressy", "março.", "audiãªncias", "biłgorajski", "vietnã£.", "direkt", "havaã\u00ad.", "clasico"]

Do you think some preprocessing step is missing, or is it something else?

File: glove_s300.txt

Regards!

nathanshartmann commented 5 years ago

We dealt with tons of data from several sources, and we were not able to ensure that all of it was encoded as utf-8, for example. That's why you can find so many strange tokens. The same occurs in English embedding models, even though English is simpler than Portuguese.

I intend to extend our models with more corpora and also a more robust preprocessing step to remove bad data. Unfortunately, it's not going to happen soon :(

For now, I suggest filtering out very short similar tokens, such as those whose length is 1. Also, a simple way to remove strange tokens is to try converting them to utf-8 (for example, with Python's built-in encoding support). It will raise an error when a token has the "wrong encoding".
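
A minimal sketch of such a filter. It swaps in a different technique from the encoding check: in Python 3 a token that has already been decoded into a str almost never fails to encode as UTF-8, so this sketch uses a character allowlist instead. The names VALID_TOKEN, is_valid_token and filter_similar are hypothetical, and the allowed alphabet (lowercase Portuguese letters, optionally joined by hyphens) is an assumption that may be too strict for some uses.

```python
import re
import unicodedata

# Assumption: a valid token is made of lowercase Portuguese letters,
# optionally joined by single hyphens. Adjust the alphabet as needed.
LETTERS = "a-záàâãéêíóôõúüç"
VALID_TOKEN = re.compile(rf"[{LETTERS}]+(?:-[{LETTERS}]+)*")


def is_valid_token(token, min_length=2):
    """Return True if the token is long enough and matches the allowlist."""
    # Normalize so that accents stored as combining marks still match.
    token = unicodedata.normalize("NFC", token)
    return len(token) >= min_length and VALID_TOKEN.fullmatch(token) is not None


def filter_similar(results, min_length=2):
    """Drop invalid tokens from a list of tokens or (token, score) pairs."""
    kept = []
    for item in results:
        token = item[0] if isinstance(item, tuple) else item
        if is_valid_token(token, min_length):
            kept.append(item)
    return kept


if __name__ == "__main__":
    raw = ["s.paulo", "reportagem", "jornal", "'", "folhas", "audiãªncias"]
    print(filter_similar(raw))
```

Because the filter accepts (token, score) pairs, it could be applied directly to the output of most_similar, requesting a larger topn beforehand so that enough results survive. Note that the allowlist also rejects legitimate tokens such as "s.paulo" or foreign names with other alphabets.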

ruanchaves commented 4 years ago

Word and sentence embeddings usually produce more false positives than BM25. As has been suggested elsewhere (and also in this paper), you can generate a set of candidates with BM25 and then use the word embeddings to rerank them.
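
A rough sketch of that two-stage idea, assuming tokenized documents and a plain dict mapping tokens to vectors. It is not taken from the repository or from the paper mentioned above: the function names are hypothetical, BM25 is a small pure-Python Okapi variant with common default parameters (k1=1.5, b=0.75), and averaging word vectors is just one simple way to represent a text.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(tokens, vectors):
    """Average the vectors of known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return [sum(col) / len(known) for col in zip(*known)]


def retrieve_and_rerank(query_tokens, docs_tokens, vectors, top_k=10):
    """BM25 selects up to top_k matching documents; embeddings reorder them."""
    scores = bm25_scores(query_tokens, docs_tokens)
    matching = [i for i, s in enumerate(scores) if s > 0]
    candidates = sorted(matching, key=lambda i: scores[i], reverse=True)[:top_k]
    q_vec = mean_vector(query_tokens, vectors)
    if q_vec is None:
        return candidates  # no embedding for the query: keep BM25 order

    def embedding_score(i):
        d_vec = mean_vector(docs_tokens[i], vectors)
        return cosine(q_vec, d_vec) if d_vec is not None else 0.0

    return sorted(candidates, key=embedding_score, reverse=True)
```

The point of the design is that a document must share at least one term with the query to become a candidate, so a noisy nearest neighbour in the embedding space cannot surface on its own; the embeddings only decide the order among documents BM25 already matched.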