stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Parsing Rules for the Glove.42B.300D #200

Open hontimzam opened 2 years ago

hontimzam commented 2 years ago

Hello, I am Tim. I have some questions about the pre-trained vectors in glove.42B.300d.txt.

I am working on some text that I would like to convert to vectors via glove.42B.300d.txt in Python. I built my own parsing rules/tokenizer (via the spaCy library), but the tokens my rules produce do not always match the words/vocabulary in the GloVe file.
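One way to quantify the mismatch is to load the GloVe vocabulary and measure token coverage directly. A minimal sketch (the helper names are my own, not part of GloVe; the path is a placeholder):

```python
def load_glove_vocab(path):
    """Collect the vocabulary from a GloVe text file.

    Each line of the file is: word followed by its vector components,
    all space-separated, so the word is the first field.
    """
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            vocab.add(line.split(" ", 1)[0])
    return vocab

def coverage(tokens, vocab):
    """Fraction of tokens that have an entry in the GloVe vocabulary."""
    hits = sum(1 for t in tokens if t in vocab)
    return hits / len(tokens) if tokens else 0.0

# Usage (path is hypothetical):
# vocab = load_glove_vocab("glove.42B.300d.txt")
# print(coverage(my_tokens, vocab))
```

Printing the missed tokens (those not in `vocab`) makes it easy to see which classes of mismatch dominate.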

For example in some texts: "New York is a big city and there are many stores. The items in the stores are non-expensive. There are 5.5-billion peoples in the world"

After applying my parsing rules: "new york", "is", "a", "big", "city", "and", "there", "are", "many", "stores", "the", "items", "in", "the", "stores", "are", "non", "expensive", "there", "are", "5.5 billion", "peoples", "in", "the", "world".

However, in glove.42B.300d.txt: 1) there is no "new york", but there is "new-york"; 2) it contains "non" and "expensive", but also "non-expensive" (which has a different vector); 3) even though hyphenated forms exist, there is no "5.5-billion", while "9.5-billion", "4.5-billion", etc. do appear; 4) there are other similar mismatch cases.
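For hyphenated tokens like "5.5-billion" that happen to be missing from the vocabulary, a common workaround (my suggestion, not something GloVe itself prescribes) is to fall back to the parts of the token and average their vectors:

```python
import numpy as np

def lookup_with_fallback(token, vectors):
    """Return the vector for `token` from the word -> vector mapping `vectors`.

    If the token is absent, split it on hyphens and whitespace and average
    the vectors of the parts that are present; return None if nothing matches.
    """
    if token in vectors:
        return vectors[token]
    parts = [p for p in token.replace("-", " ").split() if p in vectors]
    if parts:
        return np.mean([vectors[p] for p in parts], axis=0)
    return None
```

Averaging component vectors is only an approximation of the compound's meaning, but it keeps coverage from dropping to zero on rare hyphenations.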

As a result, only 65% of my tokens are covered by the vocabulary. This is not because anything is wrong with the dictionary; my parsing rules are simply not good enough. The question is: how can I modify the parsing rules so that the tokens fit the dictionary well? Are there already some existing parsing rules for this?

I have tried to look at the words and fix the issues case by case. However, I cannot guarantee that new exceptional cases won't appear in the future...
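Rather than patching cases one by one, the hyphen/space mismatches described above can be handled by one generic variant-matching rule: try the token as-is (lowercased), then a hyphen-joined form, then fall back to its individual parts. A sketch (the function name is hypothetical, and the exact variant order is a design choice):

```python
def match_variants(token, vocab):
    """Return the list of vocabulary words to use for `token` (may be empty).

    Tries the most specific spelling first:
      1. the lowercased token as-is (e.g. "non-expensive"),
      2. a hyphen-joined form (e.g. "new york" -> "new-york"),
      3. the individual hyphen/space-separated parts that are in the vocab.
    """
    t = token.lower()
    for cand in (t, t.replace(" ", "-")):
        if cand in vocab:
            return [cand]
    parts = t.replace("-", " ").split()
    return [p for p in parts if p in vocab]
```

With this rule, "new york" resolves to "new-york", "non-expensive" stays a single token, and "5.5-billion" degrades gracefully to "5.5" and "billion" instead of being dropped, so new compounds do not need case-by-case fixes.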