omitevski opened 8 years ago
I'm going to go with the Twitter vocabulary. Is that okay? 2.2M vocab might be a bit too excessive... :D
Ok, try the Twitter one first. Keep in mind that only a fraction of those words are English; the vocabulary contains many other languages.
@omitevski I tried the Twitter one, but I'm having difficulty determining whether a word is English or not. It also contains words such as "fridayfeeling" from the hashtag #fridayfeeling. Shall I still use fuzzy matching? It might be worth using an outside corpus such as http://www.anc.org/data/masc/downloads/data-download/ for spell checking, and then getting the vectors from the GloVe data. Or we can just attach the vectors without spell checking.
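Something along these lines is what I had in mind; just a rough sketch, where `masc_words.txt` is a placeholder name for a word list pulled from the MASC data:

```python
import difflib

# Hypothetical file: one English word per line, extracted from the MASC corpus.
with open('masc_words.txt') as f:
    english_words = set(line.strip().lower() for line in f)

def is_english(token, cutoff=0.9):
    """Accept a token if it matches an English word exactly or closely (fuzzy)."""
    token = token.lower()
    if token in english_words:
        return True
    # difflib scans the whole word list, so this is slow for big sets -- illustration only.
    return bool(difflib.get_close_matches(token, english_words, n=1, cutoff=cutoff))
```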
Ok try that.
For the hashtags, we might need to treat them separately somehow. Some carry a lot of information, others are simply used improperly.
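For example, a simple first step (just a sketch) could be to pull them out before tokenizing and keep them as separate features:

```python
import re

def split_hashtags(tweet):
    """Separate hashtags from the rest of the tweet text."""
    hashtags = re.findall(r'#(\w+)', tweet)
    text = re.sub(r'#\w+', '', tweet).strip()
    return text, hashtags

# split_hashtags("great start to the day #fridayfeeling")
# -> ('great start to the day', ['fridayfeeling'])
```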
I'm using spaCy for tokenizing the text. It seems very good. The package also provides precomputed word vectors in addition to the tokenizer.
```python
import spacy
import pandas as pd

# df.body contains the tweets from our sample, one tweet per row.
nlp = spacy.load('en')
text = " ".join(df.body)
doc = nlp(text)

# Collect every vocabulary entry that has a precomputed vector.
all_words = pd.DataFrame([(w.orth_, w.vector, w.has_vector) for w in doc.vocab])
all_words.columns = ['word', 'wvec', 'has_vec']
all_words = all_words[all_words.has_vec]
```
That gives us the vectors to use as the word representations.
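For instance (a sketch using the `all_words` table from above), a simple per-tweet representation could just average the token vectors:

```python
import numpy as np

# Word -> vector lookup built from the all_words table above.
wvec = dict(zip(all_words.word, all_words.wvec))

def tweet_vector(tweet):
    """Represent a tweet as the mean of its known token vectors (a simple baseline)."""
    vecs = [wvec[t.orth_] for t in nlp(tweet) if t.orth_ in wvec]
    return np.mean(vecs, axis=0) if vecs else None
```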
http://nlp.stanford.edu/projects/glove/ is the source; we can use either the Common Crawl vectors (840B tokens, 2.2M vocab, cased, 300d) or the Twitter vectors (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d/50d/100d/200d).
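If we go with the downloaded GloVe files instead of spaCy's built-in vectors, loading one into a dict is straightforward (a sketch; the 100d filename below is from the Twitter download):

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe .txt file into a word -> vector dict."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return vectors

glove = load_glove('glove.twitter.27B.100d.txt')
```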