omitevski opened 8 years ago
I'm going to go with the Twitter vocabulary. Is that okay? 2.2M vocab might be a bit too excessive... :D
Ok, try the Twitter one first. Keep in mind that only a fraction of those words are English; the vocabulary contains many other languages.
@omitevski I tried the Twitter one, but I'm having difficulty determining whether a word is English or not. It also contains words such as "fridayfeeling" from the hashtag #fridayfeeling. Shall I still use fuzzy matching? It might be worth using an outside corpus such as http://www.anc.org/data/masc/downloads/data-download/ for spell checking, and then getting the vectors from the GloVe data. Or we can just attach the vectors without spell checking.
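Something along these lines is what I had in mind; just a rough sketch, where `masc_words.txt` is a placeholder name for a word list pulled from the MASC data:

```python
import difflib

# Hypothetical file: one English word per line, extracted from the MASC corpus.
with open('masc_words.txt') as f:
    english_words = set(line.strip().lower() for line in f)

def is_english(token, cutoff=0.9):
    """Accept a token if it matches an English word exactly or closely (fuzzy)."""
    token = token.lower()
    if token in english_words:
        return True
    # difflib scans the whole word list, so this is slow for big sets -- illustration only.
    return bool(difflib.get_close_matches(token, english_words, n=1, cutoff=cutoff))
```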
Ok try that.
For the hashtags, we might need to treat them separately somehow. Some carry a lot of information, others are simply used improperly.
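For example, a simple first step (just a sketch) could be to pull them out before tokenizing and keep them as separate features:

```python
import re

def split_hashtags(tweet):
    """Separate hashtags from the rest of the tweet text."""
    hashtags = re.findall(r'#(\w+)', tweet)
    text = re.sub(r'#\w+', '', tweet).strip()
    return text, hashtags

# split_hashtags("great start to the day #fridayfeeling")
# -> ('great start to the day', ['fridayfeeling'])
```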
I'm using spaCy for tokenizing the text. It seems very good. The package also provides precomputed word vectors in addition to the tokenizer.
```python
import spacy
import pandas as pd

# df.body contains the tweets from our sample, one tweet per row.
nlp = spacy.load('en')
text = " ".join(df.body)
doc = nlp(text)

# Collect every vocabulary entry that has a precomputed vector.
all_words = pd.DataFrame([(w.orth_, w.vector, w.has_vector) for w in doc.vocab])
all_words.columns = ['word', 'wvec', 'has_vec']
all_words = all_words[all_words.has_vec]
```
That gives us the vectors to use as the word representations.
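For instance (a sketch using the `all_words` table from above), a simple per-tweet representation could just average the token vectors:

```python
import numpy as np

# Word -> vector lookup built from the all_words table above.
wvec = dict(zip(all_words.word, all_words.wvec))

def tweet_vector(tweet):
    """Represent a tweet as the mean of its known token vectors (a simple baseline)."""
    vecs = [wvec[t.orth_] for t in nlp(tweet) if t.orth_ in wvec]
    return np.mean(vecs, axis=0) if vecs else None
```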
http://nlp.stanford.edu/projects/glove/ is the source; we can use either the Common Crawl vectors (840B tokens, 2.2M vocab, cased, 300d) or the Twitter vectors (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d/50d/100d/200d).
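If we go with the downloaded GloVe files instead of spaCy's built-in vectors, loading one into a dict is straightforward (a sketch; the 100d filename below is from the Twitter download):

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe .txt file into a word -> vector dict."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return vectors

glove = load_glove('glove.twitter.27B.100d.txt')
```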