juditacs / semeval

MathLing Budapest Team's repo
MIT License

Twitter embedding #30

Open juditacs opened 9 years ago

juditacs commented 9 years ago

There is no large, freely available Twitter corpus (or, if there is one, it is very hard to find). Smaller annotated corpora are available, but they are too small for building an embedding.

However, the Rovereto Twitter N-Gram Corpus is available here: http://clic.cimec.unitn.it/amac/twitter_ngram/

I downloaded it and am now trying to build an embedding based on its 6-grams. Currently, all 6-grams with a frequency count below 50 are discarded, although this threshold may be too strict.
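For reference, the frequency filter described above could look roughly like the sketch below. It is not the code used in this repo. It assumes each input line has the form `w1 w2 ... w6<TAB>count`, which may not match the actual layout of the Rovereto files, and the names `filter_ngrams` and `MIN_COUNT` are made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# Threshold from the comment above: 6-grams seen fewer than 50 times are dropped.
MIN_COUNT = 50


def filter_ngrams(
    lines: Iterable[str], min_count: int = MIN_COUNT, n: int = 6
) -> Iterator[Tuple[List[str], int]]:
    """Yield (tokens, count) for every n-gram with count >= min_count.

    ASSUMPTION: each line is "w1 w2 ... wn<TAB>count". Adjust the parsing
    if the real corpus files use a different layout.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so the count is separated from the n-gram.
        ngram, sep, count_str = line.rpartition("\t")
        if not sep:
            continue  # no tab found: not in the assumed format
        try:
            count = int(count_str)
        except ValueError:
            continue  # count field is not an integer
        tokens = ngram.split()
        if len(tokens) != n or count < min_count:
            continue
        yield tokens, count


if __name__ == "__main__":
    sample = [
        "a b c d e f\t120\n",
        "g h i j k l\t49\n",
        "m n o p q r\t50\n",
    ]
    for tokens, count in filter_ngrams(sample):
        print(count, " ".join(tokens))
```

Keeping `min_count` as a parameter makes it easy to rerun with a lower threshold and compare how many 6-grams survive, which is one way to check whether 50 is too strict.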