As far as I can tell, there is no large, freely available Twitter corpus (or it is very hard to find). Smaller annotated corpora exist, but they are too small for building an embedding.
However, the Rovereto Twitter N-Gram Corpus is available here: http://clic.cimec.unitn.it/amac/twitter_ngram/
I downloaded it and am now trying to build an embedding based on its 6-grams. Currently, all 6-grams with a frequency count below 50 are discarded, although this threshold may be too strict.
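
For reference, here is a minimal sketch of the filtering step, plus a quick way to check whether 50 is too strict. It assumes each line of the 6-gram file looks like `w1 w2 w3 w4 w5 w6<TAB>count`, which is a guess on my part and should be checked against the actual corpus files. The function names (`parse_ngrams`, `filter_by_count`, `retained_mass`) are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple

NGram = Tuple[Tuple[str, ...], int]


def parse_ngrams(lines: Iterable[str]) -> Iterator[NGram]:
    """Yield (tokens, count) pairs from 'w1 ... w6<TAB>count' lines.

    The line format is an assumption, not taken from the corpus documentation.
    """
    for line in lines:
        text, sep, count = line.rstrip("\n").rpartition("\t")
        if not sep or not count.strip().isdigit():
            continue  # skip blank or malformed lines
        yield tuple(text.split()), int(count)


def filter_by_count(ngrams: Iterable[NGram], min_count: int = 50) -> Iterator[NGram]:
    """Keep n-grams whose count is at least min_count (drop those below it)."""
    return (item for item in ngrams if item[1] >= min_count)


def retained_mass(ngrams: Iterable[NGram], thresholds: Iterable[int]) -> Dict[int, float]:
    """For each threshold, return the fraction of total n-gram occurrences kept."""
    counts = [count for _, count in ngrams]  # holds all counts in memory
    total = sum(counts)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {t: sum(c for c in counts if c >= t) / total for t in thresholds}


if __name__ == "__main__":
    # Tiny hand-made sample, not real corpus data.
    sample = [
        "a b c d e f\t120",
        "a b c d e g\t49",
        "b c d e f g\t50",
        "x y z w v u\t3",
    ]
    parsed = list(parse_ngrams(sample))
    for tokens, count in filter_by_count(parsed, min_count=50):
        print(" ".join(tokens), count)
    print(retained_mass(parsed, [10, 50, 100]))
```

Running `retained_mass` over the real file with a few candidate thresholds shows how much of the data each cutoff throws away, which is one way to judge whether 50 is too aggressive.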