Open lintool opened 11 years ago
We might not even need a service:
If we stored term counts in this way: termid -> [ vector of counts... ], we could definitely post the file publicly and distribute the term-to-termid mapping separately.
The format would be pretty much identical to the Google Books datasets: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
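To make the proposed layout concrete, here is a minimal sketch. It assumes a tab-separated file with one line per term id followed by that term's per-bucket counts; the file layout, the function names, and the sample data are all illustrative assumptions, not an agreed format.

```python
from typing import Dict, Iterable, List, Tuple


def parse_counts_line(line: str) -> Tuple[int, List[int]]:
    """Parse 'termid<TAB>c0<TAB>c1...' into (termid, [c0, c1, ...])."""
    fields = line.rstrip("\n").split("\t")
    return int(fields[0]), [int(c) for c in fields[1:]]


def load_counts(lines: Iterable[str]) -> Dict[int, List[int]]:
    """Build the termid -> vector-of-counts map from the public file."""
    counts: Dict[int, List[int]] = {}
    for line in lines:
        if line.strip():  # skip blank lines
            termid, vector = parse_counts_line(line)
            counts[termid] = vector
    return counts


# Hypothetical data: the mapping would be distributed separately from the counts.
term_to_id = {"obama": 0, "snow": 1}
public_lines = ["0\t3\t0\t7\n", "1\t1\t1\t0\n"]

counts = load_counts(public_lines)
snow_counts = counts[term_to_id["snow"]]
```

Without the mapping, the public file only exposes integer ids and counts, which is the point of distributing the two pieces separately.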
We need to return for each term:
It would probably be useful to do this for unigrams and bigrams. The size of the file could be reduced by filtering out terms with low frequencies over the whole collection, or per 'bucket' period.
We can specify buckets every N hours from the start of the corpus. N = 4/6/12 hours would probably be more than enough. With a smaller-than-necessary interval, people can at least aggregate intervals together as necessary using integer division on the bucket offset.
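The bucketing and the integer-division aggregation could look roughly like this. It is only a sketch: `CORPUS_START`, `bucket_index`, and `aggregate` are hypothetical names, and the start date is a placeholder rather than the real start of the corpus.

```python
from datetime import datetime, timedelta, timezone
from typing import List

# Placeholder start of the corpus (assumption for illustration only).
CORPUS_START = datetime(2013, 1, 1, tzinfo=timezone.utc)


def bucket_index(ts: datetime, corpus_start: datetime, hours: int) -> int:
    """Offset of the N-hour bucket that contains timestamp ts."""
    return (ts - corpus_start) // timedelta(hours=hours)


def aggregate(counts: List[int], factor: int) -> List[int]:
    """Merge every `factor` consecutive buckets into one coarser bucket."""
    coarse = [0] * ((len(counts) + factor - 1) // factor)
    for offset, count in enumerate(counts):
        coarse[offset // factor] += count  # integer division on the bucket offset
    return coarse


# A tweet at 13:30 on the first day falls into 4-hour bucket 3 (12:00-16:00).
example_bucket = bucket_index(
    datetime(2013, 1, 1, 13, 30, tzinfo=timezone.utc), CORPUS_START, hours=4
)

# Seven 4-hour buckets collapse into three 12-hour buckets (factor 3).
twelve_hour_counts = aggregate([1, 2, 3, 4, 5, 6, 7], factor=3)
```

Publishing the finest interval keeps this aggregation on the client side; the reverse (splitting coarse buckets) is not possible.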
We would also need the background model of document frequencies in each bucket so we can compute term probabilities as well.
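As a sketch of how the background model would be used, assume it is a vector with one total per bucket (for example, the number of tweets in that bucket if the term vectors hold document frequencies). The function name and that interpretation of the background vector are assumptions.

```python
from typing import List


def term_probabilities(term_counts: List[int], bucket_totals: List[int]) -> List[float]:
    """Per-bucket probability of a term: its count divided by the bucket total."""
    if len(term_counts) != len(bucket_totals):
        raise ValueError("term vector and background vector must have the same length")
    return [
        count / total if total > 0 else 0.0  # empty buckets get probability 0
        for count, total in zip(term_counts, bucket_totals)
    ]


probs = term_probabilities([2, 0, 5], [10, 0, 20])
```

Raw counts alone cannot distinguish a bursty term from a busy period, which is why the per-bucket totals need to ship alongside the term vectors.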
What about tweet and term statistics of the current index? Some IR baselines require collection statistics such as average tweet length (e.g. Okapi BM25). This is a non-exhaustive list of index stats:
Some of this data is not reproducible on the client side unless the same tokenizer and stemmer are used.
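To show why a baseline such as Okapi BM25 depends on these statistics, here is a sketch of a per-term BM25 score. It uses one common variant of the formula (the non-negative IDF with the usual defaults k1 = 1.2 and b = 0.75); the function name and parameter names are illustrative and do not correspond to anything in the current API.

```python
import math


def bm25_term_score(tf: int, df: int, doc_len: int, num_docs: int,
                    avg_doc_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 contribution of one query term to one tweet.

    tf: term frequency in the tweet; df: number of tweets containing the term;
    num_docs and avg_doc_len are the collection statistics the index must expose.
    """
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


# Hypothetical numbers: a term occurring once in a 12-token tweet.
score = bm25_term_score(tf=1, df=10, doc_len=12, num_docs=1000, avg_doc_len=11.0)
```

`df`, `num_docs`, and `avg_doc_len` all depend on how the index tokenizes and stems tweets, so a client cannot recompute them reliably on its own.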
I defined some Thrift structs for data encoding. Optional fields must be implemented on the client side (see https://github.com/amjedbj/twitter-tools/blob/prototype-lintool/src/main/thrift/twittertools.thrift).
What do you think?
I went through the API in the Git repository and I couldn't find any code to obtain collection statistics (for example: term tf, term idf, etc.).
Any Help?
We are integrating these items into the API currently. They should be included soon.
-Miles
On Jun 26, 2013, at 18:46, Latifa notifications@github.com wrote:
I went through the API in the GIT repository and I couldn’t find a code to obtain collection statistics (Example: Term tf, Term idf .. etc)
Any Help?
We need a service to return time counts within a certain interval. Need to decide: