lintool / twitter-tools

Twitter Tools
twittertools.cc

Implement service to return term counts #24

Open lintool opened 11 years ago

lintool commented 11 years ago

We need a service to return term counts within a certain time interval. Need to decide:

  1. Actual implementation (separate service? squeeze into current service?)
  2. Granularity?
  3. Just unigrams? Arbitrary n-grams as well?
  4. Impact on efficiency?

lintool commented 11 years ago

We might not even need a service:

If we stored term counts this way: termid -> [ vector of counts... ], we could definitely post the file publicly and distribute the term-to-termid mapping separately.

The format would be pretty much identical to the Google Books datasets: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
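A rough sketch of how a client might read such a file, assuming one tab-separated line per term with space-separated per-interval counts. The line layout, the example termids, and the function names are all hypothetical; nothing here is an agreed format.

```python
from typing import Dict, Iterable, List, Tuple


def parse_counts_line(line: str) -> Tuple[int, List[int]]:
    """Parse one 'termid<TAB>c0 c1 c2 ...' line (hypothetical layout)."""
    termid_str, counts_str = line.rstrip("\n").split("\t")
    return int(termid_str), [int(c) for c in counts_str.split()]


def load_counts(lines: Iterable[str]) -> Dict[int, List[int]]:
    """Build termid -> [vector of counts], skipping blank lines."""
    return dict(parse_counts_line(line) for line in lines if line.strip())


# The term -> termid mapping would be distributed separately (made-up ids).
term_to_id = {"obama": 17, "earthquake": 42}

# Stand-in for the lines of the public counts file.
counts = load_counts(["17\t3 0 5 1\n", "42\t0 0 12 7\n"])

earthquake_counts = counts[term_to_id["earthquake"]]
```

Keeping the mapping in a separate file means the counts file itself contains no term strings, only integer ids and counts.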

amjedbj commented 11 years ago

We need to return for each term:

stewhdcs commented 11 years ago

It would probably be useful to do this for unigrams and bigrams. The size of the file could be reduced by filtering out terms with low frequency, either over the whole collection or per 'bucket' period.

We can specify buckets every N hours from the start of the corpus. N = 4/6/12 hours would probably be more than enough. With a smaller-than-necessary interval, people can still easily aggregate intervals together as needed, using integer division on the bucket offset.
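A minimal sketch of the bucketing and aggregation idea, assuming 4-hour base buckets. The corpus start date, the constant names, and both function names are placeholders for illustration, not values anyone has agreed on.

```python
from datetime import datetime, timezone
from typing import List

BUCKET_HOURS = 4  # assumed base granularity
CORPUS_START = datetime(2013, 2, 1, tzinfo=timezone.utc)  # placeholder start


def bucket_offset(ts: datetime, hours: int = BUCKET_HOURS) -> int:
    """Index of the N-hour bucket containing ts, counted from CORPUS_START."""
    elapsed_seconds = int((ts - CORPUS_START).total_seconds())
    return elapsed_seconds // (hours * 3600)


def aggregate(counts: List[int], factor: int) -> List[int]:
    """Merge every `factor` consecutive buckets into one coarser bucket.

    The coarse index is the fine index integer-divided by `factor`.
    """
    coarse = [0] * ((len(counts) + factor - 1) // factor)
    for i, c in enumerate(counts):
        coarse[i // factor] += c
    return coarse


# 09:30 on the first day falls into 4-hour bucket 2 (08:00 to 12:00).
offset = bucket_offset(datetime(2013, 2, 1, 9, 30, tzinfo=timezone.utc))

# Seven 4-hour buckets merged into 12-hour buckets (factor 3).
twelve_hourly = aggregate([1, 2, 3, 4, 5, 6, 7], 3)
```

Aggregation only works cleanly when the coarse interval is a multiple of the base interval, which is one argument for picking a small N.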

We would also need the background model of document frequencies in each bucket so we can compute term probabilities as well.
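For example, with per-bucket document frequencies for a term and the per-bucket document totals as the background model, a simple maximum-likelihood estimate could be computed as below. This is only a sketch: the function name and inputs are hypothetical, and no smoothing is applied.

```python
from typing import List


def term_probabilities(term_df: List[int], bucket_docs: List[int]) -> List[float]:
    """Estimate P(term | bucket) = df(term, bucket) / docs(bucket).

    Buckets with no documents get probability 0.0 instead of dividing by zero.
    """
    if len(term_df) != len(bucket_docs):
        raise ValueError("term and background vectors must cover the same buckets")
    return [df / n if n else 0.0 for df, n in zip(term_df, bucket_docs)]


# Term seen in 2, 0, and 5 tweets; buckets hold 10, 0, and 20 tweets in total.
probs = term_probabilities([2, 0, 5], [10, 0, 20])
```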

amjedbj commented 11 years ago

What about tweet and term statistics of the current index? Some IR baselines (e.g. Okapi BM25) require collection statistics such as average tweet length. This is a non-exhaustive list of index stats:

Some of this data is not reproducible on the client side unless the same tokenizer and stemmer are used.
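To illustrate why these statistics matter, here is a sketch of one common variant of the BM25 per-term score. It needs the collection size, the term's document frequency, and the average document length, none of which a client can derive from a single tweet. The function name and the default k1 and b values are assumptions, not part of any existing API here.

```python
import math


def bm25_term_score(tf: int, df: int, doc_len: int, avg_doc_len: float,
                    num_docs: int, k1: float = 1.2, b: float = 0.75) -> float:
    """Score one query term against one document (one common BM25 variant)."""
    # IDF with +1 inside the log so it never goes negative.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Length normalization depends on the collection-wide average length.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm


# Same term frequency, but the longer tweet is penalized.
short_score = bm25_term_score(tf=1, df=10, doc_len=8, avg_doc_len=12.0, num_docs=1000)
long_score = bm25_term_score(tf=1, df=10, doc_len=20, avg_doc_len=12.0, num_docs=1000)
```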

I defined some Thrift structs for data encoding. Optional fields must be implemented on client side. (see https://github.com/amjedbj/twitter-tools/blob/prototype-lintool/src/main/thrift/twittertools.thrift)

What do you think?

Latifa-AlMarri commented 11 years ago

I went through the API in the Git repository and couldn't find any code to obtain collection statistics (e.g. term tf, term idf, etc.).

Any help?

milesefron commented 11 years ago

We are integrating these items into the API currently. They should be included soon.
-Miles
