lintool / twitter-tools

Twitter Tools
twittertools.cc

What fields do you need to replicate your run? #26

Closed amjedbj closed 11 years ago

amjedbj commented 11 years ago

For the moment, only a few fields are available through the search API:

What fields do you need to replicate your run?

amjedbj commented 11 years ago

Personally, I need the following fields (sorted by importance):

1- tweet.lang (if available)
2- tweet.retweet_status.id (if retweeted)
3- tweet.retweet_status.user.screenname (if retweeted)
4- tweet.in_reply_to_status_id (if reply)
5- tweet.in_reply_to_screen_name (if reply)
6- tweet.user.followers_count
7- tweet.user.friends_count
8- tweet.retweet_status.retweet_count (if retweeted)
9- tweet.coordinates (lat/long) (if available)
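For illustration, the fields listed above could be pulled out of a parsed tweet roughly as follows. This is only a sketch: it assumes the JSON key names used by the Twitter REST API v1.1 (`retweeted_status`, `screen_name`, and so on), which differ slightly from the spellings in the list, and the `extract_fields` helper, its output key names, and the sample tweet are all made up for the example.

```python
from typing import Any, Dict, Optional


def extract_fields(tweet: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Collect the requested metadata from one tweet (a parsed JSON object).

    Fields that are missing (for example on tweets that are neither a
    retweet nor a reply) come back as None.
    """
    user = tweet.get("user") or {}
    retweeted = tweet.get("retweeted_status") or {}
    coords = tweet.get("coordinates") or {}

    # Tweet coordinates are a GeoJSON point, ordered [longitude, latitude].
    point = coords.get("coordinates") or [None, None]

    return {
        "lang": tweet.get("lang"),
        "retweeted_status_id": retweeted.get("id"),
        "retweeted_user_screen_name": (retweeted.get("user") or {}).get("screen_name"),
        "in_reply_to_status_id": tweet.get("in_reply_to_status_id"),
        "in_reply_to_screen_name": tweet.get("in_reply_to_screen_name"),
        "followers_count": user.get("followers_count"),
        "friends_count": user.get("friends_count"),
        # Retweet count of the original (retweeted) status, not of this tweet.
        "retweeted_retweet_count": retweeted.get("retweet_count"),
        "latitude": point[1],
        "longitude": point[0],
    }


# Hypothetical sample tweet, invented for this example.
sample = {
    "id": 2,
    "lang": "en",
    "in_reply_to_status_id": None,
    "in_reply_to_screen_name": None,
    "user": {"followers_count": 120, "friends_count": 80},
    "coordinates": {"type": "Point", "coordinates": [-88.2272, 40.1106]},
    "retweeted_status": {
        "id": 1,
        "retweet_count": 57,
        "user": {"screen_name": "example_user"},
    },
}

fields = extract_fields(sample)
```

Returning `None` for absent fields keeps the record shape the same for every tweet, which may make it easier to map each entry onto an index field.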

lintool commented 11 years ago

Hi Miles,

I think these are all reasonable fields to throw into the index. I've assigned this task to you; is that OK? We need to see how big the index becomes...

Thanks!

JamesMcMinn commented 11 years ago

The list given by amjedbj seems more than enough for our needs; however, can I suggest location (including lat/long if available) as something that may be useful?

lintool commented 11 years ago

I would be -1 on location, unless someone absolutely needs it.

amjedbj commented 11 years ago

The problem with location is that there are 3 fields

Which field is useful?

lintool commented 11 years ago

lat/long I'd say

isoboroff commented 11 years ago

Also the lang slot from the tweet (not the user)?

On Wed, Apr 17, 2013 at 1:28 PM, James McMinn notifications@github.com wrote:

The list given by amjedbj seems more than enough for our needs; however, can I suggest location (including lat/long if available) as something that may be useful?

isoboroff commented 11 years ago

Definitely not user.location; that is user-fillable from the profile and would not correspond to the tweet location.

On Wed, Apr 17, 2013 at 1:50 PM, Jimmy Lin notifications@github.com wrote:

lat/long I'd say

stewhdcs commented 11 years ago

If this causes issues with the Lucene index, a viable alternative is to keep it separate in AWS DynamoDB, with tables to look up stats by userid or by tweetid. DynamoDB pricing has just been cut to $0.25 per GB/month.

amjedbj commented 11 years ago

Hashtags, mentions and URLs could be extracted (on client side) from tweet text. I solved this issue https://github.com/lintool/twitter-tools/issues/28.
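As a rough illustration of client-side extraction (not necessarily the approach taken in issue #28), hashtags, mentions and URLs can be recovered from the tweet text with a few regular expressions. The patterns and the `extract_entities` name below are assumptions made for this sketch; they are simpler than Twitter's own entity rules and will miss some edge cases.

```python
import re
from typing import Dict, List

# Simplified patterns; Twitter's real entity rules are more involved.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")


def extract_entities(text: str) -> Dict[str, List[str]]:
    """Return hashtags, mentions and URLs found in a tweet's text."""
    urls = URL_RE.findall(text)
    # Blank out URLs first so that a fragment such as "#frag" inside a
    # link is not mistaken for a hashtag.
    stripped = URL_RE.sub(" ", text)
    return {
        "hashtags": HASHTAG_RE.findall(stripped),
        "mentions": MENTION_RE.findall(stripped),
        "urls": urls,
    }


# Made-up tweet text for the example.
entities = extract_entities(
    "RT @lintool: new #TREC2013 API http://t.co/abc123#frag via @milesefron"
)
```

The lookbehind `(?<!\w)` keeps strings such as `name@example` from being read as a mention.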

dpmccul commented 11 years ago

Though you have retweet count, you do not appear to have the retweet field. We need to see if a particular tweet is a retweet, or which portion of the tweet is a retweet.

milesefron commented 11 years ago

yep. i'll do this. in most cases, the data types will be pretty obvious, i assume. i'll ping you if i see ambiguity.

and i will store created_at as a Long, corresponding to Unix epoch. that will take up less space than a string and allow easy computation of things like recency priors. make sense?
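To make the idea concrete, here is a small sketch of converting `created_at` to epoch seconds and using it in a recency prior. It assumes the `created_at` string layout used by the Twitter API v1.1 (for example `Wed Apr 17 16:19:00 +0000 2013`) and an exponential-decay prior; the decay rate and both function names are illustrative choices, not something fixed in this thread.

```python
import math
from datetime import datetime, timezone

# Assumed layout of the created_at string, e.g. "Wed Apr 17 16:19:00 +0000 2013".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def created_at_to_epoch(created_at: str) -> int:
    """Convert a created_at string to whole seconds since the Unix epoch."""
    return int(datetime.strptime(created_at, CREATED_AT_FORMAT).timestamp())


def recency_prior(tweet_epoch: int, query_epoch: int, rate_per_day: float = 0.01) -> float:
    """Exponential-decay prior: 1.0 for a brand-new tweet, smaller with age.

    rate_per_day is an arbitrary illustrative value.
    """
    age_days = max(0, query_epoch - tweet_epoch) / 86400.0
    return math.exp(-rate_per_day * age_days)


tweet_epoch = created_at_to_epoch("Wed Apr 17 16:19:00 +0000 2013")
prior_after_one_day = recency_prior(tweet_epoch, tweet_epoch + 86400)
```

With the timestamp stored as a number, the age of a tweet relative to the query time is a single subtraction, with no date parsing at query time.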

On Wed, Apr 17, 2013 at 12:19 PM, Jimmy Lin notifications@github.com wrote:

Hi Miles,

I think these are all reasonable fields to throw into the index. I've assigned this task to you; is that OK? We need to see how big the index becomes...

Thanks!

Miles Efron Assistant Professor Graduate School of Library and Information Science University of Illinois, Urbana-Champaign

lintool commented 11 years ago

sgtm

hussam123 commented 11 years ago

hussam123 commented 11 years ago

amjedbj commented 11 years ago

@dpmccul that's right: what we need to extract is the retweet count of the retweeted status (tweet.retweet_status.retweet_count). Tweets from the streaming API have only just been published and thus have a low chance of having been retweeted. @hussam123 comments, followers and friends are not available through the Streaming API. You can use the Twitter REST API (rate limited).

It seems that the lang attribute is not available for all tweets in the Tweet2013 dataset (see https://dev.twitter.com/blog/introducing-new-metadata-for-tweets).

@lintool @milesefron I updated the list of fields https://github.com/lintool/twitter-tools/issues/26#issuecomment-16518431.

telsayed commented 11 years ago

I'd also like to have the following indexed as fields:

Sorry if duplicated.

stewhdcs commented 11 years ago

@telsayed If the analyzer (Lucene's tokenization approach) is set up correctly (keeping preceding #'s and @'s), then you should be able to perform queries for these through the current index.

@JamesMcMinn is assigned task #23 to develop an appropriate analyzer. I will add your comments to that issue.
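The point about keeping the leading #'s and @'s can be shown with a toy tokenizer. This is plain Python rather than a Lucene analyzer, and the `tokenize` function and its lowercasing step are assumptions made purely for illustration; the analyzer developed under task #23 may behave differently.

```python
import re
from typing import List

# A token is a run of word characters, optionally prefixed by '#' or '@'.
TOKEN_RE = re.compile(r"[#@]?\w+")


def tokenize(text: str) -> List[str]:
    """Split text into lowercased tokens, keeping '#' and '@' prefixes."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


tokens = tokenize("Go #TREC2013 with @lintool!")
```

Because `#trec2013` and `@lintool` survive as distinct tokens, a query for the hashtag or the mention would not be conflated with a query for the bare word.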

amjedbj commented 11 years ago

In the latest version of the API specification (https://github.com/lintool/twitter-tools/wiki/TREC-2013-API-Specifications), status.retweet_status.id and status.retweet_status.user.screenname have been removed.

Even though retweets were considered irrelevant in the last two editions of the TREC Microblog track, these fields are helpful for social-network-based approaches. I used these two fields in my trec2011 and trec2012 runs.

milesefron commented 11 years ago

I'll go ahead and put the retweeted_status.id and retweeted_user_id elements back in the index.

As for the screenname element, do folks need that if we're already exposing the user_id? i assumed having one would be enough. but let me know if not.

On Mon, May 6, 2013 at 4:43 PM, Lamjed Ben Jabeur notifications@github.com wrote:

In the latest version of the API specification (https://github.com/lintool/twitter-tools/wiki/TREC-2013-API-Specifications), status.retweet_status.id and status.retweet_status.user.screenname have been removed.

Even though retweets were considered irrelevant in the last two editions of the TREC Microblog track, these fields are helpful for social-network-based approaches. I used these two fields in my trec2011 and trec2012 runs.

Latifa-AlMarri commented 11 years ago

Any chance we can play with the APIs? ....... Latifa Qatar University

lintool commented 11 years ago

This task has been completed and results have been merged into the trec2013-api branch.