Closed amjedbj closed 11 years ago
Personally, I need the following fields (sorted by importance):
1- tweet.lang (if available)
2- tweet.retweet_status.id (if retweeted)
3- tweet.retweet_status.user.screenname (if retweeted)
4- tweet.in_reply_to_status_id (if reply)
5- tweet.in_reply_to_screen_name (if reply)
6- tweet.user.followers_count
7- tweet.user.friends_count
8- tweet.retweet_status.retweet_count (if retweeted)
9- tweet.coordinates (lat/long) (if available)
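For illustration, here is a minimal sketch of pulling these fields out of one parsed status object. It assumes the key names used in Twitter's v1.1 JSON (`retweeted_status`, `screen_name`), which differ slightly from the spellings in the list above, and the function name `extract_fields` and the output keys are hypothetical, not part of twitter-tools.

```python
from typing import Any, Dict, Optional


def extract_fields(tweet: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Collect the requested fields from one parsed status object.

    Fields that do not apply (not a retweet, not a reply, no geotag) come
    back as None instead of raising KeyError.
    """
    user = tweet.get("user") or {}
    retweet = tweet.get("retweeted_status") or {}
    # The coordinates object is a GeoJSON point: [longitude, latitude].
    point = (tweet.get("coordinates") or {}).get("coordinates")
    return {
        "lang": tweet.get("lang"),
        "retweeted_status_id": retweet.get("id"),
        "retweeted_user_screen_name": (retweet.get("user") or {}).get("screen_name"),
        "in_reply_to_status_id": tweet.get("in_reply_to_status_id"),
        "in_reply_to_screen_name": tweet.get("in_reply_to_screen_name"),
        "followers_count": user.get("followers_count"),
        "friends_count": user.get("friends_count"),
        "retweeted_status_retweet_count": retweet.get("retweet_count"),
        "latitude": point[1] if point else None,
        "longitude": point[0] if point else None,
    }
```

Returning None for absent fields keeps the "if available" and "if retweeted" cases explicit, so the indexing side can decide whether to skip the field or store a default.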
Hi Miles,
I think these are all reasonable fields to throw into the index. I've assigned this task to you. We need to see how big the index becomes...
Thanks!
The list given by amjedbj seems more than enough for our needs. However, may I suggest location (including lat/long if available) as something that may be useful?
I would be -1 on location, unless someone absolutely needs it.
The problem with location is there are 3 fields
Which field is useful?
lat/long I'd say
Also the lang slot from the tweet (not the user)?
Definitely not user.location. That field is filled in by the user on their profile and would not correspond to the tweet's location.
If this causes issues with the Lucene index, a viable alternative is to keep it separate in an AWS Dynamo DB, with tables to look up stats by userid or by tweetid. Dynamo DB pricing has just been cut to $0.25 per GB/month.
Hashtags, mentions and URLs could be extracted (on client side) from tweet text. I solved this issue https://github.com/lintool/twitter-tools/issues/28.
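As a rough illustration of client-side extraction, here is a minimal regex-based sketch. The patterns are simplifying assumptions rather than Twitter's official entity rules, and the function name `extract_entities` is hypothetical.

```python
import re
from typing import Dict, List

# Simplified patterns (assumptions): '#' or '@' must not follow a word
# character, so e-mail addresses such as user@example.com are not mentions.
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")
# Greedy match up to the next whitespace; trailing punctuation is kept.
URL_RE = re.compile(r"https?://\S+")


def extract_entities(text: str) -> Dict[str, List[str]]:
    """Return the hashtags, mentions and URLs found in a tweet's text."""
    urls = URL_RE.findall(text)
    # Blank out URLs first so a '#fragment' inside a link is not a hashtag.
    stripped = URL_RE.sub(" ", text)
    return {
        "hashtags": HASHTAG_RE.findall(stripped),
        "mentions": MENTION_RE.findall(stripped),
        "urls": urls,
    }
```

For example, `extract_entities("RT @lintool: new #TREC2013 API http://t.co/abc123")` yields one mention, one hashtag and one URL, each without its leading `@` or `#`.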
Though you have retweet count, you do not appear to have the retweet field. We need to see if a particular tweet is a retweet, or which portion of the tweet is a retweet.
yep. i'll do this. in most cases, the data types will be pretty obvious, i assume. i'll ping you if i see ambiguity.
and i will store created_at as a Long, corresponding to Unix epoch. that will take up less space than a string and allow easy computation of things like recency priors. make sense?
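A small sketch of what that conversion could look like, assuming created_at arrives in the usual Twitter string form (for example "Wed Apr 17 17:28:00 +0000 2013"). The exponential decay used for the recency prior and its rate are illustrative assumptions, not something settled in this thread, and both function names are hypothetical.

```python
import math
from datetime import datetime

# Assumed layout of created_at, e.g. "Wed Apr 17 17:28:00 +0000 2013".
# %a and %b rely on English day and month names (the default C locale).
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def created_at_to_epoch(created_at: str) -> int:
    """Convert a created_at string to whole seconds since the Unix epoch."""
    return int(datetime.strptime(created_at, CREATED_AT_FORMAT).timestamp())


def recency_prior(tweet_epoch: int, query_epoch: int, rate_per_day: float = 0.01) -> float:
    """Exponential decay on tweet age in days (the rate is a made-up default)."""
    age_days = max(0, query_epoch - tweet_epoch) / 86400.0
    return math.exp(-rate_per_day * age_days)
```

With the timestamp stored as an integer, the age computation is a single subtraction, which is the kind of arithmetic a string field would make awkward.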
Miles Efron Assistant Professor Graduate School of Library and Information Science University of Illinois, Urbana-Champaign
sgtm
@dpmccul that's right, what we need to extract is the retweet count of the retweeted status (tweet.retweet_status.retweet_count). Tweets from the streaming API have only just been published and thus have a low chance of having been retweeted. @hussam123 comments, followers and friends are not available through the Streaming API. You can use the Twitter REST API (rate limited).
It seems that the lang attribute is not available for all tweets in the Tweet2013 dataset (see https://dev.twitter.com/blog/introducing-new-metadata-for-tweets).
@lintool @milesefron I updated the list of fields https://github.com/lintool/twitter-tools/issues/26#issuecomment-16518431.
I'd like to have the following indexed as fields as well:
Sorry if duplicated.
@telsayed If the analyzer (Lucene's tokenization approach) is set up correctly (keeping preceding #'s and @'s), then you should be able to perform queries for these through the current index.
@JamesMcMinn is assigned task #23 to develop an appropriate analyzer. I will add your comments to that issue.
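To make the idea concrete, here is a plain-Python stand-in for that tokenization behaviour. It is not a Lucene analyzer and not the one planned for #23; it only sketches what "keeping preceding #'s and @'s" means for the resulting terms. The name `tokenize` is hypothetical.

```python
import re
from typing import List

# A token is a run of word characters, optionally prefixed by '#' or '@'.
TOKEN_RE = re.compile(r"[#@]?\w+")


def tokenize(text: str) -> List[str]:
    """Lowercase tokens that keep a leading '#' or '@' attached."""
    return [token.lower() for token in TOKEN_RE.findall(text)]
```

Because the prefix stays on the token, a query term such as `#trec2013` matches only the hashtag and not the bare word, which is why separate hashtag and mention fields may be unnecessary under this design.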
In the latest version of the API specification https://github.com/lintool/twitter-tools/wiki/TREC-2013-API-Specifications status.retweet_status.id and status.retweet_status.user.screenname have been removed.
Even though retweets were considered irrelevant in the last two editions of the TREC Microblog track, these fields are helpful for social-network-based approaches. I used these two fields in my trec2011 and trec2012 runs.
I'll go ahead and put the retweeted_status.id and retweeted_user_id elements back in the index.
As for the screenname element, do folks need that if we're already exposing the user_id? i assumed having one would be enough. but let me know if not.
Any chance we can play with the APIs? ....... Latifa Qatar University
This task has been completed and results have been merged into the trec2013-api branch.
For the moment, only a few fields are available through the search API:
What fields do you need to replicate your run?