pablobarbera / twitter_ideology

Estimating Ideological Positions with Twitter Data
GNU General Public License v2.0

Querying followers #2

Closed briatte closed 9 years ago

briatte commented 9 years ago

Dear Pablo,

This question is possibly related to #1.

In the paper, you "discard from the sample [followers] who 1) have sent fewer than 100 tweets, 2) have not sent one tweet in the past six months, 3) have less than 25 followers, 4) are located outside the borders of the country of interest, and 5) follow less than three political Twitter accounts."

As far as I understand, criterion 5) is actually the first one that you apply in the code.

My question is: how do you apply the other criteria? I'm particularly curious about criterion 4).

I'm also curious to know how long it would actually take to retrieve the information for criteria 1–4 for a sample of several hundred thousand users, like the sample you got for the U.S. Since it implies making ~800,000 calls to https://api.twitter.com/1.1/users/show.json, wouldn't that take several thousand hours, even if you have tons of OAuth tokens available (I'm using two)?

Cheers,

Fr.

pablobarbera commented 9 years ago

@briatte Sorry - I'm just seeing this issue now. You may have been able to find a better solution for this, but just in case, here it goes...

Yes, so the first criterion I apply is number 5, which already gets rid of most of the users. Then I used a variant of getUsersBatch to download the user information for the rest. It uses the lookup.json endpoint, which allows querying users in groups of 100. So let's say you have a million users, then that's 10,000 API calls. The rate limit is 180 calls every 15 minutes, so 720 per hour, which works out to roughly 14 hours on a single token. If you split the task across multiple tokens, you should be able to get all the user data in less than 12 hours (if my math is not wrong).
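
The arithmetic above can be sketched in a few lines. This is only an illustration in Python, not the repository's R code: the names `estimate_hours` and `chunk` and all parameter names are hypothetical, and the defaults (100 users per call, 180 calls per 15-minute window) are simply the figures quoted in this comment. It also assumes that tokens can be used fully in parallel and ignores network latency and failed calls.

```python
import math


def estimate_hours(n_users, batch_size=100, calls_per_window=180,
                   window_minutes=15, n_tokens=1):
    """Rough wall-clock hours needed to look up n_users profiles.

    Defaults are the figures quoted in the thread, not values read
    from any API documentation.
    """
    # One call covers up to `batch_size` users.
    n_calls = math.ceil(n_users / batch_size)
    # Calls allowed per hour, summed over all tokens (assumed independent).
    calls_per_hour = calls_per_window * (60 / window_minutes) * n_tokens
    return n_calls / calls_per_hour


def chunk(ids, size=100):
    """Split a list of user IDs into batches of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


if __name__ == "__main__":
    print(round(estimate_hours(1_000_000), 1))               # 13.9 (one token)
    print(round(estimate_hours(1_000_000, n_tokens=2), 1))   # 6.9 (two tokens)
    print([len(b) for b in chunk(list(range(250)))])         # [100, 100, 50]
```

With two tokens, as in the original question, the estimate for a million users drops to about seven hours, which is consistent with the "less than 12 hours" figure.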

And for criterion number 4, it depends on how accurate you want to be. I usually exploit the time_zone field in users' profiles, and get rid of users with a time zone outside of the US. (Note that in many cases the time zone is missing -- I don't get rid of those, just in case.)
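
A minimal sketch of that time-zone filter, again in Python for illustration only. The function `keep_user` and the set `US_TIME_ZONES` are hypothetical, and the zone labels listed are an assumption about how US time zones appear in the `time_zone` field, so they should be checked against the values actually present in the downloaded profiles. Profiles are assumed to be dictionaries parsed from the API's JSON response.

```python
# Assumed labels for US time zones; verify against your own data.
US_TIME_ZONES = {
    "Eastern Time (US & Canada)",
    "Central Time (US & Canada)",
    "Mountain Time (US & Canada)",
    "Pacific Time (US & Canada)",
    "Alaska",
    "Hawaii",
    "Arizona",
    "Indiana (East)",
}


def keep_user(profile, allowed=US_TIME_ZONES):
    """Return True if the user should stay in the sample.

    Users with a missing or empty time_zone are kept, as described
    above; only users with a known non-US time zone are dropped.
    """
    tz = profile.get("time_zone")
    if not tz:
        return True
    return tz in allowed


if __name__ == "__main__":
    profiles = [
        {"id": 1, "time_zone": "Eastern Time (US & Canada)"},
        {"id": 2, "time_zone": "London"},
        {"id": 3, "time_zone": None},
    ]
    print([p["id"] for p in profiles if keep_user(p)])  # [1, 3]
```

Keeping users with a missing time zone trades some precision for a larger sample; dropping them instead only requires returning False in the first branch.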