ptwobrussell / Mining-the-Social-Web-2nd-Edition

The official online compendium for Mining the Social Web, 2nd Edition (O'Reilly, 2013)
http://bit.ly/135dHfs

9.4 + 9.7 Twitter Search + Saving to MongoDB #212

Open curtiswallen opened 10 years ago

curtiswallen commented 10 years ago

For some reason, no matter what value I pass for max_results it always collects 200 tweets, no more, no less.

Code:

import twitter
import json
import io
import pymongo

def oauth_login():

    CONSUMER_KEY = 'XXXXXXXXXXXXXXXXXXXXXXXXx'
    CONSUMER_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
    OAUTH_TOKEN = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
    OAUTH_TOKEN_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                               CONSUMER_KEY, CONSUMER_SECRET)

    twitter_api = twitter.Twitter(auth=auth)
    return twitter_api

def twitter_search(twitter_api, q, max_results=1000, **kw):

    search_results = twitter_api.search.tweets(q=q, count=100, **kw)

    statuses = search_results['statuses']

    max_results = min(1000, max_results)
    tweet_count = 0

    for _ in range(10):
        try:
            next_results = search_results['search_metadata']['next_results']
        except KeyError, e: 
            break

        kwargs = dict([ kv.split('=') 
                        for kv in next_results[1:].split("&") ])

        search_results = twitter_api.search.tweets(**kwargs)
        statuses += search_results['statuses']

        tweet_count += 100
        print tweet_count

        if len(statuses) > max_results: 
            break

    return statuses

def save_to_mongo(data, mongo_db, mongo_db_coll, **mongo_conn_kw):

    client = pymongo.MongoClient(**mongo_conn_kw)
    db = client[mongo_db]
    coll = db[mongo_db_coll]

    return coll.insert(data)

twitter_api = oauth_login()
print "Authed to Twitter. Searching now..."

q = "#ISIS"
results = twitter_search(twitter_api, q, max_results=1000)
print "Results retrieved. Saving to MongoDB..." 

save_to_mongo(results, 'search_results', q)

In the terminal I get:

Authed to Twitter. Searching now...
100
200
Results retrieved. Saving to MongoDB...

Then when I check the DB, 200 results. Every time. I've tried passing "10" for max_results, still 200. I've tried passing "1000" for max_results (as shown), still 200.

Thoughts?

ptwobrussell commented 10 years ago

In terms of why you never get more than 200 results, it is entirely possible that Twitter is limiting search results to a maximum of 200 at this point in time (subject to their platform's operational capacity). Per their own API docs [1], the code looks for the 'next_results' node in the response and bails out when it doesn't find it, since that node is the way you're supposed to navigate to the next batch of results.

In terms of why you always get 200 results instead of fewer (say, the 10, 100, or 142 results specified by the max_results parameter), I just noticed that the twitter_search function returns statuses unsliced; it should technically have been written as statuses[:max_results], so as to slice off just what you asked for instead of returning whatever it happened to collect (the loop is optimized for maximum volume).

Does that help? In the former case, I think it's just a current (possibly semi-permanent, who knows?) limitation of the Search API, where Twitter has been known to adjust API responses as needed to maintain platform performance; I can't see a problem with the code as written, though maybe I just have a blind spot. In the latter case, it's a mostly harmless bug where the list slice is missing.

[1] https://dev.twitter.com/docs/api/1.1/get/search/tweets
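For concreteness, here is a minimal sketch of that fix applied to the tail end of twitter_search above; the only substantive change is the sliced return (I've also tightened the loop's stopping condition to >=, which is my own small tweak, not something from the book):

        # Inside the pagination loop: stop as soon as enough
        # statuses have accumulated.
        if len(statuses) >= max_results:
            break

    # On the way out, slice the list so the caller receives at most
    # max_results statuses rather than whole batches of 100.
    return statuses[:max_results]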


curtiswallen commented 10 years ago

That makes sense. Thanks!

So then, a follow-up question: if I run the request multiple times (scraping 200 tweets at a time), can I prevent collecting duplicate results?

Is there a way to pull a 'next_results' node from the last tweet stored to the DB? So I could crawl back through the history of the query?

Or is that something I'll need to figure out on my own? ;-)

ptwobrussell commented 10 years ago

The best advice I can offer at this very moment is to carefully review the official Search API docs at https://dev.twitter.com/docs/api/1.1/get/search/tweets, since the API client used in the code is literally just a thin Pythonic wrapper around that API. In other words, the API doc is the authority, and we'd need to do the same tinkering and experimenting that it sounds like you're already doing to get to the bottom of some of these things.

I think your best bet is probably to make sure that tweets are keyed on their tweet id, so that you can trivially avoid duplicates by overwriting any pre-existing document with the info you get in subsequent batches. Or filter out duplicates at query time. Whichever is easiest for you.
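As a minimal sketch of the first approach (assuming pymongo 2.x; save_to_mongo_dedup is a hypothetical variant of the save_to_mongo function above, not something from the book), you can promote each tweet's id to the document's _id and upsert:

    import pymongo

    def save_to_mongo_dedup(statuses, mongo_db, mongo_db_coll, **mongo_conn_kw):
        # Key each tweet on its tweet id so that re-running the search
        # overwrites duplicates instead of inserting them a second time.
        client = pymongo.MongoClient(**mongo_conn_kw)
        coll = client[mongo_db][mongo_db_coll]
        for status in statuses:
            status['_id'] = status['id']  # the tweet id becomes the key
            coll.update({'_id': status['_id']}, status, upsert=True)

Saving each batch this way leaves the collection with one document per unique tweet, no matter how many times you re-run the search.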


curtiswallen commented 10 years ago

Cheers! Thanks so much, Matthew.

Love love love the book, and I tremendously admire/appreciate both your activity on github and all the work you've done to make the concepts and content so accessible. Can't wait to see what's next!

ptwobrussell commented 10 years ago

Thanks! So glad to hear it. Once you work through things some more, I'd love to hear more about your work and what helped or didn't help. An Amazon review is also a luxury these days, if you have a few moments to leave one at some point. Thanks again for the encouraging words.


LisaCastellano commented 10 years ago

Thank you, Matthew, for your amazing work. I bought both Mining the Social Web, 2nd Edition and Dojo: The Definitive Guide! I love your books and the way you've organized the content and the exercises. I followed your instructions for setting up the VirtualBox VM with Vagrant, Python, etc.: great, it works! In addition, I installed Django and I'm working with Python via the web.

I had exactly the same issue as Curtis with the Twitter API exercises: 200 statuses returned.

Now it seems to be working. My problem was that next_results was URL-encoded twice: the hashtag #Obama became %23Obama the first time and %25%23Obama the second time, so the third API call did not find any statuses. That's why I only ever had 200 results.

So I replaced the statement below:

kwargs = dict([ kv.split('=') for kv in next_results[1:].split("&") ])

with the following, after importing urlparse in my .py file:

    import urlparse

    next_results = urlparse.parse_qsl(next_results[1:])
    kwargs = dict(next_results)
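To see why this fixes it: parse_qsl percent-decodes the query-string values, whereas the naive split('=') leaves them encoded, so the next call to search.tweets() encodes them a second time. A quick illustration (this next_results string is a made-up example):

    import urlparse

    next_results = '?max_id=12345&q=%23Obama'

    # Naive split: the value stays percent-encoded, so passing it back
    # to search.tweets() encodes it again and '%23' becomes '%2523'.
    print dict([kv.split('=') for kv in next_results[1:].split('&')])
    # -> {'q': '%23Obama', 'max_id': '12345'} (key order may vary)

    # parse_qsl decodes the value, so it round-trips cleanly.
    print dict(urlparse.parse_qsl(next_results[1:]))
    # -> {'q': '#Obama', 'max_id': '12345'}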

Hope it can help. It's a pity that I cannot test again for a while, since I've reached the Twitter Search API rate limits. :(

Waiting for your next books!

ptwobrussell commented 10 years ago

Thanks so much for this update. I'll take a closer look and update the code in the repo soon.

nietzschetmh commented 10 years ago

Thanks a lot, LisaCastellano. Your solution works great for me!