ptwobrussell / Mining-the-Social-Web

The official online compendium for Mining the Social Web (O'Reilly, 2011)
http://bit.ly/135dHfs

example 4-8 #18

Closed mfountoulakis closed 11 years ago

mfountoulakis commented 12 years ago

Could you please help me understand why I get the following output with example 4-8?

File "friendsfollowerscrawl.py", line 65, in crawl([SCREEN_NAME]) File "friendsfollowerscrawl.py", line 59, in crawl getUserInfo(user_ids=next_queue) TypeError: getUserInfo() takes at least 2 arguments (1 given)

I just don't understand what's missing here or what I could do to fix it. I'd appreciate any input.

-Manos

ptwobrussell commented 12 years ago

I can't test this right now, but it appears that there is a missing second parameter in the call to getUserInfo that is the connection to Redis. Try updating line 65 to be

getUserInfo(user_ids=next_queue, r)

and let me know how that works for you. I think it should fix the issue, which must be a regression that occurred during some of the code clean up and refactoring. (It's a bit puzzling how that parameter came to be missing.)

mfountoulakis commented 12 years ago

Thanks for the quick response! Unfortunately, i've gotten the following output:

Traceback (most recent call last):
  File "friendsfollowerscrawl.py", line 65, in <module>
    crawl([SCREEN_NAME])
  File "friendsfollowerscrawl.py", line 59, in crawl
    getUserInfo(r, user_ids=next_queue,)
TypeError: getUserInfo() takes at least 2 arguments (2 given)

Could this be an issue with twitter__util?

ptwobrussell commented 12 years ago

I'm sorry, I misread that code, try this:

getUserInfo(t, r, user_ids=next_queue)

Basically, you invoke this call just like it's done a couple of times earlier in the file. At some point during a refactor, the t and r parameters must have gotten left off. Please let me know if this works for you.
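
To spell it out, the failing call changes from

getUserInfo(user_ids=next_queue)

to

getUserInfo(t, r, user_ids=next_queue)

where t is the Twitter API connection and r is the Redis connection.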

mfountoulakis commented 12 years ago

Sorry, didn't want to close the issue without updating you. getUserInfo(t, r, user_ids=next_queue) did the trick. Thank you very much, Matthew. I think we can cross that off the list.

ptwobrussell commented 12 years ago

Awesome. Fix is now checked in on commit f77b7437d1603c1b36e72c54bbb7e445e89d3a3e

mfountoulakis commented 12 years ago

Hi Matthew, sorry to bring this to your attention again, but the code for example 4-8 still doesn't seem to be working for me. I guess the (t, r, user_ids=next_queue) fix only addressed part of the problem. I'm still getting an error:

Traceback (most recent call last):
  File "friendsfollowerscrawl.py", line 66, in <module>
    crawl([SCREEN_NAME])
  File "friendsfollowerscrawl.py", line 59, in crawl
    getUserInfo(t, r, user_ids=next_queue)
  File "/Users/Manos/Desktop/PC/twitter__util.py", line 136, in getUserInfo
    user_id=user_ids_str)
  File "/Users/Manos/Desktop/PC/twitter__util.py", line 21, in makeTwitterRequest
    wait_period = handleTwitterHTTPError(e, t, wait_period)
  File "/Users/Manos/Desktop/PC/twitter__util.py", line 58, in handleTwitterHTTPError
    time.sleep(sleep_time)
IOError: [Errno 22] Invalid argument

I don't really know what is causing the error, but it occurs after the rate limit is reached. As always, I would appreciate any insight that you could give me.

ptwobrussell commented 12 years ago

The formatting on your paste may be masking it, but in twitter__util.py, on line 57, just before the line causing the error, there's a statement

print >> sys.stderr, 'Rate limit reached: sleeping for %i secs' % (sleep_time, )

and it would be very helpful to know what it prints out.

Basically, all requests to Twitter are proxied through makeTwitterRequest in twitter__util so that when an issue like the rate limit being exceeded or the fail whale happens, the exception can be caught and a series of retries can begin. From within makeTwitterRequest, handleTwitterHTTPError is called (the overall flow is sketched just after the block below). Adding some additional logging in the following elif block would likely be very helpful in diagnosing:

elif _getRemainingHits(t) == 0:
    status = t.account.rate_limit_status()
    now = time.time()  # UTC
    when_rate_limit_resets = status['reset_time_in_seconds']  # UTC
    sleep_time = when_rate_limit_resets - now
    print >> sys.stderr, 'Rate limit reached: sleeping for %i secs' % (sleep_time, )
    time.sleep(sleep_time)
    return 2
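
For reference, the overall shape of that wrapper is roughly the following (a simplified sketch of the flow described above, not the actual twitter__util.py code; the real version handles additional error cases):

from twitter.api import TwitterHTTPError

def makeTwitterRequest(t, twitterFunc, *args, **kwargs):
    # Simplified sketch: every API call funnels through this one wrapper so
    # that rate-limit (and similar) errors can be caught and retried in one place.
    wait_period = 2
    while True:
        try:
            return twitterFunc(*args, **kwargs)
        except TwitterHTTPError, e:
            # handleTwitterHTTPError (defined alongside this in twitter__util.py)
            # sleeps as needed and returns an updated wait period, or re-raises.
            wait_period = handleTwitterHTTPError(e, t, wait_period)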

I'm sorry that this is happening and will do all that I can to help you. In writing the book, I'll admit that debugging this code was one of the trickier parts. This weekend, I can set aside some time to try to recreate the problem if the extra logging doesn't help us dig into it before then. Hang in there, we'll get through it.

mfountoulakis commented 12 years ago

Thanks for the support and for your follow-up! It seems like the statement

print >> sys.stderr, 'Rate limit reached: sleeping for %i secs' % (sleep_time, )

prints the following output:

Rate limit reached: sleeping for 1939 secs
Rate limit reached: sleeping for -1 secs

ptwobrussell commented 12 years ago

Awesome. This is very helpful. So, what's happening is that the calculation of how long to sleep until the next attempt to hit the Twitter API is somehow producing a negative number. I'll take a closer look and report back soon on whether or not I'm able to figure out how this is happening (or suggest a workaround).
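
To make that concrete, here's the failure mode in miniature (the numbers are made up, but the mechanics match your traceback):

import time

now = time.time()
when_rate_limit_resets = now - 1.5         # the reset time Twitter reports is already in the past
sleep_time = when_rate_limit_resets - now  # -1.5
print 'Rate limit reached: sleeping for %i secs' % (sleep_time, )  # prints "... -1 secs"
time.sleep(sleep_time)                     # IOError: [Errno 22] Invalid argument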

mfountoulakis commented 12 years ago

Great! Thanks very much for the explanation. I'd love to see what you find.

ptwobrussell commented 12 years ago

Actually, here's a workaround, and possibly the final solution, if the issue is somehow due to clock skew related to rounding (possible, but I'm not completely sure yet, although nothing else obvious is jumping out at me).

Make line 56 of twitter__util.py be this:

sleep_time = max(when_rate_limit_resets - now, 5)

What that line does is guarantee that sleep_time will never be a negative number, which seemed to be the source of your problem; if that case ever comes up, it'll be 5 instead. This seems like a reasonable fix, and I hope it helps you. Please follow up with any more details. Always glad to help you work through these things as they crop up.
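
Applied to the elif block quoted earlier, the patched version reads:

elif _getRemainingHits(t) == 0:
    status = t.account.rate_limit_status()
    now = time.time()  # UTC
    when_rate_limit_resets = status['reset_time_in_seconds']  # UTC
    sleep_time = max(when_rate_limit_resets - now, 5)  # never sleep for a negative duration
    print >> sys.stderr, 'Rate limit reached: sleeping for %i secs' % (sleep_time, )
    time.sleep(sleep_time)
    return 2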

I'm going to think about this a bit more before I make that commit, but may go ahead and commit it if I can't come up with any better theories.

mfountoulakis commented 12 years ago

Hi again,

Just out of curiosity, why would sleep_time be a negative number in the first place? Also, changing the sleep time has definitely made the code more resilient in terms of crawling user IDs. Is there any way to prevent the code from exceeding the rate limit? Here is the beast of an output that I've been getting:

Traceback (most recent call last):
  File "friendsfollowerscrawl.py", line 66, in <module>
    crawl([SCREEN_NAME])
  File "friendsfollowerscrawl.py", line 59, in crawl
    getUserInfo(t, r, user_ids=next_queue)
  File "/Users/Manos/Desktop/PC/twitter__util.py", line 136, in getUserInfo
    user_id=user_ids_str)
  File "/Users/Manos/Desktop/PC/twitter__util.py", line 21, in makeTwitterRequest
    wait_period = handleTwitterHTTPError(e, t, wait_period)
  File "/Users/Manos/Desktop/PC/twitter__util.py", line 61, in handleTwitterHTTPError
    raise e
twitter.api.TwitterHTTPError: Twitter sent status 400 for URL: 1/users/lookup.json using parameters: (oauth_consumer_key=&oauth_nonce=16419692833626363961&oauth_signature_method=HMAC-SHA1&oauth_timestamp=1333007434&oauth_token=522590280-d3LcY61bPjLcWwPBG5o8eHlUNt5rQqfZpARvXHsE&oauth_version=1.0&user_id=312278554%2C86917711%2C299820816%2C147046496%2C140836096%2C316333875%2C101872402%2C373979565%2C18108884%2C338385671%2C223425988%2C376870056%2C373390109%2C95685705%2C157414115%2C114553137%2C227915232%2C316967288%2C150627312%2C363317591%2C136109701%2C357164566%2C219372333%2C137802656%2C193853430%2C28284188%2C268209494%2C375296444%2C375285196%2C260014814%2C374884940%2C358037968%2C61348274%2C20353443%2C356935736%2C255088442%2C30684830%2C240958548%2C166866658%2C374554337%2C377403664%2C312576822%2C31505324%2C194095748%2C135254445%2C267310501%2C195957817%2C204883561%2C15161281%2C267341756%2C271398153%2C208576528%2C137803498%2C143470777%2C350803001%2C229058830%2C232342893%2C179149969%2C294846379%2C364421494%2C264128201%2C56468304%2C927061%2C376232240%2C112160065%2C376895048%2C286629098%2C91770244%2C76745868%2C375411866%2C340108817%2C200417521%2C351718262%2C91306129%2C55685635%2C156356678%2C39547164%2C70624382%2C352997492%2C278864628%2C29306178%2C377206070%2C374566288%2C22614712%2C367781251%2C319564181%2C253154527%2C34882518%2C359004324%2C14769529%2C8240332%2C172023201%2C358316995%2C370149371%2C374655059%2C224502434%2C248406920%2C356548292%2C310910431%2C334237339&oauth_signature=xnRPypGZxUk7b2I6%2F1wix3N9US0%3D) details: {"error":"Rate limit exceeded.
Clients may not make more than 350 requests per hour.","request":"\/1\/users\/lookup.json?oauth_consumer_key=&oauth_nonce=16419692833626363961&oauth_signature_method=HMAC-SHA1&oauth_timestamp=1333007434&oauth_token=522590280-d3LcY61bPjLcWwPBG5o8eHlUNt5rQqfZpARvXHsE&oauth_version=1.0&user_id=312278554%2C86917711%2C299820816%2C147046496%2C140836096%2C316333875%2C101872402%2C373979565%2C18108884%2C338385671%2C223425988%2C376870056%2C373390109%2C95685705%2C157414115%2C114553137%2C227915232%2C316967288%2C150627312%2C363317591%2C136109701%2C357164566%2C219372333%2C137802656%2C193853430%2C28284188%2C268209494%2C375296444%2C375285196%2C260014814%2C374884940%2C358037968%2C61348274%2C20353443%2C356935736%2C255088442%2C30684830%2C240958548%2C166866658%2C374554337%2C377403664%2C312576822%2C31505324%2C194095748%2C135254445%2C267310501%2C195957817%2C204883561%2C15161281%2C267341756%2C271398153%2C208576528%2C137803498%2C143470777%2C350803001%2C229058830%2C232342893%2C179149969%2C294846379%2C364421494%2C264128201%2C56468304%2C927061%2C376232240%2C112160065%2C376895048%2C286629098%2C91770244%2C76745868%2C375411866%2C340108817%2C200417521%2C351718262%2C91306129%2C55685635%2C156356678%2C39547164%2C70624382%2C352997492%2C278864628%2C29306178%2C377206070%2C374566288%2C22614712%2C367781251%2C319564181%2C253154527%2C34882518%2C359004324%2C14769529%2C8240332%2C172023201%2C358316995%2C370149371%2C374655059%2C224502434%2C248406920%2C356548292%2C310910431%2C334237339&oauth_signature=xnRPypGZxUk7b2I6%2F1wix3N9US0%3D"}

ptwobrussell commented 12 years ago

sleep_time definitely should not be a negative number. Looking at the calculation again, what I think is happening is that (when_rate_limit_resets - now) can be negative more often than I originally anticipated: anytime now > when_rate_limit_resets, it works out to a negative number. The max() guard around it fixes this, so I may go ahead and commit it in a moment.

You could refactor the code to try and never have the rate limit exceeded by making requests at a slower pace, but the approach I took was to harvest as much as possible as fast as possible and then wait until Twitter allows us to do more. Either way, you are bounded by whatever their limit per unit time is.
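
If you did want to pace the requests instead, the simplest sketch of that idea (a hypothetical helper, not code from the book) is to spread calls out so the hourly cap is never reached:

import time

REQUESTS_PER_HOUR = 350                                # Twitter's stated cap
SECONDS_BETWEEN_REQUESTS = 3600.0 / REQUESTS_PER_HOUR  # a little over 10 seconds

def throttled(twitterFunc, *args, **kwargs):
    # Hypothetical throttle: make the call, then sleep long enough that the
    # hourly request cap can never be hit. Slower overall, but no 400 errors.
    result = twitterFunc(*args, **kwargs)
    time.sleep(SECONDS_BETWEEN_REQUESTS)
    return result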

But it looks like you are still getting a 400 error? It's easy enough to add an additional case to handle this in handleTwitterHTTPError, but it's hard to imagine how it would be happening, because the code shouldn't be making requests when there aren't any remaining requests left. This might be a subtle error in the logic, but it could possibly be a small delay in propagating information on Twitter's backend. I may go ahead and commit a patch for it as well later.
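
Something like the following additional branch in handleTwitterHTTPError is what I have in mind (a sketch only; it assumes the underlying HTTP status is exposed as e.e.code, which I'd verify before committing anything):

elif e.e.code == 400:
    # Sketch: Twitter returned 400 ("Rate limit exceeded") even though hits
    # appeared to remain; back off for the current wait period and double it.
    print >> sys.stderr, 'Got a 400 error: sleeping for %i secs and retrying' % (wait_period, )
    time.sleep(wait_period)
    return wait_period * 2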

mfountoulakis commented 12 years ago

I'm still running the crawl. It looks like you're right that this is an efficiency issue. As per your advice, I will adjust twitter__util to run at a slower rate, and I think that will solve the biggest issue:

Rate limit reached: sleeping for 999 secs
Rate limit reached: sleeping for 3599 secs

The rate limit is exceeded consecutively. The only other thing is the 400 error, which I still receive, but it seems like you're looking into it. Thanks!

ptwobrussell commented 11 years ago

I believe we effectively addressed this issue by ensuring that the value of sleep_time can never be negative. Closing.