Closed mfountoulakis closed 11 years ago
I can't test this right now, but it appears that there is a missing second parameter in the call to getUserInfo that is the connection to Redis. Try updating line 65 to be
getUserInfo(user_ids=next_queue, r)
and let me know how that works for you. I think it should fix the issue, which must be a regression that occurred during some of the code clean up and refactoring. (It's a bit puzzling how that parameter came to be missing.)
Thanks for the quick response! Unfortunately, I've gotten the following output:
Traceback (most recent call last):
File "friendsfollowerscrawl.py", line 65, in
Could this be an issue with twitter__util?
I'm sorry, I misread that code, try this:
getUserInfo(t, r, user_ids=next_queue)
Basically, you invoke this call just like it's done a couple of times earlier in the file. For some reason during a refactor, the t and r parameters must have gotten left off somehow. Please let me know if this works for you.
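For anyone following along, here's a minimal sketch of why the original call failed. The signature below is a hypothetical stand-in for the real getUserInfo, shown in Python 3 syntax; only the argument shapes matter:

```python
# Hypothetical stand-in: like the real getUserInfo, it requires the Twitter
# API handle (t) and the Redis connection (r) as positional parameters.
def getUserInfo(t, r, user_ids=None):
    return (t, r, user_ids)

try:
    getUserInfo(user_ids=[1, 2, 3])  # missing t and r, as on the broken line 59
except TypeError as e:
    print(e)  # Python reports the missing required positional arguments

# The corrected call passes both connections explicitly:
result = getUserInfo('twitter-handle', 'redis-conn', user_ids=[1, 2, 3])
print(result)
```

The keyword argument alone can never satisfy the two required positional parameters, which is exactly what the `takes at least 2 arguments (1 given)` message is saying.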
Sorry, didn't want to close the issue without updating you. getUserInfo(t, r, user_ids=next_queue) did the trick. Thank you very much, Matthew. I think we can cross that off the list.
Awesome. Fix is now checked in on commit f77b7437d1603c1b36e72c54bbb7e445e89d3a3e
Hi Matthew, sorry to bring this to your attention again but the code for example 4-8 still seems like it isn't working for me. I guess the (t, r, user_ids=next_queue) only fixed a part of the problem. I'm still getting an error output:
Traceback (most recent call last):
File "friendsfollowerscrawl.py", line 66, in
I don't really know what is causing the error, but it occurs after the rate limit is reached. As always, I would appreciate any insight that you could give me.
The formatting on your paste may be masking it, but in twitter__util.py, on line 57, just before the line causing the error, there's a statement
print >> sys.stderr, 'Rate limit reached: sleeping for %i secs' % (sleep_time, )
and it would be very helpful to know what it prints out.
Basically, all requests to Twitter are proxied through makeTwitterRequest in twitter__util so that when an issue like an exceeded rate limit or the fail whale happens, the exception can be caught and a series of retries can begin. From within makeTwitterRequest, handleTwitterHTTPError is called. Adding some additional logging in the following elif block would likely be very helpful in diagnosing:
elif _getRemainingHits(t) == 0:
    status = t.account.rate_limit_status()
    now = time.time()  # UTC
    when_rate_limit_resets = status['reset_time_in_seconds']  # UTC
    sleep_time = when_rate_limit_resets - now
    print >> sys.stderr, 'Rate limit reached: sleeping for %i secs' % (sleep_time, )
    time.sleep(sleep_time)
    return 2
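To make the failure mode in that subtraction concrete, here is a small sketch (Python 3 syntax, invented epoch values) of how the sleep calculation goes negative whenever the local clock is ahead of the reset time the API reports:

```python
# Invented epoch values for illustration only.
when_rate_limit_resets = 1000000  # UTC seconds reported by the API
now = 1000001                     # local clock, one second ahead of the API
sleep_time = when_rate_limit_resets - now
print(sleep_time)  # -1
```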
I'm sorry that this is happening and will do all that I can to help you. In writing the book, I'll admit that debugging this code was one of the trickier parts. This weekend, I can try to set aside some time to recreate the problem if some logging doesn't help us dig into it before then. Hang in there, we'll get through it.
Thanks for the support and for your follow-up! It seems like the statement
print >> sys.stderr, 'Rate limit reached: sleeping for %i secs' % (sleep_time, )
prints the following output:
Rate limit reached: sleeping for 1939 secs
Rate limit reached: sleeping for -1 secs
Awesome. This is very helpful. So, what's happening is that the calculation for how long to sleep till the next attempt to hit the Twitter API is somehow producing a negative number. I'll take a closer look and report back soon on whether or not I'm able to easily figure out how this is happening (or suggest a workaround.)
Great! thanks very much for the explanation. I'd love to see what you find.
Actually, here's a workaround, and what might be the final solution if the issue is somehow due to clock skew related to rounding (possible, but I'm not completely sure yet, although nothing else obvious is jumping out at me).
Make line 56 of twitter__util.py be this:
sleep_time = max(when_rate_limit_resets - now, 5)
What that line will do is guarantee that sleep_time will never be a negative number, which seemed to be the source of your problem. Instead, if that case ever happens, it'll be 5. This seems like a reasonable fix, and I hope it helps you. Please follow up with any more details. Always glad to help you work through these things as they creep up.
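As a quick sanity check, the guard behaves like this (a sketch in Python 3 syntax; the helper name is invented, but the expression is the one from line 56):

```python
def safe_sleep_time(when_rate_limit_resets, now):
    # Clamp to a 5-second floor so a skewed clock can never
    # produce a negative sleep value.
    return max(when_rate_limit_resets - now, 5)

print(safe_sleep_time(100, 40))   # 60: normal case, sleep until the reset
print(safe_sleep_time(100, 101))  # 5: clock-skew case, the floor kicks in
```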
I'm going to think about this a bit more before I make that commit, but may go ahead and commit it if I can't come up with any better theories.
Hi again,
Just out of curiosity, why would sleep_time be a negative number in the first place? Also, changing the sleep time has definitely made the code more resilient in terms of crawling user IDs. Is there any way to prevent the code from exceeding the rate limit? Here is the beast of an output that I've been getting:
Traceback (most recent call last):
File "friendsfollowerscrawl.py", line 66, in
sleep_time definitely should not be a negative number, and from looking at the calculation again, what I think is happening is that the calculation (when_rate_limit_resets - now) can be negative more times than I originally anticipated -- anytime now > when_rate_limit_resets, it'll work out to be a negative number. So, the max() guard around it fixes this. I may go ahead and commit this in a moment.
You could refactor the code to try and never have the rate limit exceeded by making requests at a slower pace, but the approach I took was to harvest as much as possible as fast as possible and then wait until Twitter allows us to do more. Either way, you are bounded by whatever their limit per unit time is.
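If you did want the slower-pace approach, one hypothetical way to sketch it is to spread the remaining allowed requests evenly across the rate-limit window. The function and parameter names below are invented for illustration, not from the book's code:

```python
import time

def paced(requests, window_secs, remaining_hits, sleep=time.sleep):
    # Space calls evenly so the window can't be exhausted early:
    # with 10 secs left and 5 hits allowed, wait 2 secs between calls.
    delay = window_secs / float(remaining_hits)
    results = []
    for make_request in requests:
        results.append(make_request())
        sleep(delay)
    return results
```

The trade-off is exactly what's described above: you never sleep in one long block, but total throughput is still bounded by Twitter's limit per unit time.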
But you are still getting a 400 error, it looks like? It's easy enough to add an additional case to handle this in handleTwitterHTTPError, but it's hard to imagine how it would be happening, because the code shouldn't be making requests when there aren't any remaining requests left. This might be a subtle error in the logic, but it could possibly be a small delay in propagating information on Twitter's backend. I may go ahead and commit a patch for it as well later.
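One hypothetical shape for that extra case, treating a 400 as a transient lag in Twitter's rate-limit bookkeeping and backing off exponentially between retries (the function and parameter names here are invented for illustration):

```python
import time

def backoff_on_400(status_code, wait_period=2, sleep=time.sleep):
    # Treat a 400 as transient: wait, then tell the caller to double
    # the wait before the next retry. Other codes leave it unchanged.
    if status_code == 400:
        sleep(wait_period)
        return wait_period * 2
    return wait_period
```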
I'm still running the crawl. It looks like you're right in that this is an efficiency issue. As per your advice, I will adjust twitter__util to run at a slower rate, and I think that will solve the biggest issue:
Rate limit reached: sleeping for 999 secs
Rate limit reached: sleeping for 3599 secs
The rate limit is exceeded consecutively. The only other thing is the 400 error, which I still receive, but it seems like you're looking into it. Thanks!
I believe we effectively addressed this issue by ensuring that the value of sleep_time can never be negative. Closing.
Could you please help me understand why I get the following output with example 4-8?
File "friendsfollowerscrawl.py", line 65, in
crawl([SCREEN_NAME])
File "friendsfollowerscrawl.py", line 59, in crawl
getUserInfo(user_ids=next_queue)
TypeError: getUserInfo() takes at least 2 arguments (1 given)
I just don't understand what's missing here or what I could do to fix it. I'd appreciate any input.
-Manos