ptwobrussell / Mining-the-Social-Web-2nd-Edition

The official online compendium for Mining the Social Web, 2nd Edition (O'Reilly, 2013)
http://bit.ly/135dHfs

9.4 Search - next_results absent when there are more results #181

Closed DanDarkly closed 10 years ago

DanDarkly commented 10 years ago

Apologies if this has been discussed, but I couldn't find it in the closed issues. Also, I'm running this code as a stand-alone script, not in the VM. I doubt that matters, but thought I should mention it.

This is probably a problem with the search API rather than the recipe, but it seems to randomly fail to return a 'next_results' field in the 'search_metadata' when there are obviously more tweets out there.

I just did a search on 'obama' with a range of 10 (so theoretically 1000 results after 10 iterations). I think we can all agree there are a lot more than 1000 'obama' tweets, but when I run the script it breaks after a seemingly random number of iterations because search_results['search_metadata']['next_results'] is absent.

For example, I just ran the script 3 times and it returned 99 tweets the first two times and 499 the third time. Nothing was changed in the code. To make sure I hadn't screwed something up, I then cut and pasted the code directly from here:

https://rawgit.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/master/ipynb/html/Chapter%209%20-%20Twitter%20Cookbook.html

I changed the search query to 'obama' and the max_results to 1000 but other than that I touched nothing. I ran it twice and the first time it returned 99 tweets, and the second time it returned 999.

Is this some sort of known issue with the API? Would it be better to change the break condition to use max_id somehow?

ptwobrussell commented 10 years ago

Dan - Thanks for mentioning this. I think you're right that the Twitter API does some wonky things at times. There is certainly plenty of room to catch additional kinds of exceptions, and I'll be giving this some thought since I'm sure it'll affect other people as well.

I'm trying to strike a balance between code that is simple enough to teach the concepts and code that is robust enough to handle this kind of thing. If you have any good ideas for patches that you want to submit via a pull request, I'd welcome them. In the meantime, I'll be thinking on it myself...

DanDarkly commented 10 years ago

I have a really inelegant max_id solution working, but I don't think it would fit with your very clean code. I'll keep looking at it though.

Thanks for the response, and I love the book.

DanDarkly commented 10 years ago

This is, as I said, not at all elegant, but in case anyone else is looking at this, what I did was create a function to return the lowest tweet id in a group of returned tweets and use that to check if the lowest tweet id continues to go down with each set of returned tweets. You use the lowest tweet id in the max_id parameter of the search query to return a new set of older tweets.

Once Twitter runs out of older tweets, it just seems to return the last set of results over and over, so the lowest tweet id will start to repeat and then you should break out of the loop.

I'm quite the amateur at this, so I have no doubt there is an easier solution, but this seems to work. I'm also pretty sure I'll eventually figure out how to replace the id checking function with a simple list comprehension, but right now it's eluding me.
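
For anyone who wants to try the approach described above, here is a minimal sketch rather than the code actually used in this thread. It assumes a hypothetical `search` callable standing in for something like `twitter_api.search.tweets`, taking `q`, `count` and an optional `max_id` and returning a dict with a 'statuses' list. The names `lowest_id` and `search_older` are made up for illustration. The id-checking function reduces to `min()` over a generator expression. One deliberate difference from the description: since max_id is inclusive, the sketch passes the lowest id minus one so the boundary tweet is not fetched twice, and it also stops on an empty batch.

```python
def lowest_id(statuses):
    """Return the smallest tweet id in a batch of statuses."""
    return min(status["id"] for status in statuses)


def search_older(search, query, max_results, count=100):
    """Page backwards through search results using max_id.

    `search` is a hypothetical stand-in for the real API call; it must
    accept q, count and an optional max_id keyword argument and return
    a dict containing a 'statuses' list.
    """
    statuses = []
    previous_lowest = None
    kwargs = {"q": query, "count": count}

    while len(statuses) < max_results:
        batch = search(**kwargs)["statuses"]
        if not batch:
            break  # nothing older was returned

        current_lowest = lowest_id(batch)
        if current_lowest == previous_lowest:
            break  # same batch came back again, so stop

        statuses.extend(batch)
        previous_lowest = current_lowest
        # max_id is inclusive, so ask for ids strictly below the lowest seen.
        kwargs["max_id"] = current_lowest - 1

    return statuses[:max_results]
```

With the real client this would be called along the lines of `search_older(twitter_api.search.tweets, 'obama', 1000)`, assuming that call accepts those keyword arguments as it does in the recipe.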