taspinar / twitterscraper

Scrape Twitter for Tweets
MIT License
2.39k stars, 579 forks

Inconsistent results when queries split by time #264

Open doingthingswrong opened 4 years ago

doingthingswrong commented 4 years ago

So I'm trying to pull a dataset spanning 7 years for a modest query. I first tried a single request with the standard pool size and got around 40k results. As I increased the pool size, the number of results kept increasing, topping out at about 45k with the pool size set to 150. I got curious and split the request in two by date, and the count went up yet again (I think it was 49k?).

I split the requests into smaller and smaller time periods and kept getting more results (upwards of 60k), so I figured, why not write a for loop that makes the requests in monthly increments? I set the pool size back down to 28 per the recommendation in the readme, and ended up pulling only about 37k results. Setting it to 31 gave me 39k results. For the fuck of it, I ran the for loop with the pool size set to 150 and got a bit over 42k.
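For anyone trying to reproduce this, here is a minimal sketch of a monthly loop of the kind described above. It is not the code from this issue. `month_windows` and `collect` are hypothetical helpers, and the scraper call is passed in as `fetch` so that the same loop can be run with different pool sizes. Results are keyed by tweet ID so that duplicates across windows do not inflate the count when comparing runs.

```python
import datetime as dt
from typing import Any, Callable, Dict, Hashable, Iterable, Iterator, Tuple


def month_windows(begin: dt.date, end: dt.date) -> Iterator[Tuple[dt.date, dt.date]]:
    """Yield consecutive (start, stop) pairs covering [begin, end) in calendar-month steps."""
    start = begin
    while start < end:
        # First day of the month following `start`.
        if start.month == 12:
            next_month = dt.date(start.year + 1, 1, 1)
        else:
            next_month = dt.date(start.year, start.month + 1, 1)
        stop = min(next_month, end)
        yield start, stop
        start = stop


def collect(
    fetch: Callable[[dt.date, dt.date], Iterable[Any]],
    begin: dt.date,
    end: dt.date,
    key: Callable[[Any], Hashable],
) -> Dict[Hashable, Any]:
    """Call fetch(start, stop) once per month and keep one result per key."""
    seen: Dict[Hashable, Any] = {}
    for start, stop in month_windows(begin, end):
        for item in fetch(start, stop):
            # Keep the first copy of each tweet; later duplicates are ignored.
            seen.setdefault(key(item), item)
    return seen


# Assumed usage with twitterscraper (query string, dates, and the ID
# attribute name are illustrative and may differ between versions):
#
#   from twitterscraper import query_tweets
#   tweets = collect(
#       lambda a, b: query_tweets("my query", begindate=a, enddate=b, poolsize=28),
#       dt.date(2013, 1, 1),
#       dt.date(2020, 1, 1),
#       key=lambda t: t.tweet_id,
#   )
#   print(len(tweets))
```

Comparing `len(tweets)` across runs that differ only in pool size or window length shows whether the extra results are new tweets or just duplicates.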

Does twitterscraper operate differently in a for loop than it does when the requests are hard-coded and run?

marichig commented 4 years ago

Did you find an optimal time period for sampling? Did you try sampling by weeks instead of months? Sorry this doesn't answer your question, but I'm working on something similar and wanted to see.