Scraping stops though there are more tweets

minamotorin / twint

An advanced Twitter scraping & OSINT tool written in Python that doesn't use Twitter's API, allowing you to scrape a user's followers, following, Tweets and more while evading most API limitations.

MIT License

46 stars 17 forks

Scraping stops though there are more tweets #8

Open minamotorin opened 2 years ago

minamotorin commented 2 years ago

from: https://github.com/twintproject/twint/issues/462#issue-461236891

Twint stops after about 769 tweets on August 10, 2018 (22:41:16 UTC). When I do the search on Twitter itself, Twitter also stops listing new tweets at that point. However, when I scroll all the way to the top of the search results on Twitter and then all the way back down to the bottom, Twitter starts to provide additional results. It seems like requesting the same results led Twitter to provide more.

I've confirmed the same behavior.

minamotorin commented 2 years ago

The stopping point seems to always be the same. As a workaround, you can continue scraping by using --until.
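A minimal sketch of that workaround, not Twint's own code: rerun the search with the cutoff moved back to the oldest tweet collected so far, until a run yields nothing new. The `fetch` callable is hypothetical and stands in for one scraper run with a given --until value; the tweet shape (`id`, `created_at`) and the assumption that the cutoff is exclusive are also illustrative.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List

# Hypothetical minimal tweet shape: {"id": int, "created_at": datetime}
Tweet = Dict[str, object]


def scrape_with_resume(
    fetch: Callable[[datetime], List[Tweet]],
    start_until: datetime,
    max_rounds: int = 50,
) -> List[Tweet]:
    """Call fetch(until) repeatedly, moving `until` back after every run."""
    seen: Dict[object, Tweet] = {}
    until = start_until
    for _ in range(max_rounds):
        batch = fetch(until)
        new = [t for t in batch if t["id"] not in seen]
        if not new:
            # Nothing new came back, so either we reached the end or we are stuck.
            break
        for t in new:
            seen[t["id"]] = t
        # Resume just after the oldest tweet seen in this run. The extra second
        # assumes an exclusive cutoff, so tweets sharing that timestamp are kept.
        until = min(t["created_at"] for t in new) + timedelta(seconds=1)
    # Newest first, like the search results.
    return sorted(seen.values(), key=lambda t: t["created_at"], reverse=True)
```

The overlap of one tweet between consecutive runs is deliberate: duplicates are dropped by `id`, and the loop ends when a run returns only tweets that were already collected.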

Tortar commented 2 years ago

I found that this part of the code in url.py

    if "win" in platform:
        return f'\"{date.split()[0]}\"'

sometimes makes the scraper skip some tweets when using --until "%y-%m-%d %H:%M:%S" on Windows. It starts from some hours before the specified one. Removing these lines seems to achieve better results.
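To illustrate why those lines can shift the boundary, here is a simplified reproduction of the quoted branch (only that one expression, not the full function in url.py). The helper name `date_only_quoted` is made up for this example. It shows that the time part of the --until value is dropped, which would be consistent with the search starting earlier than the requested time.

```python
def date_only_quoted(date: str) -> str:
    # Same expression as the quoted branch: keep only the text before the
    # first whitespace (the date), then wrap it in double quotes.
    return f'"{date.split()[0]}"'


until = "2019-09-24 12:30:00"
print(date_only_quoted(until))  # "2019-09-24"
```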

minamotorin commented 2 years ago

@Tortar Your comment has nothing to do with this issue. I opened a new issue (#12), so please discuss it there.

This issue is about Twitter search still having results that are not displayed. --until doesn't matter: when there are a lot of results, twint -s keyword stops partway through even though more results remain.

Tortar commented 2 years ago

Okay, no problem. Just to add something to the discussion: the further back in time you go, the lower the number of results before the scraper stops, on the order of a day of results.

minamotorin commented 2 years ago

The number of results does not seem to change. Twint just stops suddenly. An example is shown below (I'm not sure if it is the same in other environments).

twint -s twint --until 2019-09-24 # Twint shows 20 results and stops
twint -s twint --until 2019-09-22 --limit 10 # Twint shows more results

Tortar commented 2 years ago

I meant to say that (for the keywords I analyzed) the further back in the past the --until date is, the more points there are where the scraper stops.
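If the scraper tends to stop after roughly a day of results, one thing to try is splitting the range into one-day windows and running one search per window. This is only a sketch under that assumption. The helper `day_windows` is made up for illustration, and the date range and keyword are taken from the example commands earlier in the thread.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def day_windows(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield (since, until) pairs covering start..end one day at a time, newest first."""
    cur = end
    while cur > start:
        prev = cur - timedelta(days=1)
        yield prev, cur
        cur = prev


# Print one command per window instead of a single long-running search.
for since, until in day_windows(date(2019, 9, 22), date(2019, 9, 24)):
    print(f"twint -s twint --since {since} --until {until}")
```

For very busy keywords a day may still be too long, in which case the same idea applies with a smaller step.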

minamotorin commented 2 years ago

@Tortar Oh, I didn't know about that behavior. Thanks for reporting it.

batmanscode commented 1 year ago

I'm having issues with the volume of tweets scraped as well. I am using both Since and Until.

For example, a search for @username returns zero results, but the same search in the app returns plenty. Similarly, various keyword searches return far fewer results than expected. From some reading, the official API is able to get a lot more data.

It seems it's to do with Twitter not showing all tweets in a browser session. See this comment: https://github.com/JustAnotherArchivist/snscrape/issues/574#issuecomment-1287069321

Does anyone have any workarounds for this? Or is it just how it is?

Ramizworking commented 1 year ago

Hey guys @minamotorin just sent me here, still same issue.

@batmanscode did you find any way to fix this boss ?

batmanscode commented 1 year ago

There's no technical way to fix it, but a workaround is to run more frequent scrapes. For example, if you want to scrape 7 days of tweets, run the scraper on each of days 1 through 7, then concatenate the results and remove duplicates.
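The concatenate-and-deduplicate step could look roughly like this. It is a sketch that assumes each run has been loaded as a list of dicts and that every tweet carries a unique `id` field. The function name `merge_runs` and the sample data are made up for illustration.

```python
from typing import Dict, Iterable, List


def merge_runs(runs: Iterable[List[Dict]], key: str = "id") -> List[Dict]:
    """Concatenate several scrape runs, keeping the first copy of each tweet."""
    merged: Dict[object, Dict] = {}
    for run in runs:
        for tweet in run:
            # setdefault keeps the first occurrence and ignores later duplicates.
            merged.setdefault(tweet[key], tweet)
    return list(merged.values())


day1 = [{"id": 1, "tweet": "a"}, {"id": 2, "tweet": "b"}]
day2 = [{"id": 2, "tweet": "b"}, {"id": 3, "tweet": "c"}]
combined = merge_runs([day1, day2])
print([t["id"] for t in combined])  # [1, 2, 3]
```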