vladkens / twscrape

2024! X / Twitter API scraper with authorization support. Allows you to scrape search results, user profiles (followers/following), tweets (favoriters/retweeters), and more.
https://pypi.org/project/twscrape/
MIT License
793 stars 104 forks

api.search limit? #109

Open · washednico opened this issue 5 months ago

washednico commented 5 months ago

Hello,

I'm currently running a scraper that needs to download every tweet containing a particular cashtag over a one-month period. However, if I run the following code:

q = f"${ticker} since:{start_date} until:{end_date}" async for tweet in api.search(q): print(tweet.date)

and use a one-month range, it finds roughly 1.3k tweets, covering only the first two days of the date range, and then stops. I'm sure, however, that there are many more tweets for each day of the month under consideration. What could it be?

vladkens commented 4 months ago

Hi, @washednico.

Hard to say why you get only 1.3k tweets. You can use finer granularity (smaller date windows) to achieve better results.

from datetime import datetime, timedelta

def iterate_dates(since_date: str, until_date: str):
    # Yield consecutive one-day (since, until) windows covering the range.
    dt = datetime.fromisoformat(since_date)
    ed = datetime.fromisoformat(until_date)
    while dt < ed:
        nd = dt + timedelta(days=1)
        yield dt.date(), nd.date()
        dt = nd

async def get_ticker_tweets(ticker: str, since_date: str, until_date: str):
    # `api` is the twscrape API instance from your snippet above;
    # one search per daily window keeps each query small.
    for since, until in iterate_dates(since_date, until_date):
        q = f"${ticker} since:{since} until:{until}"
        async for tweet in api.search(q):
            yield tweet

# then use like
async for tweet in get_ticker_tweets("AAPL", "2024-01-01", "2024-01-10"):
    print(tweet.date)
washednico commented 4 months ago

I believe it's a problem with the Twitter endpoints, since even if I split the search into intervals of 3-4 days, some days don't return any tweets at all. That doesn't make any sense, since the daily average is 500+.
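
In case it helps with debugging, here is a minimal diagnostic sketch (not from the maintainer; `count_per_day` is just an illustrative name) that reuses the `api` object and the `iterate_dates` helper above to count how many tweets each daily window returns, so the empty days stand out:

from collections import Counter

async def count_per_day(ticker: str, since_date: str, until_date: str) -> Counter:
    # Count results per daily window; days that return nothing show up as 0.
    counts = Counter()
    for since, until in iterate_dates(since_date, until_date):
        counts[str(since)] = 0  # record the day even if no tweets come back
        q = f"${ticker} since:{since} until:{until}"
        async for _ in api.search(q):
            counts[str(since)] += 1
    return counts

# then, inside an async context:
counts = await count_per_day("AAPL", "2024-01-01", "2024-02-01")
for day, n in sorted(counts.items()):
    print(day, n)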

ritikkumarsahu commented 3 months ago

Have you found any solution? I am also facing the same issue; it is not even grabbing 50% of the tweets.

washednico commented 3 months ago

Unfortunately not, since I believe the problem lies with Twitter's endpoint: even searching manually sometimes doesn't return any results, which doesn't make sense. I've tried splitting the date range into days, weeks, and months, but nothing changed. I've also tried randomising the day-range searches, but I've never been able to scrape some specific days that simply don't work.
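
For anyone landing here later, this is a rough sketch of the randomised retry approach described above (illustrative only; `collect_with_retries` is not part of twscrape). It reuses the `api` object and the `iterate_dates` helper from earlier in the thread, making several passes over the daily windows in random order and only re-querying days that came back empty:

import asyncio
import random

async def collect_with_retries(ticker: str, since_date: str, until_date: str, passes: int = 3):
    # Collect tweets per day, re-trying empty days in random order on later passes.
    days = list(iterate_dates(since_date, until_date))
    results = {}
    for _ in range(passes):
        random.shuffle(days)  # randomise the order of the day-range searches
        for since, until in days:
            if results.get(since):
                continue  # this day already returned tweets, skip it
            q = f"${ticker} since:{since} until:{until}"
            results[since] = [t async for t in api.search(q)]
        await asyncio.sleep(5)  # short pause between passes
    return results

Days that stay empty after all passes are the ones the endpoint itself refuses to return, which matches what I'm seeing when searching manually.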