vladkens / twscrape

2024! X / Twitter API scraper with authorization support. Allows you to scrape search results, user profiles (followers/following), tweets (favoriters/retweeters) and more.
https://pypi.org/project/twscrape/
MIT License

Wrapping API calls in `contextlib.aclosing` and `break`ing out of the generator loop doesn’t work as expected #158

Closed · andylolz closed this 7 months ago

andylolz commented 7 months ago

Many thanks for this project! It’s great.

The advice provided in the README for breaking out of the generator loop is very helpful. I followed the sample code provided in https://github.com/vladkens/twscrape/issues/27#issuecomment-1623395424. The simple examples work as expected, but the sample code that calls twscrape doesn’t.

Here’s my code:

from contextlib import aclosing
from twscrape import API
import asyncio

async def main():
    user_id = 68828618
    recent_tweet_id = 1620452398706884608

    api = API()
    await api.pool.login_all()

    async with aclosing(api.user_tweets(user_id)) as gen:
        async for tweet in gen:
            if tweet.id < recent_tweet_id:
                break

    print("This should happen second.")

if __name__ == "__main__":
    asyncio.run(main())

I also added print("This should happen first.") just here: https://github.com/vladkens/twscrape/blob/00a8e07b43c1fbbea95566cc3aae95db76cd4ae3/twscrape/queue_client.py#L80-L81

But when I run it, I get the following output:

This should happen second.
This should happen first.

`async with aclosing` is not ensuring that the lock is released before execution continues past the block.
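For comparison, here's a minimal example that doesn't involve twscrape at all (`numbers` is just a stand-in async generator, with cleanup in a `finally` block playing the role of releasing the lock). This prints the two lines in the expected order, so `aclosing` itself seems to behave correctly:

from contextlib import aclosing
import asyncio

async def numbers():
    try:
        for i in range(10):
            yield i
    finally:
        # cleanup, analogous to releasing the lock
        print("This should happen first.")

async def main():
    async with aclosing(numbers()) as gen:
        async for n in gen:
            if n > 2:
                break

    print("This should happen second.")

if __name__ == "__main__":
    asyncio.run(main())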

Any advice you can provide would be appreciated!

andylolz commented 7 months ago

Okay, I think I have made progress. This queue_client test is very helpful: https://github.com/vladkens/twscrape/blob/00a8e07b43c1fbbea95566cc3aae95db76cd4ae3/tests/test_queue_client.py#L123-L161

I can modify my example to be more similar to that test, and then it works as expected:

from contextlib import aclosing
from twscrape import API
from twscrape.api import OP_UserTweets
from twscrape.models import parse_tweets
import asyncio

async def main():
    user_id = 68828618
    recentish_tweet_id = 1018137538865909765

    api = API()
    await api.pool.login_all()

    op = OP_UserTweets
    kv = {
        "userId": str(user_id),
        "count": 40,
        "includePromotedContent": True,
        "withQuickPromoteEligibilityTweetFields": True,
        "withVoice": True,
        "withV2Timeline": True,
    }
    limit = 1_000

    done = False
    async with aclosing(api._gql_items(op, kv, limit=limit)) as wrapped_gen:
        async for tweets in wrapped_gen:
            for tweet in parse_tweets(tweets):
                if tweet.id < recentish_tweet_id:
                    done = True
                    break
            if done:
                break

    print("This should happen second.")

if __name__ == "__main__":
    asyncio.run(main())

This outputs:

This should happen first.
This should happen second.

🎉 🎉 🎉

This is great news! I assume it works because `aclosing` now closes the generator that actually holds the lock, rather than an outer generator that re-yields from it. But I'd rather use the twscrape API than call the internals of twscrape.

Would you be interested in a pull request that adds more `async with aclosing` calls to the code?
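Something along these lines, perhaps (just a sketch, assuming `user_tweets` more or less delegates to `_gql_items`; I haven't checked the real implementation, and I'm leaving out the tweet parsing and limit handling):

from contextlib import aclosing

# Hypothetical sketch of a public API method wrapping its inner generator --
# not the actual twscrape source.
async def user_tweets(self, uid: int, limit=-1, kv=None):
    op = OP_UserTweets
    kv = {"userId": str(uid), "count": 40, **(kv or {})}
    async with aclosing(self._gql_items(op, kv, limit=limit)) as gen:
        async for rep in gen:
            yield rep  # parse and yield tweets here

If I understand the problem correctly, this way when a caller breaks out of `user_tweets`, the inner `_gql_items` generator (and whatever lock it holds) is closed immediately, rather than whenever it happens to be garbage collected.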