vladkens / twscrape

2024! X / Twitter API scraper with authorization support. Allows you to scrape search results, user profiles (followers/following), tweets (favoriters/retweeters) and more.
https://pypi.org/project/twscrape/
MIT License
793 stars 104 forks

Retrieve only tweets with media discarding retweets #108

Closed lthrn closed 2 months ago

lthrn commented 5 months ago

Is there a method to retrieve only tweets with embedded media that are also original tweets (not retweets)?

I know how to filter out retweets from the results, but the problem arises when an account has, e.g., 1000 original tweets with media but 100000 tweets/retweets in total. With the current approximate limit (3200), most of the results are tweets without media or retweets.

I was also thinking about pagination to get like 1000 tweets per page or something to "overcome" the limit and filter out all tweets, but I couldn't find any method or parameter to do this.

Is this even possible? I know the twitter web UI can do it when entering the media tab on each user profile page (twitter.com/USERNAME/media), but maybe the API cannot do that.

Thanks a lot!

vladkens commented 5 months ago

Hi, @lthrn. The Media tab is something new in Twitter. I need to investigate the API for it. For now you can use the search API, like:

    api.search('from:@elonmusk filter:media')

You can find more about search filters here: https://github.com/igorbrigadir/twitter-advanced-search
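A minimal sketch of the search-based approach suggested above. The helper names `build_media_query` and `fetch_media_tweets` are made up for illustration and are not part of twscrape; the sketch also assumes that accounts have already been added to the default account pool and logged in.

    from typing import List


    def build_media_query(username: str, *extra_filters: str) -> str:
        """Build a search query for one account's tweets that contain media.

        Hypothetical helper: `from:` and `filter:media` are the search
        operators used in the comment above; any extra operators are passed
        through unchanged.
        """
        handle = username.lstrip("@")  # accept "@name" as well as "name"
        return " ".join([f"from:{handle}", "filter:media", *extra_filters])


    async def fetch_media_tweets(username: str, limit: int = 100) -> List[object]:
        """Run the query through twscrape's search API.

        twscrape is imported lazily so the query helper can be used and
        tested without it. Assumes the default account pool is populated.
        """
        from twscrape import API, gather

        api = API()
        return await gather(api.search(build_media_query(username), limit=limit))

For example, `build_media_query("@elonmusk")` produces `from:elonmusk filter:media`, and further operators from the advanced-search list can be appended as extra arguments.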

lthrn commented 5 months ago

Hi, @vladkens. For now I'll try the Search API as you suggested. Thanks a lot for your kind help.

lthrn commented 5 months ago

Hi again, @vladkens.

I tried using the Search API. Unfortunately, the number of tweets retrieved is even lower than what I get by fetching all tweets and filtering them. This may be because the API is returning only the "Top" tweets from the account.

I read the Twitter advanced search link you suggested (thanks, by the way), but I couldn't find any parameter or set of parameters to retrieve all tweets without the "Top" restriction.

Do you have any suggestions?

Thanks a lot again!

vladkens commented 2 months ago

@lthrn from readme:

    # change search tab (product), can be: Top, Latest (default), Media
    await gather(api.search("elon musk", limit=20, kv={"product": "Top"}))

Latest is used by default.
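Applied to the original question, the same `kv` parameter can select the Media tab, and retweets can then be dropped on the client side. The following is only a sketch: `collect_original_media` and `fake_search` are hypothetical names, and the check relies on the assumption that twscrape's tweet objects expose a `retweetedTweet` attribute that is `None` for original tweets. The search function is passed in as an argument, so `api.search` can be used for real requests and a stub for offline testing.

    import asyncio
    from types import SimpleNamespace


    async def collect_original_media(search, query, limit=100):
        """Collect non-retweet results from the Media search tab.

        `search` is any callable with the same shape as `api.search`:
        it takes (query, limit=..., kv=...) and returns an async iterator.
        """
        originals = []
        async for tweet in search(query, limit=limit, kv={"product": "Media"}):
            # Assumption: retweets carry the original in `retweetedTweet`.
            if getattr(tweet, "retweetedTweet", None) is None:
                originals.append(tweet)
        return originals


    async def fake_search(query, limit=-1, kv=None):
        """Offline stand-in for `api.search`; yields two originals and one retweet."""
        yield SimpleNamespace(id=1, retweetedTweet=None)
        yield SimpleNamespace(id=2, retweetedTweet=SimpleNamespace(id=99))
        yield SimpleNamespace(id=3, retweetedTweet=None)


    def demo_ids():
        """Run the collector against the stub and return the kept tweet ids."""
        kept = asyncio.run(collect_original_media(fake_search, "from:elonmusk"))
        return [t.id for t in kept]

With a configured `API` instance, the real call would be `await collect_original_media(api.search, "from:elonmusk", limit=1000)`.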

lthrn commented 2 months ago

@vladkens Thanks a lot for the update.

I would like to know how to pass the "Media" value using the CLI search command. I tried passing kv as an argument, but it is rejected as invalid: unrecognized arguments: --kv={product: Media}.

Thanks again.
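Whether the CLI accepts such an option is not answered in this thread. One possible workaround, sketched here under assumptions, is a small wrapper script around the Python API shown in the readme excerpt above. The `--product` and `--limit` flags belong to this hypothetical script, not to the twscrape CLI, and the script assumes accounts are already set up in the default pool and that tweet objects expose `id` and `url`.

    import argparse

    # Search tabs listed in the readme excerpt quoted in this thread.
    PRODUCTS = ("Top", "Latest", "Media")


    def parse_args(argv=None):
        """Parse the wrapper script's own (hypothetical) command-line flags."""
        parser = argparse.ArgumentParser(description="Search with a chosen tab")
        parser.add_argument("query", help="search query, e.g. 'from:elonmusk'")
        parser.add_argument("--product", choices=PRODUCTS, default="Latest")
        parser.add_argument("--limit", type=int, default=20)
        return parser.parse_args(argv)


    async def run(args):
        """Forward the parsed flags to twscrape's search via the kv parameter."""
        from twscrape import API  # lazy import; needs twscrape installed

        api = API()
        search = api.search(args.query, limit=args.limit, kv={"product": args.product})
        async for tweet in search:
            print(tweet.id, tweet.url)

To use it as a script, call `asyncio.run(run(parse_args()))` at the bottom of the file and invoke it with, for example, `"from:elonmusk" --product Media`.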