vladkens / twscrape

2024! X / Twitter API scraper with authorization support. Allows you to scrape search results, users' profiles (followers/following), tweets (favoriters/retweeters) and more.
https://pypi.org/project/twscrape/
MIT License
1.12k stars, 133 forks

Can this download actual media files? #216

Open billbeans opened 2 months ago

billbeans commented 2 months ago

Maybe I'm a bit confused about what this software does, but can it actually grab a user's uploaded media (jpg, mp4) from their tweets and download them? I ran user_media on a profile, and I just got a bunch of stdout in my terminal. I saved that output to a text file and had a hell of a time grepping the links out of it to make wget work, and even then, it didn't grab all of the media from the profile I wanted scraped

vladkens commented 1 month ago

@billbeans user_media is an API call to Twitter that returns a list of media: links to photos and videos. That is why you see so much output in the terminal.

twscrape does not download the media files themselves right now, because nobody has requested it before.

For now, you can download media with this simple script:

import asyncio
import os

import httpx

from twscrape import API

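async def download_file(client: httpx.AsyncClient, url: str, outdir: str):
    # Use the last path segment of the URL (without the query string) as the file name
    filename = url.split("/")[-1].split("?")[0]
    outpath = os.path.join(outdir, filename)

    # Stream the response to disk in chunks instead of loading it all into memory
    async with client.stream("GET", url) as resp:
        if resp.is_error:
            # Skip 4xx/5xx responses so an error body is not saved as a media file
            print(f"skipping {url}: HTTP {resp.status_code}")
            return
        with open(outpath, "wb") as f:
            async for chunk in resp.aiter_bytes():
                f.write(chunk)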

async def load_user_media(api: API, user_id: int, outdir: str):
    os.makedirs(outdir, exist_ok=True)
    all_photos = []
    all_videos = []

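    # Collect all media URLs first; each doc is a tweet returned by user_media
    async for doc in api.user_media(user_id):
        all_photos.extend([x.url for x in doc.media.photos])
        for video in doc.media.videos:
            # A video comes in several variants; keep the highest-bitrate one
            variant = max(video.variants, key=lambda x: x.bitrate)
            all_videos.append(variant.url)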

    async with httpx.AsyncClient() as client:
        await asyncio.gather(
            *[download_file(client, url, outdir) for url in all_photos],
            *[download_file(client, url, outdir) for url in all_videos],
        )

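async def main():
    # API() uses the default accounts database; accounts must already be added to it
    api = API()
    # 2244994945 is the numeric user id to scrape; files are saved into "output"
    await load_user_media(api, 2244994945, "output")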

if __name__ == "__main__":
    asyncio.run(main())
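
The script above starts every download at once through a single asyncio.gather call, which may be one reason some files go missing on large profiles. Below is a minimal sketch of one way to cap the number of simultaneous downloads with asyncio.Semaphore. The names run_limited and filename_from_url are hypothetical helpers written for this illustration and are not part of twscrape or httpx; the default limit of 5 is an arbitrary assumption.

```python
import asyncio
import posixpath
from typing import Any, Awaitable, Callable, Iterable, List
from urllib.parse import urlsplit


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring query string and fragment."""
    return posixpath.basename(urlsplit(url).path)


async def run_limited(
    urls: Iterable[str],
    worker: Callable[[str], Awaitable[Any]],
    limit: int = 5,  # arbitrary default, tune for your connection
) -> List[Any]:
    """Call worker(url) for every URL with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(url: str) -> Any:
        # Each task waits here until one of the `limit` slots is free
        async with sem:
            return await worker(url)

    # return_exceptions=True: a failed URL yields its exception in the result
    # list instead of aborting the whole batch
    return await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)
```

In load_user_media, the gather call could then be swapped for something like `await run_limited(all_photos + all_videos, lambda url: download_file(client, url, outdir))`, and any exceptions in the returned list show which URLs failed.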