imperatrona / twitter-scraper

Scrape the Twitter frontend API without authentication, using Golang.
MIT License

Getting all tweets / replies #19

Open thewh1teagle opened 2 days ago

thewh1teagle commented 2 days ago

Hey!

Thanks for creating such a great library!

I'm trying to retrieve all of my tweets and replies (I have thousands), but I couldn't find any mention of pagination to fetch beyond the maximum limit. Does the library support this feature?

Also, I don't see an option to get my own username or user ID after authentication. Could you clarify how to achieve that?

imperatrona commented 1 day ago

hi @thewh1teagle! saw you contributed UserTweetsAndReplies, thank you so much! i'll add tests and merge it tomorrow.

i suppose you already found out how to paginate.

this library hasn't yet implemented getting the current user from the cookie. you can get screen_name / user_id by making a get request to https://api.twitter.com/1.1/account/multi/list.json with an empty body.

though if your cookie has multiple accounts logged in (has the auth_multi cookie), this will return data for all accounts without flagging which one is currently active. to get the currently active screen_name you can make a get request to https://api.twitter.com/1.1/account/settings.json
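
a rough go sketch of that second request, if it helps (assumptions: the ct0 cookie value doubles as the x-csrf-token header, and the bearer token is the public one the twitter web client uses):

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// fetch the currently active account's screen_name from
// account/settings.json using an existing session.
// authToken and csrfToken are the auth_token and ct0 cookie values.
func activeScreenName(authToken, csrfToken string) (string, error) {
    req, err := http.NewRequest("GET", "https://api.twitter.com/1.1/account/settings.json", nil)
    if err != nil {
        return "", err
    }
    // assumption: the public bearer token shipped with the twitter web client
    req.Header.Set("Authorization", "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA")
    req.Header.Set("X-Csrf-Token", csrfToken)
    req.AddCookie(&http.Cookie{Name: "auth_token", Value: authToken})
    req.AddCookie(&http.Cookie{Name: "ct0", Value: csrfToken})

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("settings.json returned %s", resp.Status)
    }

    var settings struct {
        ScreenName string `json:"screen_name"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&settings); err != nil {
        return "", err
    }
    return settings.ScreenName, nil
}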

thewh1teagle commented 10 hours ago

Thanks, now we have account endpoints for getting screen_name :)

I tried to use FetchTweetsAndRepliesByUserID, iterating and sleeping 10 seconds between each iteration, but got this error:

panic: response status 429 Too Many Requests: Rate limit exceeded

import (
    "log"
    "time"

    twitterscraper "github.com/imperatrona/twitter-scraper"
    // auth and storage are small local helper packages from my project
    // (their import paths are omitted here).
)

const sleepBetweenRequests = 10 * time.Second

func run() {
    creds, err := auth.GetCredentials()
    if err != nil {
        panic(err)
    }
    scraper := twitterscraper.New()
    authToken := twitterscraper.AuthToken{Token: creds.AuthToken, CSRFToken: creds.Ct0}
    scraper.SetAuthToken(authToken)

    if !scraper.IsLoggedIn() {
        panic("Invalid AuthToken")
    }
    settings, err := scraper.GetAccountSettings()
    if err != nil {
        panic(err)
    }
    log.Println("Logged in as: ", settings.ScreenName)

    userId, err := scraper.GetUserIDByScreenName(settings.ScreenName)
    if err != nil {
        panic(err)
    }

    // Load the saved pagination cursor for posts, if any
    cursorPosts, err := storage.LoadCursor(".cursor_posts")
    if err != nil {
        log.Println("No cursor file found for posts, starting from the beginning.")
        cursorPosts = ""
    }
    log.Println("Current cursor:", cursorPosts)

    // Counter for the number of fetched tweets
    fetchedCount := 0

    // First loop to fetch and save posts
    for {
        // Fetch tweets using the cursor
        tweets, newCursorPosts, err := scraper.FetchTweetsAndRepliesByUserID(userId, 20, cursorPosts)
        if err != nil {
            panic(err)
        }

        // If no new tweets are fetched, exit the loop
        if len(tweets) == 0 {
            log.Println("No new posts found. Exiting...")
            break
        }

        // Increment the fetched count by the number of newly fetched tweets
        fetchedCount += len(tweets)
        log.Printf("Fetched %d new tweets. Total fetched: %d\n", len(tweets), fetchedCount)

        // Save the new cursor state for posts
        if err := storage.SaveCursor(".cursor_posts", newCursorPosts); err != nil {
            panic(err)
        }

        // Save each tweet in JSONL format
        if err := storage.SaveTweetJSONL("posts.jsonl", tweets); err != nil {
            panic(err)
        }

        // Update cursor for the next iteration
        cursorPosts = newCursorPosts

        // Optional: Delay to avoid hitting rate limits
        time.Sleep(sleepBetweenRequests)
    }

    log.Printf("Total tweets fetched: %d\n", fetchedCount)
}

The default sleepBetweenRequests is 10*time.Second.

Did I make the requests too quickly? How many tweets does each request return by default? I noticed that in the Twitter UI each scroll loads 20.

imperatrona commented 10 hours ago

@thewh1teagle i was doing the same task recently and a 15 second delay was enough. each request usually returns 20 tweets, but can sometimes return anywhere from 15 to 90. this lib implements the method scraper.WithDelay(15), which you can use instead of your sleepBetweenRequests
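
for example (authToken / csrfToken are your cookie values, as in your snippet above):

scraper := twitterscraper.New()
scraper.SetAuthToken(twitterscraper.AuthToken{Token: authToken, CSRFToken: csrfToken})
scraper.WithDelay(15) // seconds to wait between requests, handled inside the library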

thewh1teagle commented 1 hour ago

@imperatrona

Thanks! Good to hear that you did it recently. Note that I'm using the new endpoint from https://github.com/imperatrona/twitter-scraper/pull/20. It's almost the same as GetTweets, except that it returns basically everything the user posted: tweets, replies, reposts, quotes, etc. I changed it to use WithDelay instead of sleeping and increased the timeout. I'll check later; hope it will work without this error.
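
If the 429 still shows up even with WithDelay, I'll probably wrap each page fetch in a simple retry with exponential backoff. A generic sketch (not a library feature; matching on the error string is an assumption based on the panic message above):

import (
    "strings"
    "time"
)

// withBackoff retries fn up to maxRetries times, doubling the wait
// after every rate-limit failure. fn performs one paginated fetch.
func withBackoff(maxRetries int, fn func() error) error {
    wait := 15 * time.Second
    var err error
    for i := 0; i < maxRetries; i++ {
        if err = fn(); err == nil {
            return nil
        }
        if !strings.Contains(err.Error(), "429") {
            return err // not a rate limit, fail fast
        }
        time.Sleep(wait)
        wait *= 2
    }
    return err
}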