mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Question regarding Twitter scraping. #4968

Open Azuriye opened 11 months ago

Azuriye commented 11 months ago

I need someone to clarify what can be scraped on Twitter.

Currently, I'm following users on Twitter and adding them to my private list, but I don't know which approach is best if I want to scrape every account I follow starting from now rather than from their very first tweet.

Should I scrape my private list, scrape every user I follow, or scrape my Twitter homepage? I'm not sure which approach is best when my main goal is to always capture whatever those users post, and possibly their retweets of other accounts' posts.

Vladimir-russian commented 11 months ago

I guess you have to scrape them one by one.

mikf commented 11 months ago

To get the most Tweets initially, you should have gallery-dl go through each user profile URL one by one. You can use your account's following URL or your list's members URL for that.

To update your collection, scraping from your list should be fastest, but again going through each followed user's profile would also work. Scraping your homepage is not supported, I think. Don't forget to use -A/--abort when updating to stop early.
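
For example, something along these lines (an untested sketch; replace the username, list ID, and the -A threshold with your own values):

# Initial run: queue every profile you follow, or every member of your list, one by one
gallery-dl "https://twitter.com/your_username/following"
gallery-dl "https://twitter.com/i/lists/{random_list_id}/members"

# Later updates: scrape the list itself and stop after 5 already-downloaded files in a row are skipped
gallery-dl -A 5 "https://twitter.com/i/lists/{random_list_id}"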

(There is no (known) way to get all Tweets. The older a Tweet is, the less likely it is to turn up in search, which is the only way to access older Tweets.)

Azuriye commented 11 months ago

Just as a precaution: if Twitter bans my account, my entire private list goes down with it. To account for that, would it be better to scrape the list's members every X hours using the -g argument and save them to a text file? That's what I'm doing currently, but I'm not sure whether I could use that file to recover my private list if it were taken down and resume scraping as normal.
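
Roughly, the backup step looks like this (a sketch with placeholder file names; it assumes -g on the list's members URL prints one profile URL per line):

# Every X hours: dump the list's member profile URLs to a text file as a backup
# (Set-Content -Encoding Ascii avoids the UTF-16 output that ">" produces in Windows PowerShell 5.1)
gallery-dl -g "https://twitter.com/i/lists/{random_list_id}/members" | Set-Content -Encoding Ascii members.txt

# If the list or account is ever lost, feed the saved URLs back in and keep scraping those users
gallery-dl -i members.txt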

Secondly, this is my PowerShell script which I run every 6 hours using Task Scheduler.

# Set the Twitter list URL
$listUrl = "https://twitter.com/i/lists/{random_list_id}"

# Define the Start-Process parameters
$processParams = @{
    FilePath     = "gallery-dl"
    ArgumentList = "--range 1-200", $listUrl
    Wait         = $true
    WindowStyle  = "Hidden"
}

# Start the process with the specified parameters
Start-Process @processParams

Is this good enough to retrieve the latest 200 results from my list? The goal is to scrape everything new from the time the script runs. I didn't specify -A or --abort anywhere; is there a reason to use that parameter?