pystardust / ytfzf

A posix script to find and watch youtube videos from the terminal. (Without API)
GNU General Public License v3.0

Scraping youtube subscriptions is taking too long #251

Open Michal-Szczepaniak opened 3 years ago

Michal-Szczepaniak commented 3 years ago

The problem appears when you have not tens of subscriptions but a thousand or more: ytfzf uses a lot of resources to scrape them. You already keep the subscriptions in a file after scraping them, which is good, but then you erase the entire cache and scrape everything again when the user just wants to see videos from their subscriptions again. Why not update the cache instead? Go from the latest video to the oldest and stop when a video is already in the cache. That would save time and resources.
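A minimal sketch of the "update instead of overwrite" idea, not how ytfzf stores its data. It assumes a hypothetical cache format of one video per line, `<video_id><TAB><title>`, newest first, and a freshly scraped listing in the same format on stdin. The function names `new_entries` and `update_cache` are made up for illustration, and `mktemp` is assumed to be available even though it is not strictly POSIX.

```shell
#!/bin/sh
# Hypothetical cache format: "<video_id><TAB><title>", one video per line, newest first.

# new_entries CACHE_FILE
# Reads a freshly scraped listing (same format, newest first) on stdin and
# prints only the lines that come before the first id already in the cache.
new_entries() {
    awk -v cache="$1" '
        BEGIN {
            FS = "\t"
            # Load every cached id; a missing cache file simply loads nothing.
            while ((getline line < cache) > 0) {
                split(line, f, "\t")
                seen[f[1]] = 1
            }
            close(cache)
        }
        $1 in seen { exit }   # first known video: everything after it is older
        { print }
    '
}

# update_cache CACHE_FILE
# Prepends the new entries read from stdin to the existing cache.
update_cache() {
    cache_file=$1
    tmp=$(mktemp) || return 1
    new_entries "$cache_file" > "$tmp"
    if [ -f "$cache_file" ]; then
        cat "$cache_file" >> "$tmp"
    fi
    mv "$tmp" "$cache_file"
}
```

With something like `scrape_channel "$url" | update_cache "$cache_dir/$channel_id"` (both names hypothetical), only the part of the listing newer than the cache is written. This only avoids reprocessing known videos; the page that contains them still has to be fetched.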

Euro20179 commented 3 years ago

Checking based on upload date wouldn't work for a few reasons, the main one being that the upload date is relative.

I tried implementing a version that checks whether the video_data is the same (which works because the video id is in it), and it was EXTREMELY slow. It definitely takes longer that way than just rescraping every time, because scraping the channels one at a time takes a long time.

In addition, I don't think it's possible to implement something like this with concurrency, which would speed it up tremendously. There has to be some kind of order to check against the cache file (without an order you'd have to scrape the channel anyway and then check, instead of just breaking out of a loop), and with concurrency the order is random. Concurrency makes it faster, but concurrency is also what uses up the resources.

For each subscription, 1-2 forks are created: one to scrape the channel and another to download the thumbnails (if thumbnails are on).

Michal-Szczepaniak commented 3 years ago

Hmmm, Minitube also scrapes subscriptions, but it doesn't take as long. Maybe it would be worth looking at how they did it? In Minitube, subscriptions are updated, not overwritten.

Euro20179 commented 1 year ago

I wonder if this is better now. I changed how the SI scraper works in the development branch so that it downloads basically everything at once with curl instead of creating a million threads.
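For illustration only, here is one way a single curl process can fetch many channel pages at once. This is a sketch, not the code in the development branch. It assumes curl 7.66.0 or newer (for `--parallel`), channel URLs on stdin, one per line, containing no double quotes or backslashes, and hypothetical function names `write_curl_config` and `fetch_all`.

```shell
#!/bin/sh

# write_curl_config OUT_DIR
# Reads channel URLs on stdin and prints a curl config in which the
# Nth non-empty URL is saved to OUT_DIR/N.html.
write_curl_config() {
    out_dir=$1
    n=0
    while IFS= read -r url; do
        [ -n "$url" ] || continue   # skip blank lines
        n=$((n + 1))
        printf 'url = "%s"\noutput = "%s/%s.html"\n' "$url" "$out_dir" "$n"
    done
}

# fetch_all OUT_DIR
# Downloads every URL from stdin with one curl process, up to 20 transfers
# at a time, instead of forking one process per subscription.
fetch_all() {
    out_dir=$1
    mkdir -p "$out_dir" || return 1
    write_curl_config "$out_dir" |
        curl --silent --parallel --parallel-max 20 --config -
}
```

Something like `fetch_all "$tmp_dir" < subscriptions.txt` (file name hypothetical) would then leave one numbered file per subscription. Because the files are numbered by input line, later processing can walk them in a fixed order even though the transfers finish in arbitrary order.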

Subscriptions are still overwritten every time though.