Closed OmgImAlexis closed 3 years ago
While this would be nice, I don't think it's possible. If it is possible, it would require a lot more hacking into `youtube-dl` internals, which I'd like to avoid so that updating libraries etc. stays trivial. `youtube-dl`'s `extract_info()` with `extract_flat=True` just returns all video IDs on a channel or playlist. From a quick check, there's no way to know the order of videos on a playlist or channel, so you can't skip crawling "page 22" just because "page 21" already contains videos older than the download cap. You need to index every video on a playlist or channel to find potentially new videos, and then check every video's upload date against any age cap that is set. The message could arguably be logged at debug level rather than info level, though, so that it can be suppressed that way.
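To make the indexing step above concrete, here is a minimal sketch, assuming the `youtube_dl` Python package is installed. The helper names `flat_entries`, `find_new_ids` and `within_cap` are hypothetical and are not TubeSync functions. It shows why the whole listing has to be crawled: the flat entries are only compared by ID, and the upload date (a `YYYYMMDD` string in youtube-dl metadata) is checked against the cap afterwards.

```python
from datetime import date, datetime, timedelta


def flat_entries(url):
    """Return the flat entry dicts for a channel or playlist URL."""
    # Imported lazily so the pure helpers below work without youtube-dl installed.
    import youtube_dl

    opts = {"extract_flat": True, "quiet": True}
    with youtube_dl.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=False)
    return list(info.get("entries") or [])


def find_new_ids(entries, known_ids):
    """IDs present in the crawl but not yet stored, in crawl order."""
    seen = set(known_ids)
    new_ids = []
    for entry in entries:
        video_id = entry.get("id")
        if video_id and video_id not in seen:
            seen.add(video_id)
            new_ids.append(video_id)
    return new_ids


def within_cap(upload_date, cap_days, today=None):
    """True if a YYYYMMDD upload date is no older than cap_days."""
    today = today or date.today()
    uploaded = datetime.strptime(upload_date, "%Y%m%d").date()
    return uploaded >= today - timedelta(days=cap_days)
```

Because the crawl order is not assumed to be chronological, `find_new_ids` looks at every entry rather than stopping at the first known ID.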
Yep, looks like I need to open a PR against youtube-dl.
`dateafter` shouldn't download pages outside of the range.
While that upstream change, if added, might help with the logs, it likely won't remove the requirement for TubeSync to index all YouTube video IDs in a playlist on every index run, as YouTube does not guarantee they are returned in chronological order when crawled. The initial requirement for TubeSync is to "find all new video IDs", which still means indexing entire channels and playlists. This flag, if implemented upstream in `youtube-dl`, would likely just limit what's returned by `extract_info()` rather than limit what's actually requested from YouTube. Even if some way of enforcing chronological ordering could be used for YouTube, it likely wouldn't transfer to the other sites that TubeSync will support in the future. Of course, if the `youtube-dl` devs, who admittedly have far superior knowledge of the internals of YouTube's APIs and front ends and of their own codebase than I do, find a way to make this work properly, I would implement it. For the foreseeable short term, however, you can expect TubeSync to index entire channels and playlists on every index run and compare the upload dates in order to function properly. I'll still see if changing the log severity is sensible, though, to stop annoying users who index very large channels frequently.
I hate to suggest a major refactor, but I wanted to give some ideas that might help with this problem.
Using youtube-dl to generate an index of all the videos and store them in a database to slowly download them does make sense, but when looking for new videos, TubeSync seems to re-download the entire index of videos to look for updates.
A more efficient method to look for new videos would be to use the integrated YouTube RSS feeds. They're always ordered by "published" date:
https://www.youtube.com/feeds/videos.xml?channel_id=someidhere
Adding an RSS/XML parser to the system might be a slight hassle, but it would significantly reduce the risk of YouTube getting mad at excessive page indexing.
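For illustration, a parser for that feed can be sketched with the Python standard library alone. This assumes the feed is an Atom document that carries the video ID in a `yt:videoId` element under the namespace URIs shown below; those URIs and the helper names `fetch_feed` and `parse_feed` are assumptions for this example, not something confirmed in this thread.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for the YouTube channel feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "yt": "http://www.youtube.com/xml/schemas/2015",
}

FEED_URL = "https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}"


def fetch_feed(channel_id):
    """Download the raw feed XML for a channel (network access required)."""
    from urllib.request import urlopen

    with urlopen(FEED_URL.format(channel_id=channel_id), timeout=30) as resp:
        return resp.read()


def parse_feed(xml_data):
    """Return one dict per feed entry, in the order the feed lists them."""
    root = ET.fromstring(xml_data)
    videos = []
    for entry in root.findall("atom:entry", NS):
        videos.append({
            "id": entry.findtext("yt:videoId", namespaces=NS),
            "title": entry.findtext("atom:title", namespaces=NS),
            "published": entry.findtext("atom:published", namespaces=NS),
        })
    return videos
```

A single call such as `parse_feed(fetch_feed("someidhere"))` would then cost one request per channel per check.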
Cheers for the suggestion!
I had noticed the RSS feeds, but compared to the current `youtube-dl`-based method they don't actually reduce the number of requests made to YouTube that much. After getting a list of video IDs for a channel or playlist, TubeSync still needs to make one request per video to get its metadata, and these make up the bulk of the requests to YouTube that seem to be triggering the rate limiting. For example, adding a channel with 1000 videos results in about 25 requests for indexing, then 1000 requests for metadata. Once indexed, it's "just" 25 requests per indexing interval, which is probably fine.
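The arithmetic in that example can be written out as a small sketch. The page size of 40 videos per index request is an assumption inferred from "about 25 requests" for 1000 videos, and `request_counts` is a hypothetical helper name.

```python
import math


def request_counts(video_count, videos_per_index_request=40):
    """Estimate requests for the first index and for each later refresh."""
    # One index request per page of the listing (page size is assumed).
    index_requests = math.ceil(video_count / videos_per_index_request)
    return {
        # First index: crawl the listing, then one metadata request per video.
        "initial": index_requests + video_count,
        # Later refreshes only re-crawl the listing.
        "per_refresh": index_requests,
    }
```

For a 1000-video channel this gives 1025 requests on the first index and 25 on each later refresh, so the metadata requests dominate.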
Additionally, unless I'm blind, I can't see any way to get more than the most recent 14 or so videos via RSS (there's no `?page=2` or similar parameter accepted that I can find), so while that would indeed work for easily picking up new content, it doesn't solve the initial requirement to index all media on a channel.
Also, I assume that if a channel added more than 14 videos between index runs, it would have to fall back to the current method as well. That is pretty unlikely, I guess, but no doubt someone will find a channel that does this and trigger an edge case of missing content.
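One way to detect that edge case, sketched here under the assumption that the feed only exposes a short window of recent videos: if none of the IDs in the feed are already known, the window may not reach back to the last index, so a full index is needed. `plan_refresh` is a hypothetical helper, not existing TubeSync code.

```python
def plan_refresh(feed_ids, known_ids):
    """Return (new_ids, needs_full_index) for one feed check."""
    known = set(known_ids)
    new_ids = [video_id for video_id in feed_ids if video_id not in known]
    # No overlap with stored IDs means videos may have scrolled out of the
    # feed window unseen, so the feed alone cannot be trusted.
    needs_full_index = bool(feed_ids) and len(new_ids) == len(feed_ids)
    return new_ids, needs_full_index
```

This also covers a source that has never been indexed, since its set of known IDs is empty.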
Using the feeds could shave off a few requests per day, but likely not enough to solve the problem for anyone hitting 429 rate limiting. For that, I'll probably just have to add a delay of 60 seconds or so between metadata requests to spread out the requests for newly added channels, if people keep experiencing problems.
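A delay like that could look roughly like the sketch below. `fetch_metadata_throttled` and the `fetch_one` callable are hypothetical names; the `sleep` parameter is only there so the pause can be replaced in tests.

```python
import time


def fetch_metadata_throttled(video_ids, fetch_one, delay_seconds=60, sleep=time.sleep):
    """Call fetch_one(video_id) for each ID, pausing between requests."""
    results = {}
    for position, video_id in enumerate(video_ids):
        # Pause before every request except the first one.
        if position > 0:
            sleep(delay_seconds)
        results[video_id] = fetch_one(video_id)
    return results
```

At 60 seconds per request, the metadata for a newly added 1000-video channel would take roughly 16 to 17 hours to collect, which is the trade-off for staying under the rate limit.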
I'll add it to the future roadmap as a possible feature, as using the feeds would be a nicer way to keep channels updated with new content. It won't replace anything too significant internally and it's not that much work really: just use a different indexer once a source has been indexed at least once. It wouldn't require any massive internal reworking.
I'll track the RSS feature in #73 and the log level / log spam reduction options in #74 - I'll close this for now as I don't think there's anything left to add to the original issue, but feel free to comment or re-open it if you want to add more suggestions or comments.
On initial index, if "Download cap" is set, pages should only be fetched until the cap is hit, instead of fetching every single page of videos.
I'd like to avoid seeing this over and over in my logs if possible.