Use yt-dlp instead of Youtube API

rgaudin commented 1 year ago

Youtube Data API v3 has served us relatively well for 4 years now. I think it's time to move away from it because:

Its restrictions are annoying: we have to manage those API keys, manually set the IP whitelist (this can't be done programmatically), and switch between them from time to time to respect the quota (or suffer failures).
It's artificially limited and/or buggy/different from the web UI: we've seen several cases where public content was not available via the API.

youtube-dl and the fork we use (yt-dlp) have greatly improved in 4 years. Switching to yt-dlp would have the following benefits:

Access to whatever is visible online
No API keys needed anymore
More flexible target specification (I suppose): using YT URLs
Generic: this would be a separate task, but since yt-dlp supports many platforms and methods, it would enable retrieving videos from various places… ⚠️ ZIMs are not single videos. Not sure how we can reproduce a standard experience using data from different platforms. Don't expect a turnkey feature here.

This change would require a major revamp of the scraper, but partly because it's still a filesystem-based one.

Important feature check list to test/poc first:

get list of playlists with details (name, description)
get list of videos for each playlist
get video details (author, title, description, date)
download video (already via yt-dlp)
get video thumbnails (already via yt-dlp)
get video subtitles (already via yt-dlp)
get author's metadata (name, description) and branding (banner, profile)
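
A minimal sketch of how the video details in this checklist might be read from the info dictionary that yt-dlp returns. The helper name `video_details` is hypothetical; the keys it reads (`title`, `description`, `channel`, `uploader`, `upload_date`, `thumbnail`, `subtitles`) are common yt-dlp info fields, but any of them can be missing depending on the extractor and options, so each one is read defensively.

```python
from datetime import date, datetime
from typing import Any, Optional


def video_details(info: dict[str, Any]) -> dict[str, Any]:
    """Pick the checklist fields out of a yt-dlp info dict (hypothetical helper)."""
    # yt-dlp reports upload_date as a YYYYMMDD string when it is known.
    raw_date = info.get("upload_date")
    published: Optional[date] = (
        datetime.strptime(raw_date, "%Y%m%d").date() if raw_date else None
    )
    return {
        "id": info.get("id"),
        "title": info.get("title"),
        "description": info.get("description"),
        # Prefer the channel name, fall back to the uploader name.
        "author": info.get("channel") or info.get("uploader"),
        "date": published,
        "thumbnail": info.get("thumbnail"),
        "has_subtitles": bool(info.get("subtitles")),
    }
```

The function only touches a plain dictionary, so it can be checked against saved JSON dumps without any network access.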

benoit74 commented 11 months ago

I ran a few small tests of yt-dlp. The results are very positive.

Test context : Python 3.11.4, yt-dlp 2023.10.7

list of playlists with details : Yes, many details included
list of videos for each playlist : Yes, many details included
get video details : Yes, many details included
author / channel metadata:
- title and description are available, along with a lot of other information
branding:
- banner : multiple resolutions are provided, but only of the big image used for TV resolutions according to the YouTube UI; to get the cropped version used on computers, or the even smaller one used on phones, you have to crop it yourself
- profile picture : multiple resolutions provided
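
Since only the full-size banner is returned, the desktop or phone variant has to be cut out locally. Below is a minimal sketch of that arithmetic, assuming the smaller variants are centered crops of the full image. `centered_crop_box` is a hypothetical helper, and the 2560x1440 and 2560x423 sizes mentioned afterwards are assumptions taken from commonly cited YouTube banner guidelines, not values reported by yt-dlp.

```python
def centered_crop_box(width: int, height: int, target_ratio: float) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered box
    whose width/height ratio is target_ratio (hypothetical helper)."""
    if width / height > target_ratio:
        # Image is wider than the target: keep full height, trim the sides.
        new_width = round(height * target_ratio)
        left = (width - new_width) // 2
        return (left, 0, left + new_width, height)
    # Image is taller than the target: keep full width, trim top and bottom.
    new_height = round(width / target_ratio)
    top = (height - new_height) // 2
    return (0, top, width, top + new_height)
```

For example, `centered_crop_box(2560, 1440, 2560 / 423)` yields a full-width band in the vertical middle of the image. The resulting tuple has the same shape as the box expected by Pillow's `Image.crop`, should that library be used for the actual cropping.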

Does it work for weird channel names like @Madrasa, which does not work without a channel ID?

Yes, it even found 5698 videos ...

Does it work for user ID types (the old ones, instead of channels)?

Yes, it does not make a difference (tested for DirtyBiology which is a channel and cestpassorcierofficiel which is a user)

How to extract all the information mentioned above?

I used this code:

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
  info = ydl.extract_info(url, download=False)

One can also pass "process=False" to skip retrieving details (e.g. when you want only the author info, not all of their videos)
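
A minimal sketch of the `process=False` case, assuming only the top-level channel fields are wanted. `channel_summary` is a hypothetical helper; `ydl` is expected to be a `yt_dlp.YoutubeDL` instance (or anything with a compatible `extract_info`), and the key names are the ones yt-dlp commonly reports for a channel, which may be absent in practice.

```python
from typing import Any

# Fields assumed to be present on a channel result; each is read defensively.
CHANNEL_KEYS = ("id", "title", "description", "channel", "channel_id")


def channel_summary(ydl: Any, url: str) -> dict[str, Any]:
    """Fetch top-level metadata only, leaving the video entries unprocessed."""
    # process=False returns the raw extractor result: nested entries are
    # not resolved, so the list of videos is never walked here.
    info = ydl.extract_info(url, download=False, process=False)
    return {key: info.get(key) for key in CHANNEL_KEYS}
```

It would be called as `channel_summary(ydl, url)` inside a `with yt_dlp.YoutubeDL(ydl_opts) as ydl:` block like the one above.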

You can use the following URLs:

https://www.youtube.com/watch?v=BaW_jenozKc => get details about one video
https://www.youtube.com/@PhilippHagemeister => get details about one channel / user and all its videos
https://www.youtube.com/@PhilippHagemeister/playlists => get details about all playlists, and all videos in every playlist
https://www.youtube.com/channel/UCtqICqGbPSbTN09K1_7VZ3Q => get details about one channel by ID
https://www.youtube.com/playlist?list=PL5Pd1geIk9IUBWUoUUNyBehNl0q5D1IuE => get details about one playlist and its videos
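
If the scraper were to accept such URLs as its target specification, it would probably need to tell the forms above apart. Here is a minimal sketch using only the standard library; `classify_youtube_url` and its return labels are hypothetical, and the `/user/` prefix is an assumption about old-style user URLs that is not in the list above.

```python
from urllib.parse import parse_qs, urlparse


def classify_youtube_url(url: str) -> str:
    """Return a coarse label for a YouTube URL (hypothetical helper)."""
    parsed = urlparse(url)
    path = parsed.path.rstrip("/")
    query = parse_qs(parsed.query)
    if path == "/watch" and "v" in query:
        return "video"
    if path == "/playlist" and "list" in query:
        return "playlist"
    # Handles (@name), channel IDs and (assumed) legacy /user/ paths.
    if path.startswith(("/@", "/channel/", "/user/")):
        return "channel_playlists" if path.endswith("/playlists") else "channel"
    return "unknown"
```

yt-dlp does its own URL matching internally, so a helper like this would only serve to validate or route user input before handing it over.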

The main limitation is that it seems hard to request only a subset of the data (e.g. get the list of all playlists but not the videos within them).

Whole code used for the tests:

import json
import yt_dlp

def process_one(ydl, url, filename):
    info = ydl.extract_info(url, download=False, process=False)
    with open(filename, "w") as fh:
        # ℹ️ ydl.sanitize_info makes the info json-serializable
        json.dump(ydl.sanitize_info(info), fh, indent=2)

# ℹ️ See help(yt_dlp.YoutubeDL) for a list of available options and public functions
ydl_opts = {}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    process_one(ydl, "https://www.youtube.com/watch?v=BaW_jenozKc", "one_video.json")

benoit74 commented 11 months ago

I also checked, chapters information is returned.
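
A minimal sketch of reading that chapter information, assuming the info dict carries a `chapters` list whose items have `start_time` (in seconds) and `title`, as yt-dlp commonly reports them. `format_chapters` is a hypothetical helper.

```python
from typing import Any


def format_chapters(info: dict[str, Any]) -> list[str]:
    """Render chapters as 'H:MM:SS title' lines (hypothetical helper)."""
    lines = []
    # "chapters" may be missing or None when a video has no chapters.
    for chapter in info.get("chapters") or []:
        start = int(chapter.get("start_time") or 0)
        minutes, seconds = divmod(start, 60)
        hours, minutes = divmod(minutes, 60)
        title = chapter.get("title") or ""
        lines.append(f"{hours:d}:{minutes:02d}:{seconds:02d} {title}")
    return lines
```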

kelson42 commented 11 months ago

This improvement has been on the table for quite a long time. Anything stopping us from moving forward?

rgaudin commented 11 months ago

Anything stopping us from moving forward?

Time, priority 😉

benoit74 commented 11 months ago

As you might see, this is not even in the 2.2.0 release I'm preparing, because there are already lower-hanging fruits to tackle before this "big" change.

joe-rabbit commented 9 months ago

Hello, I would like to work on this. How do I go about it? Thank you.

rgaudin commented 9 months ago

Hello, I would like to work on this. How do I go about it? Thank you.

This ticket involves a large refactor of the codebase. It requires a good understanding of the current codebase and a detailed breakdown of how you'd do it. Unless that's something you are willing to do, I'd advise you to look at other tickets.

benoit74 commented 9 months ago

yt-dlp is already used to download the videos; what we want is to also use it to get all information about the channels, users, playlists, videos, etc.

The plan is:

identify all places where the scraper uses the youtube API to grab information about channels, users, playlists, videos
decide how this code could be refactored to use information from yt-dlp (this is really the hard part, we want to use the exact same inputs, produce the same ZIM in the end, but there is absolutely not a one-to-one match between YouTube API and yt-dlp)
implement the change

Parts one and two should be done without any coding.

And as Renaud said, this is hence a complex task, but definitely not infeasible if you are ready to spend some time on it.

joe-rabbit commented 9 months ago

I will try my best :)

benoit74 commented 2 months ago

I'm not so sure that moving everything to yt-dlp is a wise move, at least it needs to be discussed again because of disadvantages that have not been discussed here so far.

I agree that the advantages are obvious, no need to list them again.

It was however unclear to me until now that there is a big downside to moving everything to yt-dlp. The problem is that for yt-dlp operations, Zimfarm workers are sometimes blacklisted. We have not yet managed to understand the exact circumstances, but what we know is that the ban is temporary (a few hours) and linked to the IP (they have nothing else to ban anyway).

Moving all operations from the YT API to yt-dlp means that we will be even more subject to this ban (more operations probably means more bans) AND the consequences of a ban will be more significant. If we implement #277 and continue to use the YT API instead of yt-dlp, we can refresh the ZIM (typically for UI enhancements) without having to use yt-dlp at all if the channel has not been updated, and hence without being impacted by a temporary ban.

I still consider that the advantages outweigh the disadvantages, especially since it is quite a rare edge case that a channel is unchanged since the last recipe execution, but I think it is very important that all of us are aware of this before modifying too much code.

chapmanjacobd commented 2 months ago

even more subject to this ban (more operations)

While you are evaluating yt-dlp, and measuring the number of requests that it makes, I'd like to suggest turning on a few specific options for the initial metadata scan to reduce the number of network requests:

ydl_opts = {
    "skip_download": True,
    "lazy_playlist": True,
    "extract_flat": True,
}

This might be helpful:

https://github.com/chapmanjacobd/library/blob/f253959d6de2c980fe42238ede2b908ef762c4a8/xklb/createdb/tube_backend.py#L97

And then you can fan-out the more detailed video metadata fetching across many IPs
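
A minimal sketch of that two-phase idea, assuming the flat scan returns a dictionary whose `entries` may themselves contain nested `entries` (for instance a channel containing playlists). Both helper names are hypothetical, and the round-robin split is only one possible way to distribute the detailed fetches; how the batches reach different IPs is left out entirely.

```python
from typing import Any, Iterator

# Options from the comment above, to be passed to yt_dlp.YoutubeDL(...)
FLAT_SCAN_OPTS = {"skip_download": True, "lazy_playlist": True, "extract_flat": True}


def iter_flat_entries(info: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield leaf entries of a flat scan, descending into nested playlists."""
    for entry in info.get("entries") or []:
        if not entry:
            continue  # skip empty placeholders, should any appear
        if entry.get("entries") is not None:
            yield from iter_flat_entries(entry)
        else:
            yield entry


def split_round_robin(items: list[Any], workers: int) -> list[list[Any]]:
    """Distribute items over `workers` batches, one batch per worker/IP."""
    batches: list[list[Any]] = [[] for _ in range(workers)]
    for index, item in enumerate(items):
        batches[index % workers].append(item)
    return batches
```

The first phase would be `info = ydl.extract_info(url, download=False)` with `FLAT_SCAN_OPTS`, followed by collecting `entry["id"]` from `iter_flat_entries(info)`; each batch from `split_round_robin` could then be handed to a separate worker for the detailed per-video extraction.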

benoit74 commented 2 months ago

Very good point @chapmanjacobd, thank you for notifying us!

openzim / youtube

Use yt-dlp instead of Youtube API #177