openzim / youtube

Create a ZIM file from a Youtube channel/username/playlist
GNU General Public License v3.0

Use yt-dlp instead of Youtube API #177

Open rgaudin opened 1 year ago

rgaudin commented 1 year ago

Youtube Data API v3 has served us relatively well for 4 years now. I think it's time to move away from it because:

youtube-dl and the fork we use (yt-dlp) have greatly improved in 4 years. Switching to it would have the following benefits:

This change would require a major revamp of the scraper, but that is partly because it's still a filesystem-based one.

Important feature checklist to test/PoC first:

benoit74 commented 11 months ago

I did small tests of yt-dlp. They are very positive.

Test context: Python 3.11.4, yt-dlp 2023.10.7

Does it work for weird channel names like @Madrasa, which does not work without a channel ID?

Yes, it even found 5698 videos ...

Does it work for user ID types (old ones, instead of channel)?

Yes, it does not make a difference (tested with DirtyBiology, which is a channel, and cestpassorcierofficiel, which is a user).

How to extract all the information mentioned above?

I used this code:

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
  info = ydl.extract_info(url, download=False)

One can also pass process=False to skip retrieving details (e.g. when you only want author info, not all of their videos).

You can use the following URL:

The main limitation is that it seems hard to request only a subset of the data (e.g. get the list of all playlists but not the videos within them).
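To illustrate the process=False idea, here is a minimal sketch that keeps only author-level fields from the returned info dict. The field names in AUTHOR_FIELDS and the helper names author_summary and fetch_author are assumptions for illustration, not something tested in this thread; depending on the URL, some of these fields may be absent, which is why only the keys that are present are kept.

```python
from typing import Any

# Assumed author-level keys; any of them may be missing from a given result.
AUTHOR_FIELDS = ("channel", "channel_id", "channel_url", "uploader", "uploader_id")


def author_summary(info: dict[str, Any]) -> dict[str, Any]:
    """Keep only the author-level fields that are present and not None."""
    return {key: info[key] for key in AUTHOR_FIELDS if info.get(key) is not None}


def fetch_author(url: str) -> dict[str, Any]:
    """Network call: extract without processing, so videos are not resolved."""
    import yt_dlp  # imported lazily so author_summary() can be used offline

    with yt_dlp.YoutubeDL({"quiet": True}) as ydl:
        info = ydl.extract_info(url, download=False, process=False)
    return author_summary(info)
```

Keeping the pure filtering step separate from the network call makes it easy to check the field selection against a saved JSON dump before hitting YouTube again.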

Whole code used for the tests:

import json
import yt_dlp

def process_one(ydl, url, filename):
    info = ydl.extract_info(url, download=False, process=False)
    with open(filename, "w") as fh:
        # ℹ️ ydl.sanitize_info makes the info json-serializable
        json.dump(ydl.sanitize_info(info), fh, indent=2)

# ℹ️ See help(yt_dlp.YoutubeDL) for a list of available options and public functions
ydl_opts = {}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    process_one(ydl, "https://www.youtube.com/watch?v=BaW_jenozKc", "one_video.json")

benoit74 commented 11 months ago

I also checked: chapter information is returned.
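For reference, a small sketch of how chapter data could be read from a video's info dict. It assumes the dict carries a "chapters" list whose items have "start_time" (in seconds) and "title", which is how yt-dlp documents the field; format_chapters is a hypothetical helper name.

```python
from typing import Any


def format_chapters(info: dict[str, Any]) -> list[str]:
    """Render chapters as "MM:SS title" lines; empty list when there are none.

    Minutes are not wrapped at 60, so a chapter starting after one hour
    shows up as e.g. "61:40".
    """
    lines = []
    for chapter in info.get("chapters") or []:
        start = int(chapter.get("start_time") or 0)
        minutes, seconds = divmod(start, 60)
        lines.append(f"{minutes:02d}:{seconds:02d} {chapter.get('title', '')}")
    return lines
```

The `or []` guard covers videos where the key is missing as well as those where it is present but set to None.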

kelson42 commented 11 months ago

This improvement has been on the table for quite a long time. Is anything stopping us from moving forward?

rgaudin commented 11 months ago

Is anything stopping us from moving forward?

Time, priority 😉

benoit74 commented 11 months ago

As you might see, this is not even in the 2.2.0 release I'm preparing because there are already lower-hanging fruits to tackle before this "big" change.

joe-rabbit commented 9 months ago

Hello, I would like to work on this. How do I go about it? Thank you.

rgaudin commented 9 months ago

Hello, I would like to work on this. How do I go about it? Thank you.

This ticket involves a large refactor of the codebase. It requires a good understanding of the current codebase and a detailed breakdown of how you'd do it. Unless that's something you are willing to do, I'd advise you to look at other tickets.

benoit74 commented 9 months ago

yt-dlp is already used to download the videos; what we want is to also use it to get all information about the channels, users, playlists, videos, etc.

The plan is:

Parts one and two should be done without any coding.

And as Renaud said, this is a complex task, but definitely not infeasible if you are ready to spend some time on it.

joe-rabbit commented 9 months ago

I will try my best :)

benoit74 commented 2 months ago

I'm not so sure that moving everything to yt-dlp is a wise move. At the very least, it needs to be discussed again because of disadvantages that have not been raised here so far.

I agree that the advantages are obvious; no need to list them again.

It was however unclear to me until now that there is a big downside to moving everything to yt-dlp. The problem is that Zimfarm workers are sometimes blacklisted for yt-dlp operations. We have not yet understood the exact circumstances, but what we know is that the ban is temporary (a few hours) and linked to the IP (they have nothing else to ban anyway).

Moving all operations from the YT API to yt-dlp means that we will be even more subject to this ban (more operations probably means more bans) AND the consequences of a ban will be more significant. If we implement #277 and continue to use the YT API instead of yt-dlp, we can refresh the ZIM (typically for UI enhancements) without having to use yt-dlp at all if the channel has not been updated, and hence without being impacted by a temporary ban.

I still consider that the advantages outweigh the disadvantages, especially since it is quite a rare edge case that a channel is unchanged since the last recipe execution, but I think it is very important that all of us are aware of this before modifying too much code.
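To make the "channel has not been updated" case concrete, here is a purely hypothetical sketch. It assumes the video IDs from the previous run are stored somewhere and that the current list comes from the YT API; neither is described in this thread, and this is not meant to represent the design of #277.

```python
def videos_to_fetch(previous_ids: set[str], current_ids: set[str]) -> set[str]:
    """Return the IDs listed now that were absent from the previous run."""
    return current_ids - previous_ids


def needs_ytdlp(previous_ids: set[str], current_ids: set[str]) -> bool:
    """True only when at least one new video would have to be downloaded."""
    return bool(videos_to_fetch(previous_ids, current_ids))
```

When needs_ytdlp returns False, a run could rebuild the ZIM from already-downloaded material and never contact YouTube through yt-dlp, so a temporary IP ban would not affect it.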

chapmanjacobd commented 2 months ago

even more subject to this ban (more operations)

While you are evaluating yt-dlp, and measuring the number of requests that it makes, I'd like to suggest turning on a few specific options for the initial metadata scan to reduce the number of network requests:

ydl_opts = {
    "skip_download": True,
    "lazy_playlist": True,
    "extract_flat": True,
}

This might be helpful:

https://github.com/chapmanjacobd/library/blob/f253959d6de2c980fe42238ede2b908ef762c4a8/xklb/createdb/tube_backend.py#L97

And then you can fan out the more detailed video metadata fetching across many IPs.
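A rough sketch of how the flat scan and the fan-out could fit together, under some assumptions: the scanned URL points at a playlist or a channel's videos listing so that leaf entries are videos, each entry carries an "id", and unavailable items may appear as None. The helper names (iter_video_ids, chunked, flat_scan) are hypothetical; only the three options come from the comment above.

```python
from typing import Any, Iterable, Iterator

# Options quoted from the comment above.
FLAT_SCAN_OPTS = {"skip_download": True, "lazy_playlist": True, "extract_flat": True}


def iter_video_ids(info: dict[str, Any]) -> Iterator[str]:
    """Walk nested "entries" and yield the id of every leaf entry."""
    entries = info.get("entries")
    if entries is None:
        # Leaf entry: assumed to be a video reference.
        if info.get("id"):
            yield info["id"]
        return
    for entry in entries:
        if entry:  # skip None placeholders
            yield from iter_video_ids(entry)


def chunked(ids: Iterable[str], size: int) -> Iterator[list[str]]:
    """Split ids into batches, e.g. one batch per worker or IP."""
    batch: list[str] = []
    for video_id in ids:
        batch.append(video_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def flat_scan(url: str) -> list[str]:
    """Network call: list video IDs without resolving each video."""
    import yt_dlp  # imported lazily so the helpers above work offline

    with yt_dlp.YoutubeDL(FLAT_SCAN_OPTS) as ydl:
        info = ydl.extract_info(url, download=False)
    return list(iter_video_ids(info))
```

Each batch from chunked(flat_scan(url), size) could then be handed to a different worker for the detailed per-video extraction. If the scan returns unresolved nested playlists, their IDs would need to be filtered out before batching.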

benoit74 commented 2 months ago

Very good point @chapmanjacobd, thank you for notifying us!