Open rgaudin opened 1 year ago
I did small tests of yt-dlp. They are very positive.
Test context : Python 3.11.4, yt-dlp 2023.10.7
Is it working for weird channel names like @Madrasa, which does not work without a channel ID?
Yes, it even found 5698 videos ...
Is it working for user ID types (old ones, instead of channel)?
Yes, it does not make a difference (tested for DirtyBiology, which is a channel, and cestpassorcierofficiel, which is a user).
How to extract all information mentioned above ?
I used this code:
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info(url, download=False)
One can also pass "process=False" to not retrieve details (e.g. when you want only author info, not all his videos)
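A minimal sketch of the process=False case, assuming only author-level fields are wanted. The helper name author_summary and the list of keys are illustrative, not part of yt-dlp; which keys are actually present depends on the extractor and the URL type, so each one is read with .get().

```python
def author_summary(ydl, url):
    """Return a few top-level fields without resolving every video.

    `ydl` is expected to be a yt_dlp.YoutubeDL instance (or any object
    exposing a compatible extract_info method).
    """
    # process=False stops after the first extraction step, so playlist
    # entries are left unresolved instead of being processed one by one.
    info = ydl.extract_info(url, download=False, process=False)
    # Key names are assumptions about the info dict; missing keys become None.
    wanted = ("id", "title", "channel", "channel_id", "uploader", "uploader_id")
    return {key: info.get(key) for key in wanted}
```

With a real instance this would be called as author_summary(ydl, "https://www.youtube.com/@PhilippHagemeister") inside the "with yt_dlp.YoutubeDL(ydl_opts) as ydl:" block.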
You can use the following URLs:
https://www.youtube.com/watch?v=BaW_jenozKc => get details about one video
https://www.youtube.com/@PhilippHagemeister => get details about one channel / user and all its videos
https://www.youtube.com/@PhilippHagemeister/playlists => get details about all playlists, and all videos in every playlist
https://www.youtube.com/channel/UCtqICqGbPSbTN09K1_7VZ3Q => get details about one channel by ID
https://www.youtube.com/playlist?list=PL5Pd1geIk9IUBWUoUUNyBehNl0q5D1IuE => get details about one playlist and its videos
The main limitation is that it seems hard to request only a small amount of data (e.g. get the list of all playlists but not the videos within them).
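The URL shapes listed above can be told apart before calling yt-dlp, which may help a scraper decide what kind of result to expect. This is only a sketch: classify_youtube_url and its return labels are hypothetical names, and the matching rules are inferred from the example URLs, not from yt-dlp itself.

```python
from urllib.parse import urlparse, parse_qs


def classify_youtube_url(url):
    """Guess what kind of resource a YouTube URL points to (hypothetical helper)."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    query = parse_qs(parsed.query)

    if parts[:1] == ["watch"] and "v" in query:
        return "video"
    if parts[:1] == ["playlist"] and "list" in query:
        return "playlist"
    if len(parts) >= 2 and parts[0] == "channel":
        return "channel_by_id"
    if parts and parts[0].startswith("@"):
        # A trailing /playlists tab lists every playlist of the channel.
        if parts[1:2] == ["playlists"]:
            return "channel_playlists"
        return "channel_or_user"
    return "unknown"
```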
Whole code used for the tests:
import json
import yt_dlp


def process_one(ydl, url, filename):
    info = ydl.extract_info(url, download=False, process=False)
    with open(filename, "w") as fh:
        # ℹ️ ydl.sanitize_info makes the info json-serializable
        json.dump(ydl.sanitize_info(info), fh, indent=2)


# ℹ️ See help(yt_dlp.YoutubeDL) for a list of available options and public functions
ydl_opts = {}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    process_one(ydl, "https://www.youtube.com/watch?v=BaW_jenozKc", "one_video.json")
I also checked, chapters information is returned.
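For illustration, chapters could be read from a video's info dict roughly as follows. The helper name list_chapters is made up for this sketch; it assumes the "chapters" field is a list of dicts with "start_time", "end_time" and "title" keys, and that it may be missing or None when a video has no chapters.

```python
def list_chapters(info):
    """Return (start_seconds, end_seconds, title) tuples from a video info dict."""
    # "chapters" may be absent or None when the video has no chapters.
    chapters = info.get("chapters") or []
    return [
        (chapter.get("start_time"), chapter.get("end_time"), chapter.get("title"))
        for chapter in chapters
    ]
```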
This improvement has been on the table for quite a long time. Anything stopping us from moving forward?
Anything stopping us from moving forward?
Time, priority 😉
As you might see, this is not even in the 2.2.0 release I'm preparing because there are already lower-hanging fruits to tackle before this "big" change.
Hello , I would like to work on this? how do I go about with this thank you
Hello , I would like to work on this? how do I go about with this thank you
This ticket involves a large refactor of the codebase. It requires a good understanding of the current codebase and a detailed breakdown of how you'd do it. Unless that's something you are willing to do, I'd advise you to look at other tickets.
yt-dlp is already used to download the videos; what we want is to also use it to get all information about the channels, users, playlists, videos, ...
The plan is :
Parts one and two should be done without any coding.
And as Renaud said, this is hence a complex task, but definitely not infeasible if you are ready to spend some time on it.
On Wed, 27 Dec 2023 at 19:52, rgaudin @.***> wrote:
Hello , I would like to work on this? how do I go about with this thank you
This ticket involves a large refactor of the codebase. It requires a good understanding of the current codebase and a detailed breakdown of how you'd do it. Unless that's something you are willing to do, I'd advise you to look at other tickets.
i will try my best :)
I'm not so sure that moving everything to yt-dlp is a wise move. At the very least it needs to be discussed again, because of disadvantages that have not been discussed here so far.
I agree that the advantages are obvious, no need to list them again.
Until now, however, it was unclear to me that there is a big downside to moving everything to yt-dlp: Zimfarm workers are sometimes blacklisted when running yt-dlp operations. We have not yet understood the exact circumstances, but we do know that the ban is temporary (a few hours) and linked to the IP (they have nothing else to ban anyway).
Moving all operations from the YT API to yt-dlp means that we will be even more exposed to this ban (more operations probably means more bans) AND that the consequences of a ban will be more significant. If we implement #277 and continue to use the YT API instead of yt-dlp, we can refresh the ZIM (typically for UI enhancements) without using yt-dlp at all when the channel has not been updated, and hence without being impacted by a temporary ban.
I still consider that the advantages outweigh the disadvantages, especially since it is quite a rare edge case for a channel to be unchanged since the last recipe execution, but I think it is very important that all of us are aware of this before modifying too much code.
even more subject to this ban (more operations)
While you are evaluating yt-dlp, and measuring the number of requests that it makes, I'd like to suggest turning on a few specific options for the initial metadata scan to reduce the number of network requests:
ydl_opts = {
    "skip_download": True,
    "lazy_playlist": True,
    "extract_flat": True,
}
This might be helpful:
And then you can fan-out the more detailed video metadata fetching across many IPs
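A rough sketch of how the flat scan could feed that fan-out: collect IDs from the flat result first, then fetch detailed metadata per video elsewhere. The names FLAT_OPTS and iter_entry_ids are invented for this example, and the shape of the result is an assumption: leaf entries are assumed to be videos carrying an "id", and nested playlists (for example channel tabs) are assumed to carry their own "entries". Unresolved references to other playlists are not handled.

```python
FLAT_OPTS = {
    "skip_download": True,
    "lazy_playlist": True,
    "extract_flat": True,
}


def iter_entry_ids(info):
    """Yield leaf entry IDs from a flat info dict, descending into nested playlists."""
    entries = info.get("entries")
    if entries is None:
        # No "entries" key: treat the dict as a single item and use its own ID.
        if info.get("id"):
            yield info["id"]
        return
    # `entries` may be a list or a lazy iterable; both are consumed the same way.
    for entry in entries:
        if entry is None:
            continue  # skip placeholders for unavailable entries
        yield from iter_entry_ids(entry)
```

The info argument would come from yt_dlp.YoutubeDL(FLAT_OPTS).extract_info(url, download=False); each yielded ID can then be handed to a separate worker for the detailed fetch.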
Very good point @chapmanjacobd, thank you for notifying us!
Youtube Data API v3 has served us relatively well for 4 years now. I think it's time to move away from it because:
youtube-dl and the fork we use (yt-dlp) have greatly improved in 4 years. Switching to it would have the following benefits:
This change would require an important revamp of the scraper, but partly because it's still a filesystem-based one.
Important feature check list to test/poc first: