Open rom1504 opened 1 year ago
related : https://github.com/iejMac/video2dataset
can use yt-dlp's _match_valid_url (each extractor's suitable() calls it):
from yt_dlp.extractor import gen_extractor_classes, GenericIE

def is_supported(url):
    # True if any site-specific extractor (not the generic fallback) claims the URL
    for ie in gen_extractor_classes():
        if ie is not GenericIE and ie.suitable(url):
            return True
    return False

is_supported("https://www.youtube.com/watch?v=i_xBWhJB6VM")
is_supported("https://tv.naver.com/v/31992728/list/67096")
is_supported("https://static1.bigstockphoto.com/thumbs/2/3/2/large2/23261459.jpg")
advised by a yt-dlp maintainer; however, it may miss URLs handled only by GenericIE, like "direct manifest URLs, webpages with youtube embeds etc"
sadly this seems too slow, will need something more approximate
it also seems to have high recall but low precision: it catches a lot of platform links that could contain videos but do not
better idea: collect a bunch of positive and negative links, and build regexes or a very cheap predictor to know which are good
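To compare candidate regexes or predictors on such a labeled set, something like the sketch below could be used. It is only an illustration: `precision_recall`, the naive predictor, and the example URLs and labels are all made up for this comment, not taken from real data.

```python
def precision_recall(predict, positives, negatives):
    """Score a URL predictor against labeled links.

    predict: callable taking a URL string and returning True/False.
    positives / negatives: lists of URLs known to be videos / not videos.
    """
    tp = sum(1 for u in positives if predict(u))   # videos correctly accepted
    fp = sum(1 for u in negatives if predict(u))   # non-videos wrongly accepted
    fn = len(positives) - tp                       # videos wrongly rejected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labeled links, for illustration only.
positives = [
    "https://www.youtube.com/watch?v=i_xBWhJB6VM",
    "https://vimeo.com/123456",
]
negatives = [
    "https://www.youtube.com/about/",
    "https://static1.bigstockphoto.com/thumbs/2/3/2/large2/23261459.jpg",
]

# A deliberately naive predictor: any URL mentioning youtube.com.
def naive(url):
    return "youtube.com" in url

print(precision_recall(naive, positives, negatives))
```

The naive predictor accepts the YouTube "about" page and misses the Vimeo link, which is the kind of precision and recall gap this comparison is meant to surface.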
Best way to do this
https://gist.github.com/rom1504/f1f8fd253def49ce02a990229d7bf09d some work on this
https://github.com/v2fly/domain-list-community/tree/master/data might be interesting
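If a list of video-platform domains were available (from a repository like the one above or anywhere else), a very cheap first pass could be a host suffix lookup. This is a rough sketch under that assumption: `VIDEO_DOMAINS` is a hand-written placeholder set and `host_matches` is a hypothetical helper, neither is derived from that repository's files.

```python
from urllib.parse import urlsplit

# Placeholder domain set, assumed for illustration only.
VIDEO_DOMAINS = {"youtube.com", "youtu.be", "vimeo.com", "dailymotion.com"}


def host_matches(url, domains=VIDEO_DOMAINS):
    """Return True if the URL's host is one of `domains` or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Check every suffix of the host: a.b.c -> "a.b.c", "b.c", "c"
    return any(".".join(labels[i:]) in domains for i in range(len(labels)))


print(host_matches("https://player.vimeo.com/video/1"))
print(host_matches("https://static1.bigstockphoto.com/thumbs/2/3/2/large2/23261459.jpg"))
```

This only says the link is on a video platform, not that it points at a video, so it would still have the low-precision problem noted earlier and is better suited as a pre-filter before a more expensive check.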
limited version for 3 platforms (but one that works):
import re

# Raw strings with escaped dots, so "." only matches a literal dot in the host
def is_dailymotion_video(url):
    if re.match(r'^https?://www\.dailymotion\.com/video/.+$', url):
        return True
    return False

def is_vimeo_video(url):
    if re.match(r'^https?://vimeo\.com/[0-9]+$', url):
        return True
    if re.match(r'^https?://player\.vimeo\.com/video/[0-9]+.*$', url):
        return True
    return False

def is_youtube_video(url):
    if re.match(r'^https?://(www\.)?youtube\.com/watch\?v=.+$', url):
        return True
    if re.match(r'^https?://(www\.)?youtube\.com/v/.+$', url):
        return True
    if re.match(r'^https?://(www\.)?youtube\.com/embed/.+$', url):
        return True
    if re.match(r'^https?://(www\.)?youtu\.be/.+$', url):
        return True
    return False
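For filtering a large number of links, the three checks could be folded into one function over precompiled patterns. This is a sketch only: `VIDEO_URL_PATTERNS` and `is_video_url` are names made up here, and the patterns are the same ones as above with the dots escaped and the three youtube.com forms merged into one alternation.

```python
import re

# Same URL shapes as the per-platform checks, compiled once up front.
VIDEO_URL_PATTERNS = [re.compile(p) for p in (
    r"^https?://www\.dailymotion\.com/video/.+$",
    r"^https?://vimeo\.com/[0-9]+$",
    r"^https?://player\.vimeo\.com/video/[0-9]+.*$",
    r"^https?://(www\.)?youtube\.com/(watch\?v=|v/|embed/).+$",
    r"^https?://(www\.)?youtu\.be/.+$",
)]


def is_video_url(url):
    """Return True if the URL matches any of the known video URL shapes."""
    return any(p.match(url) for p in VIDEO_URL_PATTERNS)


for u in (
    "https://www.youtube.com/watch?v=i_xBWhJB6VM",
    "https://tv.naver.com/v/31992728/list/67096",
    "https://static1.bigstockphoto.com/thumbs/2/3/2/large2/23261459.jpg",
):
    print(u, is_video_url(u))
```

Adding a platform then means appending one pattern to the list rather than writing another function.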
https://ytdl-org.github.io/youtube-dl/supportedsites.html