rom1504 / cc2dataset

Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...
MIT License
303 stars 23 forks

support video platform #27

Open rom1504 opened 1 year ago

rom1504 commented 1 year ago

https://ytdl-org.github.io/youtube-dl/supportedsites.html

rom1504 commented 1 year ago

related : https://github.com/iejMac/video2dataset

rom1504 commented 1 year ago

can use yt-dlp's _match_valid_url

rom1504 commented 1 year ago

from yt_dlp.extractor import gen_extractor_classes, GenericIE

def is_supported(url):
    # True if any site-specific extractor claims the URL.
    # GenericIE is skipped because it accepts almost any URL.
    for ie in gen_extractor_classes():
        if ie != GenericIE and ie.suitable(url):
            return True
    return False

is_supported("https://www.youtube.com/watch?v=i_xBWhJB6VM")
is_supported("https://tv.naver.com/v/31992728/list/67096")
is_supported("https://static1.bigstockphoto.com/thumbs/2/3/2/large2/23261459.jpg")

advised by a yt-dlp maintainer; however, it may miss URLs that only GenericIE handles, like "direct manifest URLs, webpages with youtube embeds etc"

rom1504 commented 1 year ago

sadly this seems too slow, will need something more approximate
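
One possible "more approximate" check is a plain hostname lookup, which avoids looping over every extractor for each URL. This is only a sketch: `VIDEO_HOSTS`, `host_of` and `maybe_video` are hypothetical names, and the host list is a placeholder rather than a measured one.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist for illustration; a real list would come from measured data.
VIDEO_HOSTS = {"youtube.com", "youtu.be", "vimeo.com", "dailymotion.com"}

def host_of(url):
    """Return the lowercased hostname without a leading 'www.', or '' if unparsable."""
    try:
        host = (urlsplit(url).hostname or "").lower()
    except ValueError:  # malformed URLs are common in crawl data
        return ""
    return host[4:] if host.startswith("www.") else host

def maybe_video(url):
    """Cheap prefilter: True when the host is, or is a subdomain of, a listed host."""
    host = host_of(url)
    return any(host == h or host.endswith("." + h) for h in VIDEO_HOSTS)
```

A host-level check like this only says that a link points at a video platform, not that the page actually holds a video.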

rom1504 commented 1 year ago

but actually it also seems to have high recall but low precision: it catches a lot of platform links that could contain videos but do not

rom1504 commented 1 year ago

better idea: collect a bunch of positive and negative links, and build regexes or a very cheap predictor to know which are good
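
A rough sketch of how candidate regexes could be scored against such positive and negative links. The function name `precision_recall` and the sample URLs are made up for illustration; real labels would come from actually trying the links.

```python
import re

def precision_recall(pattern, positives, negatives):
    """Score a regex on labeled URLs; returns (precision, recall)."""
    rx = re.compile(pattern)
    tp = sum(1 for u in positives if rx.match(u))  # good links accepted
    fp = sum(1 for u in negatives if rx.match(u))  # bad links accepted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / len(positives) if positives else 0.0
    return precision, recall

# Illustrative labels only, not measured data.
positives = ["https://vimeo.com/123456", "https://player.vimeo.com/video/123456"]
negatives = ["https://vimeo.com/about", "https://vimeo.com/channels/staffpicks"]

strict = precision_recall(r"^https?://vimeo\.com/[0-9]+$", positives, negatives)
broad = precision_recall(r"^https?://(player\.)?vimeo\.com/.+", positives, negatives)
```

On this toy sample the strict pattern gives (1.0, 0.5) and the broad one (0.5, 1.0), which is the trade-off to tune per platform.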

rom1504 commented 1 year ago

Best way to do this

  1. Run cc2dataset on a few shards, either without a filter or with a very broad filter based on yt-dlp filters (e.g. #36)
  2. Run yt-dlp / video2dataset on the result; that gives working and non-working links
  3. Use the result as a test set to build a "platform from url" classifier
  4. Use that classifier in cc2dataset to get many links from many platforms
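
A small sketch of how the output of step 2 could be summarized before building the classifier: the share of working links per host. It assumes the results are available as `(url, worked)` pairs, which is a hypothetical format and not something yt-dlp or video2dataset is known to emit directly; `success_rate_by_host` is also a made-up name.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def success_rate_by_host(results):
    """Map each hostname to the fraction of its links that worked."""
    counts = defaultdict(lambda: [0, 0])  # host -> [working, total]
    for url, worked in results:
        try:
            host = (urlsplit(url).hostname or "").lower()
        except ValueError:  # skip malformed URLs
            continue
        counts[host][0] += int(worked)
        counts[host][1] += 1
    return {host: ok / total for host, (ok, total) in counts.items()}

# Placeholder data for illustration only.
sample = [
    ("https://a.example/v/1", True),
    ("https://a.example/v/2", False),
    ("https://b.example/x", False),
]
rates = success_rate_by_host(sample)
```

Hosts with a high rate are candidates for dedicated regexes; hosts with a low rate can be dropped from the broad filter.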

rom1504 commented 11 months ago

https://gist.github.com/rom1504/f1f8fd253def49ce02a990229d7bf09d some work on this

rom1504 commented 11 months ago

https://github.com/v2fly/domain-list-community/tree/master/data might be interesting

rom1504 commented 10 months ago

limited version for 3 platforms (but one that works):

import re

# Raw strings avoid invalid-escape warnings; dots are escaped so they only match a literal "."
def is_dailymotion_video(url):
  if re.match(r'^https?://www\.dailymotion\.com/video/.+$', url):
    return True

  return False

def is_vimeo_video(url):
  if re.match(r'^https?://vimeo\.com/[0-9]+$', url):
    return True
  if re.match(r'^https?://player\.vimeo\.com/video/[0-9]+.*$', url):
    return True

  return False

def is_youtube_video(url):
  if re.match(r'^https?://(www\.)?youtube\.com/watch\?v=.+$', url):
    return True
  if re.match(r'^https?://(www\.)?youtube\.com/v/.+$', url):
    return True
  if re.match(r'^https?://(www\.)?youtube\.com/embed/.+$', url):
    return True
  if re.match(r'^https?://(www\.)?youtu\.be/.+$', url):
    return True

  return False
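
The same three checks could also be written as one table-driven "platform from url" function, which makes adding a platform a one-line change. This is a sketch, not the code used in cc2dataset: `PLATFORM_PATTERNS`, `COMPILED` and `platform_from_url` are hypothetical names, and the patterns are the ones above with dots escaped.

```python
import re

# Patterns per platform, mirroring the per-platform functions above.
PLATFORM_PATTERNS = {
    "dailymotion": [r"^https?://www\.dailymotion\.com/video/.+$"],
    "vimeo": [
        r"^https?://vimeo\.com/[0-9]+$",
        r"^https?://player\.vimeo\.com/video/[0-9]+.*$",
    ],
    "youtube": [
        r"^https?://(www\.)?youtube\.com/(watch\?v=|v/|embed/).+$",
        r"^https?://(www\.)?youtu\.be/.+$",
    ],
}

# Compile once so each URL check is only a handful of regex matches.
COMPILED = {
    name: [re.compile(p) for p in patterns]
    for name, patterns in PLATFORM_PATTERNS.items()
}

def platform_from_url(url):
    """Return the platform name for a video URL, or None if nothing matches."""
    for name, patterns in COMPILED.items():
        if any(p.match(url) for p in patterns):
            return name
    return None
```

For example, `platform_from_url("https://vimeo.com/123")` returns `"vimeo"`, and an image link returns `None`.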