r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

YouTube Transcripts #53

Open nkandpa2 opened 5 months ago

nkandpa2 commented 5 months ago

Videos on YouTube can optionally be published under a CC-BY license. We can identify these videos with the YouTube API, download them, and transcribe them with an ASR system like whisperX.
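
A minimal sketch of the identification step, assuming the YouTube Data API v3 `search.list` endpoint and its `videoLicense=creativeCommon` filter. The function name `build_cc_search_url` and the placeholder API key are illustrative only and are not taken from any existing code for this issue. The sketch only builds the request URL and makes no network call.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public endpoint of the YouTube Data API v3 search.list method.
SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_cc_search_url(api_key, query="", page_token=None, max_results=50):
    """Build a search.list URL restricted to Creative Commons videos.

    Hypothetical helper: the name and defaults are assumptions.
    """
    if not 0 <= max_results <= 50:
        raise ValueError("max_results must be between 0 and 50")
    params = {
        "part": "snippet",
        "type": "video",  # the videoLicense filter requires type=video
        "videoLicense": "creativeCommon",
        "maxResults": max_results,
        "key": api_key,
    }
    if query:
        params["q"] = query
    if page_token:
        # Pass the nextPageToken of the previous response to page through results.
        params["pageToken"] = page_token
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# Usage: inspect the parameters of a generated URL.
url = build_cc_search_url("YOUR_API_KEY", query="lecture")
print(parse_qs(urlparse(url).query))
```

Each response page would then need to be fetched and its video IDs stored before the download and transcription steps.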

nkandpa2 commented 4 months ago

Code for cataloging CC YouTube videos can be found in this repo: https://github.com/nkandpa2/youtube-commons. About 800K CC videos adding up to about 300K hours of primarily speech-based video have been cataloged.

storytracer commented 3 months ago

The US government offers a search of government YouTube videos: https://find.search.gov/search/news?affiliate=usagov_all_gov&channel=10448&query= . All of these videos are produced by government agencies, are in the public domain, and seem to all be hosted on YouTube. Public domain is not an option in the YouTube upload interface; you can only choose between the standard YouTube license and CC, which is why these videos can’t be found by license through the YouTube API. So the ~200K public domain videos findable through this search index could also be added to your YouTube Commons collection, @nkandpa2, and processed with Whisper.

I can't find a public-facing search API for find.search.gov; the only APIs seem to be for internal use. So the search.gov video index would have to be web-scraped to get a list of links to all the government YouTube videos.
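
A rough sketch of the link-list step, assuming the scraped result pages contain ordinary `href` attributes pointing at YouTube watch URLs. The actual markup of find.search.gov has not been checked here, and the helper names `extract_video_id` and `video_ids_from_html` are hypothetical. Fetching the pages is left out; the sketch only turns already-downloaded HTML into a deduplicated list of video IDs.

```python
import html
import re
from urllib.parse import urlparse, parse_qs

# Assumption: links appear as double-quoted href attributes.
HREF_RE = re.compile(r'href="([^"]+)"')
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url):
    """Return the YouTube video ID found in a URL, or None if there is none."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the path.
        return parsed.path.lstrip("/") or None
    if host in YOUTUBE_HOSTS and parsed.path == "/watch":
        ids = parse_qs(parsed.query).get("v")
        return ids[0] if ids else None
    return None


def video_ids_from_html(page):
    """Collect unique video IDs from a page, keeping first-seen order."""
    seen = {}
    for href in HREF_RE.findall(page):
        # Undo HTML entity escaping such as &amp; before parsing the URL.
        video_id = extract_video_id(html.unescape(href))
        if video_id:
            seen.setdefault(video_id, None)
    return list(seen)
```

Deduplicating on the video ID rather than the full URL avoids counting the same video twice when it is linked in both the long and the short form.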

craffel commented 3 months ago

Can we find any explicit statement that these videos are public domain, or do we need to rely on the reasoning that "these were made by the government, therefore they are public domain"?

storytracer commented 3 months ago

No, so far I have not been able to find any explicit statements inside the government YouTube channels that the videos they host are in the public domain. The only exception is using the full-text search to search for "Public Domain", which finds only 3,000 videos: https://find.search.gov/search/news?affiliate=usagov_all_gov&channel=10448&sort_by=&query=%22public+domain%22.

The only explicit statement about a channel I was able to find so far is on the USDA website, which states in a privacy impact assessment (p. 4) that all the content in their YouTube channel is in the public domain: "Video content published to the USDA YouTube page will be previously approved by relevant Department and Office of Communications leadership, and will be available in the public domain." But while USDA makes this statement in this random document on their website, they do not include it in their YouTube channel or in the videos themselves, which are indexed through find.search.gov.

Generally speaking, government videos are rarely explicitly declared to be in the public domain. But that's the standard case for federal government documents as well. Outside the cultural heritage sector, works are rarely explicitly declared to be in the public domain with something like the PD Mark, and even in the CH sector the use of a PD Mark is much less common than CC licenses.

US Federal Government works are "born in the public domain" by law; they don't become part of the public domain at some later point through a waiver or expiration of rights. But such a general legal status is of course harder to document than an explicit rights statement, so it's worth evaluating whether transcribing government videos in addition to CC videos justifies the effort.

nkandpa2 commented 2 months ago

This is a great idea @storytracer. I think we can cover many of these videos by simply including the major US agencies' YouTube channels in the dataset. Looking through a couple of them from the search results you provided, channels sometimes will explicitly state in the description something along the lines of "Here you will find original content produced by [AGENCY]". So to me it feels reasonable to count this data as PD. We can always remove this later if we are unsure of the PD status.
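
One way to keep that removal cheap later is to record, per video, why it is considered permissively licensed. The sketch below is only an illustration: the `VideoRecord` fields and the labels `"cc-by"` and `"us-gov-pd-assumed"` are made up for this example and are not part of any existing catalog format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoRecord:
    video_id: str
    channel: str
    # Hypothetical labels: "cc-by" for videos marked CC on YouTube,
    # "us-gov-pd-assumed" for agency channels counted as public domain.
    license_basis: str


def drop_basis(records, basis):
    """Return the records whose license basis differs from `basis`."""
    return [r for r in records if r.license_basis != basis]


# Usage: remove the assumed-PD government videos in one pass.
catalog = [
    VideoRecord("id1", "Some CC channel", "cc-by"),
    VideoRecord("id2", "Some agency channel", "us-gov-pd-assumed"),
]
cc_only = drop_basis(catalog, "us-gov-pd-assumed")
print([r.video_id for r in cc_only])
```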

storytracer commented 1 week ago

@nkandpa2 What's the current state of cataloging? Do you have an ETA, and do you think you could upload the catalog as a dataset to HF? I would like to help with the distributed processing of the audio files on our infrastructure!