We need a YouTube account for MyTake. YouTube has a Nonprofit Program with a couple of perks.
I'm not sure if we're eligible yet since our 501(c)(3) application is still in progress. The guidelines say simply that we must be "registered as a charitable organization" and also "registered with the local TechSoup partner".
I've used KeepVid Pro to download and trim the 2 videos we use in the current sample article. The trial version lets you download 2 videos at 720p, so that's what I've done. Licenses are roughly $30/year; business licenses are a few dollars cheaper.
I'll document the steps for captioning a video and I'll use my personal YouTube channel for now to make progress on this feature until we figure out the MyTake Google account situation.
I'm updating this thread based on a phone conversation with Ned.
When I uploaded the videos to YouTube, they were automatically captioned, with a few mistakes here and there ("Patrice brach the trees" instead of "Patrice Brock, Patrice", or "daytime" instead of "detente"). The automatic captions can be downloaded in 3 different formats, one of which has a timestamp associated with each spoken word (`.vtt`).
Our plan is to download the `.vtt` file, edit it for accuracy, and programmatically load/transform it into our data model as an object that maps words to timestamps. I'm doing this now with a custom webpack loader.
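For reference, a loader along these lines could look like the sketch below. This is hypothetical (the real loader may differ) and assumes YouTube's word-level timing tags look like `<00:00:01.319><c> the</c>`:

```ts
// Hypothetical sketch of the custom webpack loader; names and details are
// illustrative, not the actual implementation.
import type { LoaderContext } from "webpack";

interface TimedWord {
  word: string;
  timestamp: number; // seconds from the start of the video
}

// Matches "<hh:mm:ss.mmm><c> word</c>" pairs in YouTube's auto-caption .vtt.
const WORD_TAG = /<(\d{2}):(\d{2}):(\d{2})\.(\d{3})><c>([^<]+)<\/c>/g;

export default function vttLoader(
  this: LoaderContext<{}>,
  source: string
): string {
  const words: TimedWord[] = [];
  let match: RegExpExecArray | null;
  while ((match = WORD_TAG.exec(source)) !== null) {
    const [, hh, mm, ss, ms, word] = match;
    words.push({
      word: word.trim(),
      timestamp: +hh * 3600 + +mm * 60 + +ss + +ms / 1000,
    });
  }
  // Emit a module whose default export is the word/timestamp array.
  return `export default ${JSON.stringify(words)};`;
}
```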
We also need to map the speakers in the video to a range of spoken words. For example, in the second Trump/Hillary debate Martha Raddatz spoke the first 8 words `{speaker: 'Raddatz', range: [0, 7]}`, Anderson Cooper spoke the next 81 words `{speaker: 'Cooper', range: [8, 89]}`, etc. This step has to be done manually, and in the future we may build an interface so contributors can do this easily.
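As a sketch, the shape of that map could be typed like this (hypothetical names; the real data model may differ):

```ts
// Hypothetical sketch of the speaker map's shape.
interface SpeakerRange {
  speaker: string;
  range: [number, number]; // inclusive [firstWordIndex, lastWordIndex]
}

const speakerMap: SpeakerRange[] = [
  { speaker: "Raddatz", range: [0, 7] },
  { speaker: "Cooper", range: [8, 89] },
];
```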
From this data model we can build a nice UI: captions that autoscroll to stay in sync with the video as it plays, and a video that automatically scrubs to the proper time when the user selects or clicks some of the text. It will also let the user perform advanced text searches, e.g. show every time Trump said "wall".
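As a rough sketch of the sync logic, assuming a plain HTML5 `<video>` element and the word/timestamp array from the loader sketch above (a YouTube embed would use the iframe API's `getCurrentTime()`/`seekTo()` instead):

```ts
// Hypothetical sketch; element classes and the words array are assumptions.
declare const words: { word: string; timestamp: number }[];

const video = document.querySelector("video")!;
const captionSpans = Array.from(
  document.querySelectorAll<HTMLSpanElement>(".caption-word")
);

// Autoscroll: keep the word being spoken right now centered in view.
video.addEventListener("timeupdate", () => {
  const t = video.currentTime;
  // The current word is the last one whose timestamp is <= t
  // (a binary search would be faster for a 90-minute debate).
  const next = words.findIndex((w) => w.timestamp > t);
  const i = next === -1 ? words.length - 1 : next - 1;
  if (i >= 0) {
    captionSpans[i].scrollIntoView({ block: "center", behavior: "smooth" });
  }
});

// Scrub: clicking a word seeks the video to that word's timestamp.
captionSpans.forEach((span, i) => {
  span.addEventListener("click", () => {
    video.currentTime = words[i].timestamp;
  });
});
```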
This is pretty much done. Check out the preview and click the 2nd Trump/Hillary debate for the working example.
Other than fine-tuning, the next steps to roll this out are to upload all the debates with captions to our YouTube channel and add the `speaker => wordRange` map for each of them to our data model.
It took me 6 hours to properly caption the 2nd Trump/Hillary debate for our data model, which is too long. The debate is just over an hour and a half, and the method I used required watching the video twice from start to finish to get all the data we need, which is about 3 hours. Add some time for note taking and formatting, and this should have taken no more than 4 hours. Instead it took 6.
The mistake I made was relying on YouTube's word recognition in the automatically generated captions; it struggles with proper nouns and soft-spoken words, and it's useless when multiple people are trying to talk at the same time.
There are many different places online where we can find full transcripts. Debates.org has all transcripts prior to 2016. It shouldn't be difficult to write a script that converts YouTube's automatically generated `.vtt` file into just the text of what was spoken and diffs it against the full transcript. If we can get the formatting close, the diff will shake out a lot of the proper-noun and soft-spoken-word issues.
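A sketch of that conversion (hypothetical, assuming the timing-tag format described above):

```ts
// Hypothetical sketch: strip cue metadata and timing tags from a YouTube
// auto-generated .vtt, leaving just the spoken text, ready to diff against
// a full transcript from a source like debates.org.
function vttToPlainText(vtt: string): string {
  return vtt
    .split("\n")
    // Drop the header, cue timing lines, and blank lines.
    .filter(
      (line) =>
        line.trim() !== "" &&
        !line.startsWith("WEBVTT") &&
        !line.includes("-->")
    )
    .join(" ")
    // Drop inline tags like <00:00:01.319> and <c>...</c>.
    .replace(/<[^>]+>/g, "")
    .replace(/\s+/g, " ")
    .trim();
}
```

A real version would also have to handle YouTube's habit of repeating the previous cue's text in each new cue, or the output will be full of duplicates.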
Then we'll need to add text where multiple people are speaking at the same time, and timestamp each word in `.vtt` format. The XML format is ugly and slow to work with, but when all of the words are known beforehand it's a bit easier.
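For example, once the words and timestamps are known, emitting the word-level tags could look like this (hypothetical sketch; timestamp rounding simplified):

```ts
// Hypothetical sketch: format timed words back into YouTube's word-level
// tag syntax, e.g. "<00:00:01.319><c> the</c>".
interface TimedWord {
  word: string;
  timestamp: number; // seconds, as in the loader sketch above
}

function toTimestamp(seconds: number): string {
  const hh = String(Math.floor(seconds / 3600)).padStart(2, "0");
  const mm = String(Math.floor((seconds % 3600) / 60)).padStart(2, "0");
  const ss = String(Math.floor(seconds % 60)).padStart(2, "0");
  const ms = String(Math.floor((seconds % 1) * 1000)).padStart(3, "0");
  return `${hh}:${mm}:${ss}.${ms}`;
}

function toWordTags(words: TimedWord[]): string {
  return words
    .map((w) => `<${toTimestamp(w.timestamp)}><c> ${w.word}</c>`)
    .join("");
}
```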
Once the `.vtt` file is 100% correct, then we need to determine the word-range of each speaker:

```
{
  speaker: "Raddatz",
  range: [0, 7]
},
{
  speaker: "Cooper",
  range: [8, 89]
},
{
  speaker: "Raddatz",
  range: [90, 219]
},
...
```
We already have a script that gives us the index of each word in the `.vtt` file. Cross-referencing this with the transcript we got from a place like debates.org, we should be able to create this word/range map pretty easily.
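A sketch of that cross-reference (hypothetical; it assumes the transcript has already been parsed into per-speaker turns):

```ts
// Hypothetical sketch: count how many words each speaker's turn consumes
// to produce the speaker/range map with inclusive word indices.
interface Turn {
  speaker: string; // e.g. "Raddatz"
  text: string; // everything that speaker said before the next turn
}

interface SpeakerRange {
  speaker: string;
  range: [number, number];
}

function buildSpeakerMap(turns: Turn[]): SpeakerRange[] {
  const map: SpeakerRange[] = [];
  let nextIndex = 0;
  for (const turn of turns) {
    const wordCount = turn.text.trim().split(/\s+/).length;
    map.push({
      speaker: turn.speaker,
      range: [nextIndex, nextIndex + wordCount - 1],
    });
    nextIndex += wordCount;
  }
  return map;
}
```

This only works if the transcript's word tokenization matches the `.vtt`'s exactly, which is why the final watch-through in the next step matters.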
The very last step is to watch the video and make sure our data model is accurate, especially around speaker transitions.
I'll try this with our next sample debate, the 2nd Carter/Ford debate, tomorrow. Maybe they won't talk over each other as much as Trump and Hillary did.
Once I get the process nailed down, I'll document it in the wiki so others can follow it for the remaining videos.
This is now implemented! There's a performance follow up in #115, a minor bugfix in #104, and plenty of work to do for easier tooling in #122, but we've got a first cut.
You can see the result here: https://mytake.org/foundation
We can upload captions to YouTube, and they have a service that automatically syncs a transcript with the video:
https://support.google.com/youtube/answer/2734796?hl=en
You can download this synced transcript via the API:
https://developers.google.com/youtube/v3/docs/captions/download
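A sketch of that call (hypothetical; `captions.download` requires OAuth credentials from the channel owner, and the caption track id comes from a prior `captions.list` request):

```ts
// Hypothetical sketch of downloading a caption track via the Data API v3.
async function downloadCaptions(
  captionId: string,
  accessToken: string
): Promise<string> {
  const url = `https://www.googleapis.com/youtube/v3/captions/${captionId}?tfmt=vtt`;
  const response = await fetch(url, {
    headers: { Authorization: `Bearer ${accessToken}` },
  });
  if (!response.ok) {
    throw new Error(`captions.download failed: ${response.status}`);
  }
  return response.text(); // the caption file body
}
```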
It's also possible to force the captions to display in the embed:
http://www.3playmedia.com/2014/05/13/force-closed-captions-appear-youtube-videos/
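For reference, that boils down to adding the `cc_load_policy=1` player parameter to the embed URL. A minimal sketch, with a placeholder video id:

```ts
// Hypothetical sketch: build an embed URL that forces captions on.
const videoId = "VIDEO_ID"; // placeholder
const embedUrl = `https://www.youtube.com/embed/${videoId}?cc_load_policy=1`;
```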
We're going to need a way to search transcripts to make it easy to find the clip you want, and then automatically sync the video to that spot in the transcript. It looks like YouTube has all the raw pieces of that, but it will be a decent chunk of work to tie them all together in an easy-to-use way.
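As a sketch of the search half, reusing the word/timestamp shape from above (hypothetical):

```ts
// Hypothetical sketch: find every timestamp at which a phrase is spoken,
// so the player can be scrubbed straight to each hit.
interface TimedWord {
  word: string;
  timestamp: number;
}

function findPhrase(words: TimedWord[], phrase: string): number[] {
  const target = phrase.toLowerCase().split(/\s+/);
  const hits: number[] = [];
  for (let i = 0; i + target.length <= words.length; i++) {
    const matches = target.every(
      (t, j) => words[i + j].word.toLowerCase() === t
    );
    if (matches) {
      hits.push(words[i].timestamp);
    }
  }
  return hits;
}

// e.g. findPhrase(words, "wall") finds every time "wall" was spoken;
// combined with the speaker map, we could filter to just Trump's hits.
```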