jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it does not require an API key or a headless browser, as other Selenium-based solutions do!
MIT License

Live Stream transcripts #98

Open frisch1 opened 3 years ago

frisch1 commented 3 years ago

Hello. This is a feature request rather than a bug, methinks.

Have you looked at extracting captions from a live stream? If you look at any example of a live stream (https://www.youtube.com/whitehouse), while the stream is live (that's the key), auto-generated subtitles are delivered embedded in the videoplayback file that streams in, e.g.:

https://r6---sn-8xgp1vo-p5qy.googlevideo.com/videoplayback?expire=1614211486&ei=PpU2YM-ULYm98wTm0L_gDA&ip=71.246.232.10&id=yhxmnlGtJ-g.1&itag=386&source=yt_live_broadcast&requiressl=yes&mh=zc&mm=44,29&mn=sn-8xgp1vo-p5qy,sn-p5qs7nel&ms=lva,rdu&mv=m&mvi=6&pl=18&initcwndbps=1717500&vprv=1&live=1&hang=1&noclen=1&xtags=lang=en:ttkind=asr&mime=text/mp4&ns=aD6U7aY6idhNPyXEqiXu6K0F&gir=yes&mt=1614189620&fvip=6&keepalive=yes&fexp=23983797&beids=9466586&c=WEB&n=lmOMV3MuzrpzRQ&sparams=expire,ei,ip,id,itag,source,requiressl,vprv,live,hang,noclen,xtags,mime,ns,gir&sig=AOq0QJ8wRAIgd0qHHqBF3aRir-pw93UKhFNuFxrlpe6OqyMerxsZ4JsCIHZK74UbKX7ig08-egt6vMDzP6g_7EhOyuOOoUXAkSVW&lsparams=mh,mm,mn,ms,mv,mvi,pl,initcwndbps&lsig=AG3C_xAwRAIgHa9tABbFKMiVQSnLLWa7iO_iu7pcVtrea43G-zdfGBUCIGbqOL15uN0-32Yki8s5vwXD2XDkvCBUgntS54w9xvjc&alr=yes&cpn=LW2TAYe5jfbjzMjx&cver=2.20210223.09.00&sq=664

Expired by now, of course, but as an example, the payload here is:

[binary MP4 box data: ftyp, moov/mvhd, mvex/trex, trak/tkhd, mdia/mdhd, hdlr "text", minf/dinf/dref/url, stbl/stsd "tx3g"/ftab, stts, stsc, stco, stsz, nmhd]
emsg http://youtube.com/streaming/metadata/segment/102015
Sequence-Number: 664
Stream-Finished: F
Ingestion-Walltime-Us: 1614189870022158
Stream-Duration-Us: 3320017000
Max-Dvr-Duration-Us: 14400000000
Target-Duration-Us: 5000000
Encoding-Alias: L1_Ag

[binary MP4 box data: moof/mfhd, traf/tfhd, tfdt, trun, mdat]
<?xml version="1.0" encoding="utf-8" ?><timedtext format="3">
<body>
<p t="0" d="345">what&#39;s in the Declassified
report or when it comes out</p>
<p t="345" d="3750">because many elements of Italy
two years ago when when it was</p>
<p t="4095" d="910">first first came out if you come
to the conclusion that there</p>
</body>
</timedtext>

The timedtext is embedded in the file:

<?xml version="1.0" encoding="utf-8" ?><timedtext format="3">
<body>
<p t="0" d="345">what&#39;s in the Declassified
report or when it comes out</p>
<p t="345" d="3750">because many elements of Italy
two years ago when when it was</p>
<p t="4095" d="910">first first came out if you come
to the conclusion that there</p>
</body>
</timedtext>

It's not TTMLv3, but from the URL we know this text is associated with sequence #664. The t= attribute appears to be a millisecond offset relative to the start of the sequence chunk, and d= appears to be the duration. But even absent that, the stream of text is there. Note that it doesn't appear by default: it looks like you need to add "xtags" to the "sparams" list in the URL to get the live captioning, but if you insert it yourself it invalidates the hash/key tied to the URL, so the captions apparently have to be switched on some other way (cc_load_policy=1 in the URL does NOT seem to work).
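For illustration, here's a minimal sketch (not part of this library, and assuming the t=/d= interpretation above is correct) of how one of those timedtext documents could be turned into transcript snippets:

# Sketch only: parse a <timedtext format="3"> document like the one above.
# Assumes (per the interpretation above) that t is a millisecond offset
# relative to the start of the segment and d is a millisecond duration.
from xml.etree import ElementTree

def parse_timedtext(xml_data, segment_start_seconds=0.0):
    """Return a list of {'text', 'start', 'duration'} dicts, times in seconds."""
    if isinstance(xml_data, str):
        # fromstring() rejects str input that carries an encoding declaration
        xml_data = xml_data.encode('utf-8')
    root = ElementTree.fromstring(xml_data)
    snippets = []
    for p in root.iter('p'):
        text = ''.join(p.itertext()).strip()
        if not text:
            continue
        snippets.append({
            'text': text,
            'start': segment_start_seconds + int(p.get('t', 0)) / 1000.0,
            'duration': int(p.get('d', 0)) / 1000.0,
        })
    return snippets

Running this over the example above would yield three snippets starting at 0.0, 0.345 and 4.095 seconds into the chunk.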

youtube-dl et al. don't recognize this, since it's not being delivered as a standalone subtitle file. They act as if there are no subtitles on the live stream, because the data doesn't identify itself as a subtitle file.
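To make the idea concrete, here's a rough, hypothetical sketch of pulling the embedded timedtext out of a segment, assuming you already have the raw bytes of one of those videoplayback responses (obtaining a valid signed URL with the xtags parameter is the unsolved part):

# Hypothetical sketch: find embedded <timedtext> documents in the raw bytes
# of one videoplayback segment (itag 386, xtags=lang=en:ttkind=asr in the
# example URL above). Getting a valid signed segment URL is not shown.
import re

TIMEDTEXT_PATTERN = re.compile(rb'<\?xml.*?</timedtext>', re.DOTALL)

def extract_timedtext(segment_bytes):
    """Return each <timedtext> XML document found in the segment, as str."""
    return [match.group(0).decode('utf-8', errors='replace')
            for match in TIMEDTEXT_PATTERN.finditer(segment_bytes)]

Each extracted document could then be fed to the parser sketched above, offsetting its timestamps by the segment's position in the stream (roughly sequence number × Target-Duration-Us).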

Thoughts?

jdepoix commented 3 years ago

Hi @frisch1, I would definitely say that this is a feature request and not a bug. Sounds interesting, but I don't see myself implementing this anytime soon, as this module is mostly used for data-science purposes and I don't really see the use case for livestreams. However, if you want to contribute this feature I'd be happy to merge it. Deserializing the response probably isn't a big deal; you just gotta find out how to scrape the URL you'll have to call to actually get that response. Let me know if you have that figured out and are interested in contributing it, so we can have a chat on how to implement this into the current API 😊
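Purely as a starting point for that chat, a hypothetical polling loop (building on the two sketches above, and assuming consecutive segments can be fetched by incrementing the sq= parameter of a valid signed videoplayback URL, which is exactly the scraping part that still needs to be figured out) might look like this:

# Hypothetical only: poll a live stream for newly published caption segments.
# build_segment_url is an assumed callable mapping a sequence number to a
# valid signed videoplayback URL (the scraping problem mentioned above).
import time
import requests  # assumption: a plain HTTP GET works once the URL is valid

def poll_live_transcript(build_segment_url, start_sequence, interval_seconds=5):
    """Yield transcript snippets as new segments become available.

    interval_seconds defaults to 5, matching the Target-Duration-Us of
    5000000 seen in the segment metadata above.
    """
    sequence = start_sequence
    while True:
        response = requests.get(build_segment_url(sequence), timeout=10)
        response.raise_for_status()
        for xml_document in extract_timedtext(response.content):
            # rough absolute offset: sequence number x target segment duration
            for snippet in parse_timedtext(
                xml_document,
                segment_start_seconds=sequence * interval_seconds,
            ):
                yield snippet
        sequence += 1
        time.sleep(interval_seconds)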