shaka-project / shaka-packager

A media packaging and development framework for VOD and Live DASH and HLS applications, supporting Common Encryption for Widevine and other DRM Systems.
https://shaka-project.github.io/shaka-packager/

Syntax for specifying CEA captions characteristics when packaging HLS #986

Open Canta opened 3 years ago

Canta commented 3 years ago

System info

Operating System: Ubuntu 18.04.5 LTS (dockerized) Shaka Packager Version: b7ef11f-release

Issue and steps to reproduce the problem

I'm deploying streams with CEA-608 closed captions. The captions do reach the players and are correctly displayed. So far, so good. However, I can't find any HLS-related syntax to announce a name for the captions to the players (for example, their language). Therefore, this is what I see when playing:

[screenshot]

That is not the case with audio tracks, where I can use hls_name, so this is what the players show:

[screenshot]

And I would also like to point out that DASH does not have this problem: as part of the closed-captions syntax I can specify the language, and that's relevant information for the players, as seen here:

[screenshot]

That's the same H.264 stream, with the same CEA-608 bytes as in the previous screenshots, but packaged to DASH instead of HLS.

So, my question: is there any way to set an hls_name, or even a language, on closed-caption streams for HLS?

As a side note (not entirely sure if it's relevant), I've seen this line of code: https://github.com/google/shaka-packager/blob/master/packager/hls/base/master_playlist.cc#L255 There, it's stated that "cea is not supported as output, as it's just input". I disagree: the CEA-608 captions are clearly part of the output, and what we need is a way to manually tell Shaka (it doesn't even require any smart logic) how to state the captions' characteristics in the HLS playlists. That line seems to hardcode "no captions", which is simply not true, with no way to change it. If Shaka does not remove the captions, then it's wrong to say there are none, and we should have a way to tell the players what they need in order to manage the user experience correctly.

Packager Command:

packager \
"in=udp://127.0.0.1:12345, stream_selector=0,segment_template=/path/streamname/240_\$$Time%013d\$$_hls.ts,playlist_name=240.m3u8,drm_label=SD,bandwidth=512000,cc_index=0" \
"in=udp://127.0.0.1:12345, stream_selector=1,segment_template=/path/streamname/360_\$$Time%013d\$$_hls.ts,playlist_name=360.m3u8,drm_label=SD,bandwidth=768000,cc_index=0" \
"in=udp://127.0.0.1:12345, stream_selector=3,segment_template=/path/streamname/720_\$$Time%013d\$$_hls.ts,playlist_name=720.m3u8,drm_label=SD,bandwidth=3072000,cc_index=0" \
"in=udp://127.0.0.1:12345, stream_selector=4,segment_template=/path/streamname/audio_\$$Time%013d\$$_hls.ts,playlist_name=audio.m3u8,drm_label=SD,bandwidth=64000,language=spa,hls_name=Español" \
"in=udp://127.0.0.1:12345, stream_selector=5,segment_template=/path/streamname/audio2_\$$Time%013d\$$_hls.ts,playlist_name=audio2.m3u8,drm_label=SD,bandwidth=64000,language=eng,hls_name=English" \
--io_cache_size 10000000 \
--hls_master_playlist_output /path/streamname/master.m3u8 \
--hls_playlist_type LIVE \
--segment_duration 3.2 \
--time_shift_buffer_depth 30 \
--preserved_segments_outside_live_window 30 \
--default_language=spa
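For context, based on the master_playlist.cc line mentioned above, the variant entries in the generated master playlist come out with captions hardcoded to NONE, roughly like this (illustrative, not actual Packager output; the CODECS and AUDIO values here are made up):

```
#EXT-X-STREAM-INF:BANDWIDTH=512000,CODECS="avc1.4d401e,mp4a.40.2",AUDIO="default-audio-group",CLOSED-CAPTIONS=NONE
240.m3u8
```

with no way to replace that NONE with a proper CLOSED-CAPTIONS group.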

Extra steps to reproduce the problem?

What is the expected result?

Some way to tell the players a description of the captions for HLS.

What happens instead?

The captions are shown as "unknown". Shaka Packager does not seem to have any syntax to change that, and it also seems to compulsively state "there are no captions" (even when that's not the case).

xavierlaffargue commented 11 months ago

@joeyparrish I would like to know if you'd agree to adding a feature for signalling the closed captions, in both HLS and DASH; for example in HLS (perhaps reusing the cc_index argument):

#EXT-X-MEDIA:TYPE=CLOSED-CAPTIONS,GROUP-ID="cc1",LANGUAGE="en",NAME="English",DEFAULT=NO,AUTOSELECT=YES,INSTREAM-ID="CC1"
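For the proposal to be complete, the variant streams would then point at that group via the CLOSED-CAPTIONS attribute (as defined in RFC 8216); for example (BANDWIDTH and the URI are illustrative):

```
#EXT-X-STREAM-INF:BANDWIDTH=512000,CLOSED-CAPTIONS="cc1"
240.m3u8
```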

xavierlaffargue commented 11 months ago

I've started working on this, but I need your advice @joeyparrish: should I add a new stream selector (something like closed captions?) and use cc_index with it?

xavierlaffargue commented 4 months ago

@joeyparrish @cosmin: I would like to help with this feature, but I don't know where to start.

cosmin commented 4 months ago

I'm fairly sure CEA-608 is not intended to be supported in the output. The canonical way to do this in Shaka would be to read in the CEA-608 subtitles and then write out WebVTT for HLS, where you can then specify the name.

I'm curious why would you want to use CEA-608 rather than WebVTT?

Canta commented 4 months ago

oh boy...

@cosmin

I'm curious why would you want to use CEA-608 rather than WebVTT?

TL;DR version

Well, the TL;DR, kinda-rude-yet-no-BS version is quite simple: if somebody asks us to put CEA-608 in the output, then we need CEA-608 in the output, period. You can debate that all you want, but neither your opinion nor mine is going to change that situation any time soon: we are workers, software is a tool in this context, and we are being ordered to do stuff we need to do if we want to keep paying the bills; we're not talking about doing what we want here. And if you think this is some inappropriate exaggeration, just think for a minute about DRM problems in production at 03:00 AM, and most likely we'll end up hugging in tears remembering war scenes: that's the kind of sentiment at play here regarding CEA-608.

Long version

Now, the longer, complicated, more-insightful-and-apt-for-a-debate version is as follows: I work with 24/7 live streams. TV over the internet. That's the context in which I opened this issue. It's an important detail, and I don't think it's noted in the issue's description.

That said, there's a bias in the multimedia software ecosystem towards prioritizing VOD use cases over LIVE. You can find it everywhere, from ffmpeg to the Shaka project itself. "Prioritizing" may not be the proper word; it's more about not considering LIVE scenarios when thinking about use cases, making LIVE a de facto second-class citizen. And I believe your question is another of those "why don't you just do this instead of that" rationales, where "this" would be very simple for a VOD use case: "just convert CEA-608 to WebVTT, why would you want CEA-608 at all?".

That's one dimension of the problem. And even for that dimension alone, I can tell you our input in the 24/7 live business is not that easy to handle. You have a server in a datacenter, and you need to do on-the-fly tasks: there's no "start" or "end" for a 24/7 live stream, you need to do lots of stuff for the output to be up to the required standards, everything needs to be lightning fast OR ELSE, and 9 times out of 10 you can do nothing about your input, which can reach you all kinds of broken (and when you ask for help online, the great minds always tell you "oh, but your input is broken", and act as if you should fix that first). You have different providers with different codecs, codec versions, and codec configurations; all around you are hardware black boxes like IRDs and encoders that ask you for thousands of dollars to do stuff like enabling an open source protocol or transcoding to another format; and those are the good days, when stuff is just working as intended: you get dozens of megabits of something you can't customize, and your server needs to normalize and package that ASAP.

And the first part there is decoding. If you want to, let's say, convert some stuff into other stuff, you need to decode your input first, which already consumes lots of resources and adds some delay; you can't just do things multiple times in several steps, like "first I get the audio, then I do this with the video, and after that let's do this other thing for the text, so I can later mix it all". You need one tool that allows you to do all of that in one single step, fundamentally because you need to decode that input only once.

Now, what do you need to do to decode CEA-608? Decode video, decode subtitles, decode captions, or what? Yeah, it seems "captions" and "subtitles" are two different things depending on who you're talking to, but more on that later. The point here is: if you want to convert INPUT CEA-608 on the fly, you need to get and parse those bytes first, which means reading dozens of megabits per second and most likely blocking some CPU core, when not eating bandwidth from dedicated decoding hardware or the network itself. It's very easy to just trigger a parallel ffmpeg consuming the same multicast input for some particular task, and then suddenly begin to see strange collateral effects on unrelated stuff because you're saturating the networking equipment by consuming the same stream too many times. And that's when you actually have multicast, which lets you do multiple simultaneous input connections easily! Let's not even consider the totally common case where you get 50 programs inside a single URL. So, unlike the VOD case where you can just add another tool to the pipeline, with LIVE either you have a good all-terrain Swiss-army-knife kind of tool or you're in deep trouble.

Earlier I talked about a "VOD bias", but you actually seem to imply that the conversion between CEA-608 and WebVTT should be done by Shaka Packager. And how's that going to be done, exactly? You see, CEA-608 is neither image nor text: it's interpreted bytes. It needs its own ad-hoc parsing and conversion tables, needs to contemplate several "modes" and dialects, there are UTF-8 issues to work through when converting to text-based subtitle files, and there are even implementation-detail mismatches between codecs, like colors, positions on the screen, or appearing/disappearing behaviour, that need to be interpreted before being implemented in the output conversion. And I'm even being respectful and saying "CEA-608" without mentioning 708, which is the actual superset in use.

And don't get me started on timestamps. You can get away with extracting CEA-608 from the H264 or MPEG2 decoder, but then you need to convert it into something else entirely, which no longer travels along with the video packets (nor the audio, for that matter), and doesn't even honor the same timing rationale. If you have a multicast input you'll most likely have it in MPEG-TS format, which in turn has its PTS roll over every 26.5 hours, and good luck going to sleep pretending your subtitles will still be there the next day when you do your own math converting PTSs that change on the input itself without warning or explanation: you want to blame the rollover, yet some other random PTS came from the satellite and that's "now" from now on, because who knows why, that's life, somebody pressed a "reset" button somewhere and your timestamps are nowhere near any rational point in time.
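For anyone who hasn't met this: an MPEG-TS PTS is a 33-bit counter ticking at 90 kHz, so it wraps every 2^33 / 90000 s, roughly 26.5 hours. A minimal sketch of the usual unwrapping trick (`unwrap_pts` is a hypothetical helper; it handles the periodic rollover, but not the arbitrary upstream discontinuities described above, which need their own detection logic):

```python
PTS_MOD = 1 << 33  # MPEG-TS PTS/DTS are 33-bit counters at 90 kHz

def unwrap_pts(prev_unwrapped: int, raw_pts: int) -> int:
    """Extend a raw 33-bit PTS into a monotonically increasing value.

    Assumes consecutive timestamps are less than half a rollover period
    (~13 hours) apart, so the shortest modular distance is the real one.
    """
    delta = (raw_pts - prev_unwrapped) % PTS_MOD
    if delta > PTS_MOD // 2:  # shorter to go backwards: a small negative step
        delta -= PTS_MOD
    return prev_unwrapped + delta

# A timestamp just past the 33-bit wrap keeps increasing instead of jumping back:
assert unwrap_pts(PTS_MOD - 900, 0) == PTS_MOD
```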

However, all of that is the case only when CEA-608 is your INPUT. You have several beasts running wild inside the datacenter, and CEA-608 is just one of them. There are also DVB subtitles, for example, which IIRC may or may not be image based. And DVB teletext, which is text but not subs, yet you can use it as subs, yet you need a different decoder. I already stated you don't control any of that: somebody just sends it to you, those people don't negotiate anything with you, and you need to handle it ASAP. If Shaka Packager were the one to deal with the conversion from CEA-608 to WebVTT, is it also going to do the same for the rest of the input subtitle formats, using the same rationale of "why would you want anything else other than WebVTT"? If that's the case, you must really hate Shaka Packager to charge it with such an unnecessary burden when nobody here is asking it to convert anything.

And that's another dimension of the problem. You seem to imply that the conversion should perhaps be done by Shaka, as if it were its responsibility. Last time I checked, Shaka Packager had limited support for input/output subtitle combinations. Like, it did support DVB subtitles as input, but only outputting image-based TTML. Maybe that changed over time (consider this issue is from 2021), but the point is Shaka Packager doesn't do OCR, and once you start assigning it subtitle-format-conversion responsibilities you're talking about OCR sooner rather than later. That, and of course the inverse case too: from text to image-based subtitle formats. You'll need some intermediate/abstract internal format from which you do all the conversion magic. Are you in for that ride while philosophically debating software responsibilities, or do you just want to get rid of CEA-608?

But I feel you, man: who would ever like CEA-608 with wonderful stuff like WebVTT around? Well, you see, there is this other thing called "culture", and it seems different folks are accustomed to different ways of life, and so there are people with TVs and TV-companion devices that handle CEA-608 just fine, while WebVTT not so much. And there's also stuff like "history" and "governments", which in turn give birth to regulations and standards, and so you can't just stream what you believe to be fine if there's some rule somewhere that says "thou shalt use CAPTIONS besides SUBTITLES" or something like that. You see, the text traveling in the CEA-608 bytes is not the same as the DVB subtitle/teletext/whatever also traveling in the same MPEG-TS program for the same TV channel, and both fulfill different and very contextual responsibilities. There are people telling you that "captions are used in the context of accessibility while subtitles are used in the context of translations (which of course are not the same as transcriptions, how could anybody even imply those are the same thing?)", there are also people using CEA-608 for different kinds of messages to the general public, and so on and so on. When you also consider that South and Central America have very, very varying hardware, from ancient to cutting-edge, you gotta respect your lowest common denominators when you find some.

Some experienced extra context for you to consider:

Back in 2021 I had to deal with compatible-enough subtitles ASAP, and succeeded in building my own dvbsub-ocr-to-cea608 ffmpeg filter, so I could create CEA-608 on the fly and my peers and I could keep having a job. Why CEA-608 and not WebVTT? Because:

  1. CEA-608 travels as data inside the media packets, so it didn't need any timing/sync logic: you get a byte, you put it in the current video frame, and that's it.
  2. Every tool downstream honors it, because none of them want to deal with it. (And when they do, most likely the only option is to clear that data, not interpret it.)
  3. Apple's standard considers it, and so do their players, even with FairPlay in place.
  4. ffmpeg lets you do OCR on the fly for image-based subtitle input, but doesn't let you use subtitle filters, so you can only handle that text as video/audio metadata.

I was very happy with it. Except Shaka Packager had this problem, and so HLS players had that other problem, and so I had to do playlist manipulations, which implied adding an extra packaging step with file-system monitoring (or unreliable timers), all of which complicated the deployments, added extra points of failure, added I/O operations on servers that are serving files, et cetera, just because there's no captions-language=something option.

Eventually I had to scrap it all, and at the end of the day I found a way to do WebVTT: first with another custom ffmpeg filter I made myself, then with an entire ffmpeg fork with subtitle filtering added, made by an amazing fella from Germany I found online, which had much better OCR quality than mine. That adventure had a happy ending, but it was a total struggle, not everyone on my team could tell the tale at the end of the road, and this silly thing reported in this issue was a pain.

In conclusion

Now, why ANYONE, ANYWHERE, would WANT to deal with ANY of that mess is as much a mystery to me as it seems to be to you. But with all that considered, the first rule of CEA-608 is: you just leave it there when it's there and don't touch anything; it's already in its proper place at the proper time, it already says what it needs to say in the way it needs to be said, and you don't have to make it better. It's what every sane video decoder/encoder does, and it's what Shaka Packager also does. Which leads us to this other comment:

I'm fairly sure CEA-608 is not intended to be supported in the output

Except it is supported: intended or not, it's already there in the output (because Shaka Packager is sane enough to not try to do anything with this thing), and with MPEG-DASH we can already describe that data without problems. We're not asking to debate the subtitles canon and its engineering details: we just want a silly command-line flag for a silly string injection into another string (the playlists).

So, it would be REALLY nice to have shaka packager just let us apply a description for the already present CEA-608 output bytes when we need to deal with CEA-608 in the output.

I've done several modifications to Shaka Packager, and back then I thought of adding this extra option myself, but in 2021 the project was already kind of unmaintained and lacking muscle, so it seemed like the wrong choice to also take on the responsibility of maintaining that code. Now that the project is back on its feet and there's somebody willing to give this issue some love, please, PLEASE, just let that good soul add that frigging extra option, so the next person needing to deal with CEA-608 output doesn't have to suffer any of this.

cosmin commented 4 months ago

Pull requests are certainly welcome.