shaka-project / shaka-player

JavaScript player library / DASH & HLS client / MSE-EME player
Apache License 2.0

Selecting an audio track by id? #924

Closed bhh1988 closed 4 years ago

bhh1988 commented 7 years ago

Correct me if I'm wrong, but currently it looks like the only way to select audio tracks is to call selectAudioLanguage(language, role), or selectVariantTrack(track).

What happens if there are multiple audio tracks with the same language and no role? I can't use selectAudioLanguage because the call is ambiguous. In the MPD manifest, each audio track has a representation id, so selecting a particular audio track is well-defined there, but shaka-player's API offers no equivalent. The closest thing I can currently do is call getVariantTracks(), find the variant track that has the same video quality as the currently playing variant but a different audio track, and then call selectVariantTrack() with that track. This is cumbersome, and I'm not convinced it will work correctly: will adaptive switching keep the audio the same? How do the audio ids correspond to the manifest's representation ids?
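The workaround described above can be sketched roughly as follows. This is only an illustration under assumptions: it assumes each variant track exposes `active`, `videoId`, and `audioId` fields, and the helper names `findVariantWithAudio` and `switchAudioKeepingVideo` are hypothetical, not part of the library. ABR is turned off first, since a manual selection is otherwise free to be replaced by the next adaptation.

```javascript
// Hypothetical helper: find the variant that keeps the currently active
// video stream but carries the requested audio stream.
// Assumes every track has `active`, `videoId`, and `audioId` fields.
function findVariantWithAudio(tracks, wantedAudioId) {
  const active = tracks.find((t) => t.active);
  if (!active) {
    return null;
  }
  const match = tracks.find(
      (t) => t.videoId === active.videoId && t.audioId === wantedAudioId);
  return match || null;
}

// Hypothetical wrapper around a Shaka-like player object.
function switchAudioKeepingVideo(player, wantedAudioId) {
  const target = findVariantWithAudio(player.getVariantTracks(), wantedAudioId);
  if (!target) {
    return false;
  }
  // Disable ABR so the manual choice is not overridden by adaptation.
  player.configure({abr: {enabled: false}});
  player.selectVariantTrack(target, /* clearBuffer= */ true);
  return true;
}

// Made-up sample data: two video qualities combined with two audio streams.
const sampleTracks = [
  {id: 10, videoId: 0, audioId: 2, active: false},
  {id: 11, videoId: 1, audioId: 2, active: true},
  {id: 12, videoId: 0, audioId: 3, active: false},
  {id: 13, videoId: 1, audioId: 3, active: false},
];
```

The cost of this approach is visible in the sketch: with ABR disabled, the video quality is pinned as well, which is part of why it feels cumbersome.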

One suggestion is to allow selecting the audio track by some sort of ID, but I'm open to other solutions - do people have any suggestions for how to address this type of use-case?

Our application processes videos coming from random users, and we cannot guarantee that all the audio in the video has distinct languages/roles. Currently we're resorting to putting a unique value (e.g. "audio1", "audio2") for the "role" in the manifest, in order to ensure uniqueness across all audio tracks in a video, but this is a non-standard abuse of the "role" property in the MPD, and only happens to work because shaka still reads and accepts non-standard role values.

joeyparrish commented 7 years ago

If there are multiple audio tracks with the same language and no role, we have no way to differentiate between them other than by ID. If they contain different content, they need to be differentiated either by language or role. Otherwise, the ABR system may choose to adapt between them.

If the audio tracks contain different content (such as different languages, or main audio vs commentary, etc), and there is no difference in either the language or role attributes, then this would be bad content. There would be no way for the ABR system to handle them correctly.

bhh1988 commented 7 years ago

Thanks for answering. Should shaka-player be adapting within the same id instead? That seems to be what the DASH spec would require (adapting within the same representation ID).

Given our constraints, do you recommend I just proceed with my current strategy of putting unique values for the role of each track, even though they are non-standard values? I feel a little uncomfortable relying on the fact that Shaka currently accepts non-standard role values

joeyparrish commented 7 years ago

I don't think I understand what you mean when you say "adapting within the same representation ID". Do you mean within the AdaptationSet?

I was under the impression that there were no "standard" set of role values. The DASH spec has this to say on roles:

Section 5.3.3 Adaptation Sets Section 5.3.3.1 Overview

The values for the elements Role, Accessibility, Viewpoint and Rating are generally not provided within the scope of this part of ISO/IEC 23009. However, a number of simple schemes are defined in 5.8.5.

Section 5.8.4.2 Role

For the element Role the @schemeIdUri attribute is used to identify the role scheme employed to identify the role of the media content component. Roles define and describe characteristics and/or structural functions of media content components.

One Adaptation Set or one media content component may have assigned multiple roles even within the same scheme.

This part of ISO/IEC 23009 defines a simple role scheme in 5.8.5.5.

In addition, this part of ISO/IEC 23009 defines other roles schemes to support signalling for multiple view signals in 5.8.5.6.

5.8.5.5, then is where the "standard" roles are defined:

Section 5.8.5.5 DASH role scheme

The URN "urn:mpeg:dash:role:2011" is defined to identify the role scheme defined in Table 22. Note that Role@value shall be assigned to Adaptation Sets that contain a media component type to which this role is associated.

Table 22 then goes on to define "caption", "subtitle", "main", "alternate", "supplementary", "commentary", and "dub" roles.

So it would seem that there are, in fact, some small set of "standard" roles in DASH, but in general, any scheme may be used with any values. We don't check the scheme, and it wouldn't benefit anyone (as far as I can tell) for Shaka Player to attempt to validate the role values we find in the manifest. Would we reject a manifest with unrecognized roles? Ignore those roles? Would this surprise the app developer, who then has to modify the library to support some new scheme which is explicitly allowed by the spec?

In any case, I don't think you should be putting a unique value into the role of each audio AdaptationSet or Representation. Instead, you should put values with some semantic meaning that makes sense for your application and content. If you are concerned about the introduction of new values, you could use one of the spec'd values from Section 5.8.5.5 Table 22, or you could use a custom schemeIdUri value so that you are free to make up your own roles.

Does this make sense?

bhh1988 commented 7 years ago

Thanks @joeyparrish . Yea I mean adapting within the same adaptation set, not adapting within the same representation ID.

Ok I understand what you mean, and I feel more comfortable about the stability of shaka-player's current behavior for non-standard role values, but one thing to be clear is that it's hard in my use-case to find an appropriate "semantic meaning" for each audio track, because the video files we receive are arbitrary and essentially like a black box - we have no idea what the purpose of each audio track is for. Essentially the only semantic meaning we can derive is their order, so basically name the roles as "1", "2", "3", etc.

The fact that we have to resort to naming our roles "1", "2", "3" feels artificial, and essentially an artifact of the way shaka-player does things. Since shaka-player requires you to uniquely define audio tracks by (language, role) tuples, this forces us to make those tuples unique for every audio track. This is why I posed the question in #947 about exposing the adaptation-set IDs in the variants. I don't have enough context to know if abstracting those IDs away in the variants was a conscious decision though...

joeyparrish commented 7 years ago

I see what you mean about AdaptationSets. In v2.0.x and earlier, we didn't do things in terms of variants. Instead, we would keep streams (Representations) grouped into stream sets (AdaptationSets) internally. So you would not switch outside of your AdaptationSet, but I can't recall whether there was a good way to choose an arbitrary one.

When we moved our internal models to variants to support HLS, we essentially mapped DASH onto HLS. (Going the opposite direction was not feasible.) The loss of AdaptationSet information was a consequence of that, but we were not aware of any use cases where this would be a problem.

I'd love to discuss this further and try to better understand your situation and how we can better meet your needs. In the mean time, is the use of numbered roles sufficient for your purposes as a workaround?

bhh1988 commented 7 years ago

Yes, we can work with the numbered roles for now.

I can explain in a little more detail about our situation. We store arbitrary files uploaded from customers and if we detect the file is a video file, we transcode it into a format that can be viewed on the browser (with shaka-player), so that when users try to "view" the file in the browser, they can watch the contents of the video. On our end, we have no idea what the purpose of the file is and have no semantic information on the individual tracks of the file. If the file happens to have multiple audio tracks, we cannot know which audio tracks are commentary or secondary or main. So using the "role" DASH parameter is inappropriate to us, because we don't know what the roles of each track are. Ideally, our manifest would look something like:

<AdaptationSet contentType="audio" segmentAlignment="true" bitstreamSwitching="true">
  <Representation mimeType="audio/mp4" codecs="mp4a.40.2" bandwidth="128000" audioSamplingRate="44100" id="2">
    <AudioChannelConfiguration schemeIdUri="urn:mpeg:dash:23003:3:audio_channel_configuration:2011" value="2"/>
    <SegmentTemplate timescale="1000000" startNumber="1" duration="5000000" media="audio/0/$Number$.m4s" initialization="audio/0/init.m4s"/>
  </Representation>
</AdaptationSet>

<AdaptationSet contentType="audio" segmentAlignment="true" bitstreamSwitching="true">
  <Representation mimeType="audio/mp4" codecs="mp4a.40.2" bandwidth="128000" audioSamplingRate="44100" id="3">
    <AudioChannelConfiguration schemeIdUri="urn:mpeg:dash:23003:3:audio_channel_configuration:2011" value="2"/>
    <SegmentTemplate timescale="1000000" startNumber="1" duration="5000000" media="audio/1/$Number$.m4s" initialization="audio/1/init.m4s"/>
  </Representation>
</AdaptationSet>

And there would still be a way of selecting between the audio-tracks/adaptation-sets, presumably by an id (hence the original question in this thread). But right now, shaka will surface each of these tracks as language="und" and roles=[], and then there would be no well-defined way to switch between them because they all have the same language/role.
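To make the ambiguity concrete, here is a small sketch that groups variant tracks by their (language, roles) pair and reports any pair that maps to more than one audio stream. The `audioId` field and the function name `findAmbiguousAudio` are assumptions for illustration only; the sample data mirrors the two-AdaptationSet manifest above.

```javascript
// Hypothetical check: report (language, roles) pairs shared by more than
// one distinct audio stream. Assumes tracks carry `language`, `roles`,
// and `audioId` fields.
function findAmbiguousAudio(tracks) {
  const groups = new Map();
  for (const track of tracks) {
    // Sort a copy of the roles so that role order does not matter.
    const key = JSON.stringify([track.language, [...track.roles].sort()]);
    if (!groups.has(key)) {
      groups.set(key, new Set());
    }
    groups.get(key).add(track.audioId);
  }
  const ambiguous = [];
  for (const [key, audioIds] of groups) {
    if (audioIds.size > 1) {
      const [language, roles] = JSON.parse(key);
      ambiguous.push({language, roles, audioIds: [...audioIds]});
    }
  }
  return ambiguous;
}

// Both audio streams surface as language "und" with no roles.
const undTracks = [
  {language: 'und', roles: [], audioId: 2},
  {language: 'und', roles: [], audioId: 3},
];

// The numbered-role workaround makes every pair unique again.
const numberedTracks = [
  {language: 'und', roles: ['1'], audioId: 2},
  {language: 'und', roles: ['2'], audioId: 3},
];
```

With `undTracks` the function returns a single ambiguous group covering both audio ids, while `numberedTracks` yields an empty list, which is exactly what the numbered-role workaround buys.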

Let me know if there's still anything unclear or that you want me to elaborate on.

joeyparrish commented 7 years ago

I'm adding this to the backlog and labeling enhancement so we don't lose track of it. I need to spend some time considering what exactly we would change, if anything. I haven't had time to dig in yet.

srstrong commented 6 years ago

I'll add a +1 to this thread - we have essentially the exact same problem, in that we receive external streams that we need to transcode and package as dash streams. They frequently have multiple audio tracks with the same language, and we have no additional metadata to determine semantic meaning such as 'main', 'dub' etc. Right now, we are using the exact same numbered roles as a way to ensure uniqueness.

sbrez commented 4 years ago

Now that the label parameter is filled in on the audioTrack object (#2178), the problem of choosing the right audioTrack by ID could be solved. We'll need a method like selectAudioLanguage, but taking the ID instead of the language as its parameter. Do you think this functionality could be implemented?

zuzzurro commented 4 years ago

Let me add a small piece of information. Azure Media Services has had this functionality for a very long time in Smooth and HLS. Like @srstrong, we have the issue of multiple audio tracks with the same language but a different "name" (for example, multiple Latin American broadcasters (name) using the same language (Spanish)). With Elemental encoders and Azure, it is quite simple to pass both a Name and a Language. DASH made it more difficult because until now Azure had no standard tag to pass the Name information. In HLS this is output, for instance, as:

#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="audio",NAME="ambient"....

Now that we have the Label, all that remains to add is the final piece outlined above by @sbrez

Hope this makes the request crystal clear.

ismena commented 4 years ago

Hi @sbrez @zuzzurro Let's see if I get the request correctly: you want a way to get all variants where audio track == an audio track with a given role. Similar to how selectAudioLanguage filters out all variants in other languages. Is that right?

zuzzurro commented 4 years ago

I think there are two issues here: API and GUI. Let me give you one example of what we could have:

| Customer | Label | Language |
| --- | --- | --- |
| broadcaster 1 | ABC | en |
| broadcaster 2 | RTVE | es |
| broadcaster 3 | ESPN (english) | en |
| broadcaster 3 | ESPN (spanish) | es |
| broadcaster 4 | CBC | en |
| broadcaster 4 | Radio Canada | fr |

As you can see languages are non unique, but as we control the label at encoder level, we make sure that Labels are.

We choose the track to be played either via API (no user driven switch allowed) or via a GUI (user can choose the audio track based on what is shown).

If you could provide a selectAudioLabel call similar to the selectAudioLanguage one that would be a solution for the API side (and probably also to the GUI since we have our own GUI for that). That would still leave the problem of what Shaka should show when using the builtin chooser.

In general, in my opinion, if you have both a label and a language, the Label should be shown, on the assumption that if it has been created it will be more specific than the language. I can understand, though, that you may have some issues with backward compatibility.

In any case (@sbrez can correct me if I'm wrong), the priority for us would be to have a working API ASAP.

Finally, I think @sbrez had a slightly different view to reach the same goal and it was to allow us to specify the audio by the ID you generate internally in the structure that gets passed to the HTML5 video object.

ismena commented 4 years ago

Ok, so you want the new method to return all tracks with a given label in the currently selected language? Or all tracks with a given label regardless of the language?

In your example, if current language is English, and you're asking for ESPN tracks, would the return value be [ESPN (english)] or [ESPN (english), ESPN (spanish)]?

Either way, if you're up for creating a PR for this, we'll happily accept. Otherwise, I can work on it next week while waiting for code review for my other stuff :)

zuzzurro commented 4 years ago

Well, I wrote that table to clarify our situation and that may be a bit misleading. Let me try to clarify even further.

| Customer | Label | Language |
| --- | --- | --- |
| broadcaster 1 | "ABC" | en |
| broadcaster 2 | "RTVE" | es |
| broadcaster 3 | "ESPN (english)" | en |
| broadcaster 3 | "ESPN (spanish)" | es |
| broadcaster 4 | "CBC" | en |
| broadcaster 4 | "Radio Canada" | fr |

I will never ask for "ESPN" tracks since there's no "ESPN" Label. They have to be considered unique IDs (and as you can see they are unique) so I will just ask for "ESPN (english)" and that will always return one and only one audio track.
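Treating labels as unique IDs, as described here, amounts to an exact-match lookup. A minimal sketch, assuming audio tracks are plain objects with `label` and `language` fields (the helper name `indexByLabel` is hypothetical), using the rows from the table above:

```javascript
// Hypothetical helper: build a label -> track index and fail loudly if
// two tracks share a label, since labels are meant to act as unique IDs.
function indexByLabel(audioTracks) {
  const byLabel = new Map();
  for (const track of audioTracks) {
    if (byLabel.has(track.label)) {
      throw new Error(`Duplicate label: ${track.label}`);
    }
    byLabel.set(track.label, track);
  }
  return byLabel;
}

// Rows taken from the table in this comment.
const broadcasterAudio = [
  {label: 'ABC', language: 'en'},
  {label: 'RTVE', language: 'es'},
  {label: 'ESPN (english)', language: 'en'},
  {label: 'ESPN (spanish)', language: 'es'},
  {label: 'CBC', language: 'en'},
  {label: 'Radio Canada', language: 'fr'},
];

const labelIndex = indexByLabel(broadcasterAudio);
```

Looking up `'ESPN (english)'` in `labelIndex` finds exactly one track, while looking up `'ESPN'` finds nothing, because matching is exact rather than by prefix.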

If we were to provide both AAC and AC-3 versions, I will label each track accordingly again. To make it even more clear, here the Label is just a unique ID even more explicitly.

| Customer | Label | Language |
| --- | --- | --- |
| broadcaster 1 | 1 | en |
| broadcaster 2 | 2 | es |
| broadcaster 3 | 3 | en |
| broadcaster 3 | 4 | es |
| broadcaster 4 | 5 | en |
| broadcaster 4 | 6 | fr |

Does this make sense?

ismena commented 4 years ago

Ah, gotcha. Can do.

You're very welcome to create a PR for this, if you'd like. Otherwise, I'll do my best to squeeze this in later this week or early next one.

ismena commented 4 years ago

Here's a caveat: it sounded like the labels were on audio tracks. Is that right? If we select by label, what you will get is every variant where this audio is present, i.e. a combination of your chosen audio with every video it's compatible with.

Does it make sense?

zuzzurro commented 4 years ago

What do you mean by "every video" in this context? Every rendition?

I was also thinking about something else. If you look at my first post, I had an example of an HLS stream that has a NAME="ambient" attribute. In my model, "NAME" in HLS and "Label" in DASH have the same function. How does Shaka use the NAME HLS attribute and would it be compatible with my DASH proposal?

ismena commented 4 years ago

Re: "every video" - when we added HLS support, we went from operating on the stream level (audio/video) to operating on the variant level (audio+video), so if you have audio streams with specific labels, they will propagate to the variant level. For DASH, every audio stream is compatible with every video stream, so you might end up with multiple variant tracks, each of which has the audio you requested.
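The stream-to-variant expansion described here can be illustrated with a toy model. This is not Shaka's internal code; the function names and the stream and variant shapes below are made up for the example. Each audio stream is paired with each video stream, and the audio label is copied onto the resulting variant.

```javascript
// Toy model: pair every video stream with every audio stream, the way
// DASH content is treated when all audio is compatible with all video.
function buildVariants(audioStreams, videoStreams) {
  const variants = [];
  let nextId = 0;
  for (const video of videoStreams) {
    for (const audio of audioStreams) {
      variants.push({
        id: nextId++,
        audioId: audio.id,
        videoId: video.id,
        label: audio.label,  // The audio label propagates to the variant.
        language: audio.language,
        height: video.height,
      });
    }
  }
  return variants;
}

// Selecting by label therefore yields one variant per video stream.
function variantsWithLabel(variants, label) {
  return variants.filter((v) => v.label === label);
}

const demoAudio = [
  {id: 100, label: 'ESPN (english)', language: 'en'},
  {id: 101, label: 'ESPN (spanish)', language: 'es'},
];
const demoVideo = [
  {id: 200, height: 360},
  {id: 201, height: 720},
  {id: 202, height: 1080},
];
const demoVariants = buildVariants(demoAudio, demoVideo);
```

Here two audio streams and three video streams produce six variants, and filtering by `'ESPN (english)'` leaves three of them, all sharing the same `audioId` and differing only in video.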

You are absolutely right about NAME being treated the same as Label in DASH. This is exactly how we do things today.

zuzzurro commented 4 years ago

We took a look at doing our own PR, but we concluded we don't have enough experience with the internals to make it worthwhile. Can you do it yourself? We are totally available for testing and helping in other ways.

ismena commented 4 years ago

Ok, we're taking this on. FYI for the team and the users, what needs to be done:

@zuzzurro If you have a manifest for testing, please let us know! Thanks.

zuzzurro commented 4 years ago

For the time being there's:

https://amssamples.streaming.mediaservices.windows.net/f1ee994f-fcb8-455f-a15d-07f6f2081a60/Sintel_MultiAudio.ism/manifest(format=mpd-time-csf)

This one doesn't have duplicated languages, but it has at least the Labels.

Working on it..

zuzzurro commented 4 years ago

Can you see whether:

https://globaltechvideostorage.blob.core.windows.net/elephants-dream/manifest(format=mpd-time-csf)

works for you? It has three audio tracks, two of them in English.

ismena commented 4 years ago

Thanks for the test streams - very helpful! The change has just been checked in to master. The new player method selectVariantsByLabel() accepts a label string.

Feel free to take a look :)

zuzzurro commented 4 years ago

Wow, that's great. We'll check it right away. Just one more question from me. If, as you said earlier, the NAME tag in HLS has the same purpose as the Label tag in DASH, what is the player method that uses that for audio selection?

ismena commented 4 years ago

Same method: player.selectVariantsByLabel().

We propagate the NAME attribute to the same "label" field on the variant track. So a DASH Label ends up as a track.label attribute, and an HLS NAME attribute also ends up as a track.label attribute.
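Since both manifest formats end up in `track.label`, application code can stay format-agnostic. A hedged usage sketch, assuming the method is exposed as `selectVariantsByLabel(label)` as in the released API documentation; the wrapper `selectAudioByLabel` is hypothetical, and the existence check is only there to avoid asking for a label that no variant carries:

```javascript
// Hypothetical wrapper around a Shaka-like player object. It works the
// same for DASH and HLS content because both populate `track.label`.
function selectAudioByLabel(player, label) {
  const known = player.getVariantTracks().some((t) => t.label === label);
  if (!known) {
    return false;
  }
  // Assumed method name; see the lead-in above.
  player.selectVariantsByLabel(label);
  return true;
}

// Collect the distinct labels, e.g. to populate a custom audio menu.
function listAudioLabels(player) {
  const labels = new Set();
  for (const track of player.getVariantTracks()) {
    if (track.label) {
      labels.add(track.label);
    }
  }
  return [...labels];
}
```

A custom GUI could call `listAudioLabels(player)` to build its menu and `selectAudioByLabel(player, chosenLabel)` when the user picks an entry.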

sbrez commented 4 years ago

It works like a charm! Thanks a lot @ismena.

ismena commented 4 years ago

@sbrez Great to hear! :)