ibc closed this 5 months ago
Can somebody confirm whether a 2-byte Opus packet can contain real audio or not? @fippo @ggarber @vpalmisano ?
This is critical for a reason: in mediasoup we have an option to NOT forward Opus DTX packets from Producers to Consumers (to save bandwidth). That is, if the setting ignoreDtx: true is set and a DTX packet arrives at mediasoup, then mediasoup won't forward it to the Consumers.
So we need a reliable way to detect whether an Opus packet is DTX or not, hence the magic question:
Can a 2-byte Opus packet contain real audio, or is it guaranteed to be DTX?
That check is odd... maybe @alvestrand can nag audio peeps. What I see as DTX being sent:
It interleaves an empty frame after TOC byte (so 1 byte payload) or the TOC and two bytes which are 0xfffe
This might be a kind of CNG: https://bugs.chromium.org/p/webrtc/issues/detail?id=7272&q=dtx&can=1
OMG now I must become an expert in Opus to handle this because of course there is no simple way to detect if an Opus payload is DTX or not...
From what I understand of the Opus spec, a code 0 packet which omits spectral information for CNG could in theory have a total size of 2 bytes and contain real packet data. This means that checking for a size <= 2 is probably not sufficient.
CNG payload spec: https://datatracker.ietf.org/doc/html/rfc3389
As for decoding DTX or not, it depends on the packet code since some codes omit the frame length coding.
As a reference, here's the TOC byte, where config determines encoder params, s is a stereo flag, and c is the packet code.
 0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
| config  |s| c |
+-+-+-+-+-+-+-+-+
Figure 1: The TOC Byte
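To make the figure concrete, here is a minimal sketch of splitting a TOC byte into its three fields. The names OpusToc and ParseToc are hypothetical (they are not taken from mediasoup or libopus); the only assumption is the RFC's bit numbering, where bit 0 in the figure is the most significant bit of the byte.

```cpp
#include <cstdint>

// Fields of the Opus TOC byte as drawn in Figure 1.
// Hypothetical helper type, for illustration only.
struct OpusToc
{
    uint8_t config; // Figure bits 0-4: the five most significant bits.
    bool stereo;    // Figure bit 5: the "s" flag.
    uint8_t code;   // Figure bits 6-7: the two least significant bits ("c").
};

// Splits a TOC byte into config, stereo flag and packet code.
inline OpusToc ParseToc(uint8_t toc)
{
    OpusToc result;
    result.config = static_cast<uint8_t>(toc >> 3);
    result.stereo = (toc & 0x04) != 0;
    result.code   = static_cast<uint8_t>(toc & 0x03);
    return result;
}
```

For example, the byte 0x78 (binary 0111 1000) would parse as config 15, mono, code 0.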
I believe that to fully cover the spec, we need to read the TOC byte's c field and then handle these cases:
Code 0 (c = 0 0):
- Frame Length is omitted
- DTX determined by total length = 1 (TOC byte only)
Code 1 (c = 0 1):
- Frame Length is omitted
- DTX determined by total length = 1 (TOC byte only)
Code 2 (c = 1 0):
- TOC byte is followed by a one- or two-byte sequence indicating the length of the first frame
- Frame lengths are both 0, so the length is indicated by a single byte
- NOTE: Per spec 'the only valid 2-byte code 2 packet is one where the length of both frames is zero'
- DTX determined by total length = 2
Code 3 (c = 1 1):
- The TOC byte is followed by a byte encoding the number of frames in the packet in bits 2 to 7 (marked "M" in Figure 5)
   0 1 2 3 4 5 6 7
  +-+-+-+-+-+-+-+-+
  |v|p|     M     |
  +-+-+-+-+-+-+-+-+
  Figure 5: The frame count byte
- Per 3.2.5. Code 3: A Signaled Number of Frames in the Packet: 'M MUST NOT be zero, and the audio duration contained within a packet MUST NOT exceed 120 ms'. However this contradicts 3.2.1 Frame Length Coding (below).
- Thus, I conclude that code 3 packets cannot indicate DTX.
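The per-code rules listed above could be sketched roughly as follows. This is only an illustration of the reasoning in this comment, not mediasoup's actual implementation: the function name IsOpusDtx is hypothetical, the extra check that the code 2 length byte is zero is an added assumption based on the quoted spec note, and treating code 3 as never DTX simply follows the conclusion stated above.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of a per-code DTX check (hypothetical helper).
// data points at the Opus payload (TOC byte first), len is its size in bytes.
inline bool IsOpusDtx(const uint8_t* data, size_t len)
{
    if (data == nullptr || len == 0)
    {
        return false; // No TOC byte, so nothing can be decided.
    }

    const uint8_t code = data[0] & 0x03; // The "c" field of the TOC byte.

    switch (code)
    {
        case 0: // One frame, frame length omitted.
        case 1: // Two equal-size frames, frame length omitted.
            return len == 1; // TOC byte only.

        case 2: // Two frames, first frame length is coded explicitly.
            // Assumption: a 2-byte code 2 packet is only valid when the
            // length byte is 0, so that is checked as well.
            return len == 2 && data[1] == 0;

        default: // Code 3: treated as never DTX, per the conclusion above.
            return false;
    }
}
```

With this sketch, a 2-byte code 0 packet (TOC plus one frame byte) is not reported as DTX, whereas a plain size <= 2 check would flag it.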
Frame length coding for reference: https://datatracker.ietf.org/doc/html/rfc6716#appendix-B
3.2.1. Frame Length Coding
When a packet contains multiple VBR frames (i.e., code 2 or 3), the compressed length of one or more of these frames is indicated with a one- or two-byte sequence, with the meaning of the first byte as follows:
o 0: No frame (Discontinuous Transmission (DTX) or lost packet)
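For completeness, a small sketch of decoding that one- or two-byte length sequence. The quote above only shows the first rule (0 means no frame); the remaining rules used here come from the same RFC 6716 section as I understand it: first-byte values 1 to 251 are the frame length in bytes, and values 252 to 255 mean a second byte follows, with total length (second_byte * 4) + first_byte. The function name DecodeFrameLength is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Decodes an Opus frame length field (RFC 6716, section 3.2.1).
// data points at the first length byte, len is the number of bytes available.
// On success, stores the frame length in frameLen and returns the number of
// bytes the length field occupies (1 or 2). Returns 0 if input is truncated.
inline size_t DecodeFrameLength(const uint8_t* data, size_t len, size_t& frameLen)
{
    if (data == nullptr || len == 0)
    {
        return 0;
    }

    const uint8_t first = data[0];

    if (first < 252)
    {
        // 0 means "no frame" (DTX or lost packet); 1..251 is the length.
        frameLen = first;
        return 1;
    }

    if (len < 2)
    {
        return 0; // A second byte is required but missing.
    }

    // 252..255: total length is (second_byte * 4) + first_byte.
    frameLen = static_cast<size_t>(data[1]) * 4 + first;
    return 2;
}
```

A decoded frameLen of 0 corresponds to the "No frame" case quoted above.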
With this information I understand that DTX can be associated only with code 0, 1 and 2 packets, so basically payload_.size() <= 2 covers all the cases, right?
Correct, however it may falsely flag code 0 or code 1 CNG packets as DTX.
Can you literally draw a packet (the exact bits) of that specific case?
We need those "audio peeps" I mentioned, because even I do not fully understand the way to signal "ok dude, I am going into dtx mode now, just saying. Please make sure your Jitterbuffer is ok with that"
So, from above we know the DTX packets are:
Code 0 and Code 1 (TOC byte only, c = 0 0 or 0 1):
 0
 0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
| config  |s| c |
+-+-+-+-+-+-+-+-+
Code 2 (TOC + Frame Length 0):
 0                   1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|1|0|       0       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Minimal comfort noise packets would be:
Code 0
 0                   1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|0|0|0|    level    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Code 1
 0                   1                   2
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|0|1|0|    level    |0|    level    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Code 2
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|1|0|       1       |0|    level    |0|    level    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
NOTE: I'm not sure that code 2 packet would ever be sent, since code 2 is supposed to be "Code 2: Two Frames in the Packet, with Different Compressed Sizes". I suppose in theory you could have the first stream sending a DTX payload, and the second stream a CNG payload. The resulting packet would be 3 bytes and look like:
 0                   1                   2
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|1|0|       0       |0|    level    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
So the case that's an issue seems to be code 0, with a CNG packet that has no spectral information.
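To show the problematic case as actual bytes, here is a small sketch that builds a 2-byte code 0 packet (TOC byte plus a single frame byte) and runs the size-only check against it. All names (MakeToc, NaiveIsDtx, MakeCode0OneByteFrame) are hypothetical, and the frame byte value is arbitrary; the point is only that the size-only check cannot tell this packet apart from DTX.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Builds a TOC byte from its fields (config: 5 bits, stereo: 1 bit, code: 2 bits).
inline uint8_t MakeToc(uint8_t config, bool stereo, uint8_t code)
{
    return static_cast<uint8_t>(((config & 0x1F) << 3) | (stereo ? 0x04 : 0x00) | (code & 0x03));
}

// The size-only check discussed in this thread.
inline bool NaiveIsDtx(size_t payloadSize)
{
    return payloadSize <= 2;
}

// A mono code 0 packet carrying one 1-byte frame: TOC + frame byte.
inline std::array<uint8_t, 2> MakeCode0OneByteFrame(uint8_t config, uint8_t frameByte)
{
    return { MakeToc(config, false, 0), frameByte };
}
```

A packet built with MakeCode0OneByteFrame has size 2, so NaiveIsDtx reports it as DTX even though it carries a frame byte after the TOC.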
Thanks, guys. Marking this PR as draft until we have bandwidth to implement your given feedback.
As per the opus source code's inline documentation for opus_encode(), if the written size is 2 bytes or less then it's a DTX packet.
This PR is ready now. I've read the specs and agree with @kjvenalainen's conclusion above (PR description updated):
In summary:
A code 0 or code 1 packet with length 2 could contain a valid 1-byte frame, so total length <= 2 does not guarantee that the packet is DTX.