versatica / mediasoup

Cutting Edge WebRTC Video Conferencing
https://mediasoup.org
ISC License

OPUS: Fix DTX detection #1357

Closed ibc closed 5 months ago

ibc commented 6 months ago

ibc commented 6 months ago

Can somebody confirm whether a 2-byte Opus packet can contain real audio or not? @fippo @ggarber @vpalmisano ?

This is critical for one reason: in mediasoup we have an option to NOT forward Opus DTX packets from Producers to Consumers (to save bandwidth). That is, if ignoreDtx: true is set and a DTX packet arrives at mediasoup, then mediasoup won't forward it to the Consumers.

So we need a reliable way to detect whether an Opus packet is DTX or not. Hence the magic question:

Can a 2-byte Opus packet contain real audio, or is it guaranteed to be DTX?
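
To make the stakes concrete, here is a minimal sketch of the forwarding decision described above. `ForwardingOptions` and `ShouldForward` are hypothetical names used only for illustration, not mediasoup's actual API; the point is simply that any false positive in the DTX check turns into a dropped audio packet when `ignoreDtx` is enabled.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-Consumer setting described above.
struct ForwardingOptions
{
    bool ignoreDtx;
};

// Returns false only when DTX filtering is enabled AND the packet was
// classified as DTX. `looksLikeDtx` is whatever the detector reported.
inline bool ShouldForward(const ForwardingOptions& options, bool looksLikeDtx)
{
    return !(options.ignoreDtx && looksLikeDtx);
}

// Usage check: a misclassified audio packet would be dropped here.
inline void ShouldForwardExample()
{
    const ForwardingOptions options{ true };
    assert(!ShouldForward(options, /*looksLikeDtx=*/true));
    assert(ShouldForward(options, /*looksLikeDtx=*/false));
}
```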

fippo commented 6 months ago

That check is odd... maybe @alvestrand can nag audio peeps. What I see being sent as DTX (image): it interleaves an empty frame after the TOC byte (so a 1-byte payload) or the TOC byte and two bytes which are 0xfffe.

This might be a kind of CNG: https://bugs.chromium.org/p/webrtc/issues/detail?id=7272&q=dtx&can=1

ibc commented 6 months ago

OMG now I must become an expert in Opus to handle this, because of course there is no simple way to detect whether an Opus payload is DTX or not...

kjvenalainen commented 6 months ago

From what I understand of the Opus spec, a code 0 packet which omits spectral information for CNG could in theory have a total size of 2 bytes and contain real packet data. This means that the check for a <=2 size is probably not sufficient.

CNG payload spec: https://datatracker.ietf.org/doc/html/rfc3389

As for detecting whether a packet is DTX or not, it depends on the packet code, since some codes omit the frame length coding.

As a reference, here's the TOC byte where config determines encoder params, s is a stereo flag, and c is the packet code.

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   | config  |s| c |
   +-+-+-+-+-+-+-+-+
  Figure 1: The TOC Byte
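
As a rough illustration of the figure, the three fields could be pulled out of the TOC byte as follows. This is a sketch: `OpusToc` and `ParseToc` are hypothetical names, not mediasoup or libopus identifiers. The RFC numbers bits from the most significant one, so config is the top five bits and c is the bottom two.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type; names are illustrative only.
struct OpusToc
{
    std::uint8_t config; // 5 bits: encoder mode, bandwidth and frame size
    bool stereo;         // 1 bit: the "s" flag
    std::uint8_t code;   // 2 bits: the packet code "c" (0..3)
};

// Splits a TOC byte into its fields (bit 0 in the figure is the MSB).
inline OpusToc ParseToc(std::uint8_t toc)
{
    OpusToc result;
    result.config = static_cast<std::uint8_t>(toc >> 3);
    result.stereo = (toc & 0x04) != 0;
    result.code   = static_cast<std::uint8_t>(toc & 0x03);
    return result;
}

// Usage check on a sample byte: 0x7A = 01111 0 10.
inline void ParseTocExample()
{
    const OpusToc toc = ParseToc(0x7A);
    assert(toc.config == 15 && !toc.stereo && toc.code == 2);
}
```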

I believe that to fully cover the spec, we need to read the TOC byte's c field and then handle these cases:

Code 0 (c = 0 0):
   - Frame Length is omitted
   - DTX determined by total length = 1 (TOC byte only)

Code 1 (c = 0 1):
   - Frame Length is omitted
   - DTX determined by total length = 1 (TOC byte only)

Code 2 (c = 1 0):
   - TOC byte is followed by a one- or two-byte sequence indicating the length of the first frame
   - Frame lengths are both 0, so the length is indicated by a single byte
   - NOTE: Per spec 'the only valid 2-byte code 2 packet is one where the length of both frames is zero'
   - DTX determined by total length = 2

Code 3 (c = 1 1)
   - The TOC byte is followed by a byte encoding the number of frames in the packet in bits 
      2 to 7 (marked "M" in Figure 5)

              0 1 2 3 4 5 6 7
             +-+-+-+-+-+-+-+-+
             |v|p|     M     |
             +-+-+-+-+-+-+-+-+
        Figure 5: The frame count byte

   - Per 3.2.5.  Code 3: A Signaled Number of Frames in the Packet: 'M MUST NOT be zero, and the audio 
     duration contained within a packet MUST NOT exceed 120 ms'. However this contradicts 3.2.1 Frame 
     Length Coding (below).
   - Thus, I conclude that code 3 packets cannot indicate DTX.

Frame length coding for reference: https://datatracker.ietf.org/doc/html/rfc6716#appendix-B

3.2.1. Frame Length Coding

When a packet contains multiple VBR frames (i.e., code 2 or 3), the compressed length of one or more of these frames is indicated with a one- or two-byte sequence, with the meaning of the first byte as follows:

   o 0: No frame (Discontinuous Transmission (DTX) or lost packet)
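
The case analysis above could be turned into a check along these lines. This is only a sketch under the rules listed in this comment (codes 0 and 1: bare TOC byte; code 2: TOC byte plus a zero length byte; code 3: never DTX). `IsOpusDtx` is a hypothetical name, and the function is not a full RFC 6716 packet validator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper applying the per-code rules from the list above.
inline bool IsOpusDtx(const std::uint8_t* payload, std::size_t len)
{
    if (payload == nullptr || len == 0)
        return false; // no TOC byte, nothing to classify

    const std::uint8_t code = payload[0] & 0x03; // the "c" field

    switch (code)
    {
        case 0:
        case 1:
            // Frame length is implicit, so DTX is a bare TOC byte.
            return len == 1;
        case 2:
            // TOC byte plus one length byte saying the first frame is
            // empty; nothing is left, so the second frame is empty too.
            return len == 2 && payload[1] == 0;
        default:
            // Code 3: treated as never DTX, per the conclusion above.
            return false;
    }
}

// Usage check: a 2-byte code 0 packet is no longer reported as DTX.
inline void IsOpusDtxExample()
{
    const std::uint8_t tocOnly[]  = { 0x78 };       // code 0, TOC only
    const std::uint8_t twoBytes[] = { 0x78, 0x40 }; // code 0, 1-byte frame
    assert(IsOpusDtx(tocOnly, sizeof(tocOnly)));
    assert(!IsOpusDtx(twoBytes, sizeof(twoBytes)));
}
```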

vpalmisano commented 6 months ago

Code 0 (c = 0 0):

  • Frame Length is omitted
  • DTX determined by total length = 1 (TOC byte only)

Code 1 (c = 0 1):

  • Frame Length is omitted
  • DTX determined by total length = 1 (TOC byte only)

Code 2 (c = 1 0):

  • TOC byte is followed by a one- or two-byte sequence indicating the length of the first frame
  • Frame lengths are both 0, so the length is indicated by a single byte
  • NOTE: Per spec 'the only valid 2-byte code 2 packet is one where the length of both frames is zero'
  • DTX determined by total length = 2

Code 3 (c = 1 1)

  • The TOC byte is followed by a byte encoding the number of frames in the packet in bits 2 to 7 (marked "M" in Figure 5)

          0 1 2 3 4 5 6 7
         +-+-+-+-+-+-+-+-+
         |v|p|     M     |
         +-+-+-+-+-+-+-+-+
    Figure 5: The frame count byte
  • Per 3.2.5. Code 3: A Signaled Number of Frames in the Packet: 'M MUST NOT be zero, and the audio duration contained within a packet MUST NOT exceed 120 ms'. However this contradicts 3.2.1 Frame Length Coding (below).

  • Thus, I conclude that code 3 packets cannot indicate DTX.

With this information I understand that DTX can be associated only with code 0, 1 and 2 packets, so basically payload_.size() <= 2 covers all the cases, right?

kjvenalainen commented 6 months ago

payload_.size() <= 2 covers all the cases, right?

Correct, however it may falsely flag code 0 or 1 CNG packets as DTX.

ibc commented 6 months ago

payload_.size() <= 2 covers all the cases, right?

Correct, however it may falsely flag code 0 or 1 CNG packets as DTX.

Can you literally draw a packet (the exact bits) of that specific case?

fippo commented 6 months ago

We need those "audio peeps" I mentioned, because even I do not fully understand the way to signal "ok dude, I am going into DTX mode now, just saying. Please make sure your jitter buffer is ok with that".

kjvenalainen commented 6 months ago

So, from above we know the DTX packets are:

Code 0 and Code 1 (TOC Byte only, c = 0 0 or 0 1):

 0
 0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
| config  |s| c |
+-+-+-+-+-+-+-+-+

Code 2 (TOC + Frame Length 0): 

 0                   1          
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|1|0|       0       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Minimal comfort noise packets would be:

Code 0

 0               1               
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|0|0|0|   level     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Code 1

 0               1               2               
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|0|1|0|   level     |0|   level     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Code 2

 0               1               2               3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|1|0|       1       |0|   level     |0|   level     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

NOTE: I'm not sure that code 2 packet would ever be sent, since code 2 is defined as "Two Frames in the Packet, with Different Compressed Sizes". I suppose in theory you could have the first frame carrying a DTX payload and the second frame a CNG payload. The resulting packet would be 3 bytes and look like:

 0               1               2               
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| config  |s|1|0|       0       |0|   level     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

So the case that's an issue seems to be code 0, with a CNG packet that has no spectral information.
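
To see the problem in bytes, the packets drawn above can be run through the length-only heuristic. This is a sketch with arbitrary sample values (config = 15, s = 0, level = 0x40, all assumptions chosen for illustration) and a hypothetical `LengthOnlyIsDtx` helper; it takes the comment's premise about minimal comfort noise payloads at face value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// The length-only heuristic under discussion.
inline bool LengthOnlyIsDtx(std::size_t len)
{
    return len <= 2;
}

// Sample packets (assumed values: config = 15, s = 0, level = 0x40).
const std::uint8_t kDtxCode0[] = { 0x78 };             // TOC only, c = 0 0
const std::uint8_t kDtxCode2[] = { 0x7A, 0x00 };       // TOC (c = 1 0) + length 0
const std::uint8_t kCngCode0[] = { 0x78, 0x40 };       // TOC (c = 0 0) + one level byte
const std::uint8_t kCngCode1[] = { 0x79, 0x40, 0x40 }; // TOC (c = 0 1) + two level bytes

inline void LengthOnlyExample()
{
    assert(LengthOnlyIsDtx(sizeof(kDtxCode0)));  // DTX, flagged
    assert(LengthOnlyIsDtx(sizeof(kDtxCode2)));  // DTX, flagged
    assert(LengthOnlyIsDtx(sizeof(kCngCode0)));  // not DTX, but flagged anyway
    assert(!LengthOnlyIsDtx(sizeof(kCngCode1))); // 3 bytes, not flagged
}
```

Only the 2-byte code 0 packet is misclassified; the code 1 and code 2 comfort noise packets are already longer than 2 bytes.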

ibc commented 6 months ago

Thanks, guys. Marking this PR as draft until we have bandwidth to implement your feedback.

jmillan commented 6 months ago

As per the opus source code's inline documentation for opus_encode(), if the written size is 2 bytes or less then it's a DTX packet.

https://github.com/xiph/opus/blob/main/include/opus.h#L143

ibc commented 5 months ago

This PR is ready now. I've read the specs and agree with @kjvenalainen's conclusion above (PR description updated):

In summary:

A code 0 or code 1 packet with length 2 could contain a valid 1-byte frame, so a total length <= 2 does not guarantee that the packet is DTX.