Understanding vbv-len and maxbitrate

I'm trying to understand how to generate compliant streams, and controlling the virtual buffer size is key. I don't quite understand the impact of these parameters when using tsMuxer though.

Using a theoretical model to control "bit flow" is common among the MPEG video codecs (MPEG-2 video/H.262, MPEG-4 Visual, AVC/H.264 and HEVC/H.265). In MPEG-2 video and MPEG-4 Visual it's called the Video Buffering Verifier (VBV), while in AVC/H.264 and HEVC/H.265 it's called the Hypothetical Reference Decoder (HRD), but it's basically the same concept. The combination of profile and level for the specific codec defines maximum allowed values for both the buffer size and the bitrate, among other things. Hardware players often adhere strictly to these limits, as they are built with hardware buffers of a certain size and cannot handle even a "small overflow".

From what I understand the VBV/HRD parameters are stored as metadata in the video streams, so that decoders can "set up" the correct buffers before starting to decode.

These things apply to the encoded video stream, not to the transport stream as far as I can understand. I haven't heard of it when it comes to audio codecs. So, since tsMuxer doesn't re-encode the elementary streams where these "rules" must be followed, I'm wondering what impact these parameters have. I'm really hoping that setting these doesn't manipulate the metadata in the video stream, that would be very bad as they need to reflect the values that the encoder has actually used when generating the stream.

I've tried to look at the source code, but I'm too unfamiliar with the code base to make much sense of it. It doesn't look to me like any manipulation of metadata is done though. From what I can see, it only seems to impact the PCR frequency..? I'm not sure if this applies to the transport stream as a whole, or to the video stream.

I'm also wondering what maxbitrate does (despite being called cbrBitrate in the code). What happens if the sum of the bitrates of the muxed streams exceeds this value? Without re-encoding, I wouldn't think there was much tsMuxer could do about it, except maybe throwing an error. Again I'm wondering if this value is stored as metadata in the stream, or if it's only used as a part of some calculations.

Lastly, there's the twist that vbv-len is specified in milliseconds, not in bits as is the norm. It's easy enough to calculate the necessary value if I know what the "reference bitrate" is, but that's not clear to me. To get a buffer size of, for example, the very common limit of 1835008 bits, exactly what bitrate value would I need to use to make sure that the millisecond value I specify results in a buffer of 1835008 bits? Or is this irrelevant, because it doesn't actually impact the VBV/HRD buffer size of the video stream?
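To make the arithmetic concrete, here is a minimal sketch of the conversion between a duration in milliseconds and a buffer size in bits. It assumes the buffer is simply "reference bitrate × duration"; which bitrate tsMuxer actually uses as the reference (if any) is exactly the open question, so the 15 Mbit/s below is only an illustrative value, and the function names are made up for this example.

```python
def buffer_bits_from_ms(vbv_len_ms: float, reference_bitrate_bps: float) -> float:
    """Bits that arrive during vbv_len_ms at a constant reference bitrate."""
    return reference_bitrate_bps * vbv_len_ms / 1000.0


def ms_from_buffer_bits(buffer_bits: float, reference_bitrate_bps: float) -> float:
    """Milliseconds needed to fill buffer_bits at a constant reference bitrate."""
    return buffer_bits * 1000.0 / reference_bitrate_bps


# Assumed reference bitrate of 15 Mbit/s, purely for illustration.
print(ms_from_buffer_bits(1_835_008, 15_000_000))  # roughly 122.3 ms
```

If the reference were instead the --maxbitrate value or the actual stream bitrate, the same millisecond value would correspond to a different number of bits, which is why the reference matters.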

@Nadahar the ts/m2ts stream is made of 188/192-byte packets. The parameters control the flow of the packets, which is important for radio/internet streaming or Blu-ray Disc reading. If you're interested, the joint ISO/IEC 13818-1 / ITU-T H.222.0 standard will give you more details about buffer and PCR (Program Clock Reference) management in MPEG-TS.

From what I can understand, --vbv-len will fix the maximum offset between the PCR packet reading and the stream presentation time PTS, and --maxbitrate will fix the minimum timegap between two consecutive packet PCR timestamps / reading.
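As a rough illustration of the relationship between a mux bitrate and packet timing, the sketch below computes how long one transport packet occupies on a constant-bitrate stream, both in seconds and in ticks of the 27 MHz clock that PCR values are expressed in. This is only the generic arithmetic, not tsMuxer's actual code; the function names are hypothetical, and whether the 4 extra bytes of an m2ts packet count towards the bitrate is left as a parameter because that is an assumption.

```python
PCR_CLOCK_HZ = 27_000_000  # PCR values count ticks of a 27 MHz system clock


def packet_interval_seconds(mux_bitrate_bps: int, packet_bytes: int = 188) -> float:
    """Time one packet occupies on a constant-bitrate stream."""
    if mux_bitrate_bps <= 0:
        raise ValueError("mux_bitrate_bps must be positive")
    return packet_bytes * 8 / mux_bitrate_bps


def packet_interval_pcr_ticks(mux_bitrate_bps: int, packet_bytes: int = 188) -> float:
    """The same interval expressed in 27 MHz PCR ticks."""
    return PCR_CLOCK_HZ * packet_interval_seconds(mux_bitrate_bps, packet_bytes)


# A higher mux bitrate means a smaller gap between consecutive packets.
print(packet_interval_seconds(1_504_000))     # 188-byte packets at 1.504 Mbit/s
print(packet_interval_seconds(48_000_000, 192))  # 192-byte packets at 48 Mbit/s
```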

@jcdr428 Thanks, I'll do some reading in 13818-1. I didn't know that the "VBV concept" was used in the standard for TS, I thought it was for video codecs only. However, I still wonder exactly how these parameters should be used, and I think it would be preferable if the documentation gave some information that didn't require "everyone" to acquire and read the standard.

I'll come back once I've studied 13818-1 and see if I have any suggestions or viewpoints.

I've read what seem to be the relevant parts multiple times, but it's hard to grasp all this without being intimately familiar with the details of a transport stream. It seems to me that there are two "modes" for transport streams, the leak method and the vbv_delay method. Below are some definitions that seem relevant.

MB_n is the multiplexing buffer, for elementary stream n. It is present only for video elementary streams.

EBS_n is the size of the elementary stream buffer EB_n, measured in bytes.

Rbx_n is the rate at which PES packet payload data are removed from MB_n when the leak method is used. Defined only for video elementary streams.

Rbx_n(j) is the rate at which PES packet payload data are removed from MB_n when the vbv_delay method is used. Defined only for video elementary streams.

All bytes that enter the buffer TB_n are removed at the rate Rx_n specified below. Bytes which are part of the PES packet header or its contents are delivered to the main buffer B_n for audio elementary streams and system data, and to the multiplexing buffer MB_n for video elementary streams. Other bytes are not, and may be used to control the system. Duplicate transport stream packets are not delivered to B_n, MB_n, or B_sys.

The buffer TB_n is emptied as follows:

When there is no data in TB_n, Rx_n is equal to zero.

Otherwise for video: Rx_n = 1.2 × R_max[profile, level]

R_max[profile, level] is specified according to the profile and level which can be found in Table 8-13 of Rec. ITU-T H.262 | ISO/IEC 13818-2. This table specifies the upper bound of the rate of each elementary video stream within a specific profile and level.

Rx_n is equal to 1.2 × R_max for ISO/IEC 11172-2 constrained parameter video streams, where R_max refers to the maximum bitrate for a constrained parameters bitstream in ISO/IEC 11172-2.
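The TB_n emptying rule quoted above can be sketched as a small function. The name is made up for this example, and the 15 Mbit/s R_max is only an example value (I believe it is the figure usually given for Main Profile at Main Level, but Table 8-13 is the authority).

```python
def rx_n_bps(tb_n_bytes: int, r_max_bps: int) -> int:
    """Rate at which bytes leave TB_n for a video elementary stream.

    Zero when TB_n is empty, otherwise 1.2 x R_max[profile, level].
    Integer math is used so the factor 1.2 introduces no float rounding.
    """
    if tb_n_bytes == 0:
        return 0
    return r_max_bps * 12 // 10


# Example R_max of 15 Mbit/s (assumed value, check Table 8-13).
print(rx_n_bps(188, 15_000_000))  # 18000000
print(rx_n_bps(0, 15_000_000))    # 0
```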

The elementary stream buffer sizes EBS₁ through EBS_n are defined for video as equal to the vbv_buffer_size as it is carried in the sequence header. Refer to the summary of constrained parameters in ISO/IEC 11172-2 and Table 8-14 of Rec. ITU-T H.262 | ISO/IEC 13818-2.

ITU-T H.262 | ISO/IEC 13818-2 is what is otherwise known as "MPEG-2 video" and ISO/IEC 11172-2 is "MPEG-1 video". There are extensions defined for newer codecs like MPEG-4 Visual, AVC/H.264, HEVC/H.265 and others, but I have a hard time understanding the link between the "transport stream buffer" and the video codecs.

To sum it up, I think I'm none the wiser at this point.

From 2.4.2.7 Buffer management:

Transport streams shall be constructed so that conditions defined in this subclause are satisfied. This subclause makes use of the notation defined for the system target decoder.

TB_n and TB_sys shall not overflow. TB_n and TB_sys shall empty at least once every second. B_n shall not overflow nor underflow. B_sys shall not overflow.

EB_n shall not underflow except when the low delay flag in the video sequence extension is set to '1' (refer to 6.2.2.3 in Rec. ITU-T H.262 | ISO/IEC 13818-2) or trick_mode status is true.

When the leak method for specifying transfers is in effect, MB_n shall not overflow, and shall empty at least once every second. EB_n shall not overflow. When the vbv_delay method for specifying transfers is in effect, MB_n shall not overflow nor underflow, and EB_n shall not overflow.

The delay of any data through the system target decoder buffers shall be less than or equal to one second except for still picture video data, ISO/IEC 14496 streams, ISO/IEC 23008-2 streams, ISO/IEC 23090-3 streams and ISO/IEC 23094-1 streams. Specifically: td_n(j) – t(i) ≤ 1 second for all j, and all bytes i in access unit A_n(j).

For still picture video data, the delay is constrained by td_n(j) – t(i) ≤ 60 seconds for all j, and all bytes i in access unit A_n(j).

For ISO/IEC 14496, ISO/IEC 23008-2, ISO/IEC 23090-3 and ISO/IEC 23094-1 streams, the delay is constrained by td_n(j) – t(i) ≤ 10 seconds for all j, and all bytes i in access unit A_n(j).

ISO/IEC 14496 is what is otherwise known as MPEG-4, which contains two video codecs: MPEG-4 Visual (part 2) and AVC/H.264 (part 10). ISO/IEC 23008-2 is HEVC/H.265, ISO/IEC 23090-3 is VVC/H.266 and ISO/IEC 23094-1 is MPEG-5 EVC.

The way I understand this is that for MPEG-1 video and MPEG-2 video/H.262, the buffer size shouldn't be bigger than what represents one second of data; for still picture video data (whatever that is, maybe MJPEG and similar?) it shouldn't be bigger than what represents 60 seconds of data; and for MP4v, H.264, H.265, H.266 and EVC it shouldn't be bigger than what represents 10 seconds of data.
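Under that reading, the three delay limits can be written as a small lookup. This is only a sketch of my interpretation: that vbv-len maps onto this delay at all is an assumption, the names are hypothetical, and letting the still picture limit take precedence over the per-standard limit is also an assumption.

```python
# Standards that get the 10 second limit in 2.4.2.7.
TEN_SECOND_STANDARDS = {
    "ISO/IEC 14496",    # MPEG-4 Visual, AVC/H.264
    "ISO/IEC 23008-2",  # HEVC/H.265
    "ISO/IEC 23090-3",  # VVC/H.266
    "ISO/IEC 23094-1",  # MPEG-5 EVC
}


def max_delay_ms(standard: str, still_picture: bool = False) -> int:
    """Maximum delay through the T-STD buffers, in milliseconds."""
    if still_picture:  # assumed to take precedence over the standard
        return 60_000
    if standard in TEN_SECOND_STANDARDS:
        return 10_000
    return 1_000  # everything else, e.g. MPEG-1 and MPEG-2 video


def vbv_len_within_limit(vbv_len_ms: int, standard: str, still_picture: bool = False) -> bool:
    """True if a vbv-len style value does not exceed the delay limit."""
    return 0 < vbv_len_ms <= max_delay_ms(standard, still_picture)


print(vbv_len_within_limit(10_000, "ISO/IEC 14496"))    # True
print(vbv_len_within_limit(10_000, "ISO/IEC 13818-2"))  # False
```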

I'm not sure if the current vbv-len supports this, will it actually work to use a value of 10000 for H.264 for example? It also makes the default of 500 milliseconds seem like a strange choice.

It also seems to me like both buffer size and maximum bitrate should be specified in different headers/descriptors/metadata. I couldn't find references in the code to where these parameters are used for this. If so, what information is used there instead?

2.6.32 STD descriptor

This descriptor is optional and applies only to the T-STD model and to Rec. ITU-T H.262 | ISO/IEC 13818-2 video elementary streams, and is used as specified in 2.4.2. This descriptor does not apply to program streams (see Table 2-70).

2.6.33 Semantic definition of fields in STD descriptor

leak_valid_flag – The leak_valid_flag is a 1-bit flag. When set to '1', the transfer of data from the buffer MB_n to the buffer EB_n in the T-STD uses the leak method as defined in 2.4.2.4. If this flag has a value equal to '0', and the vbv_delay fields present in the associated video stream do not have the value 0xFFFF, the transfer of data from the buffer MB_n to the buffer EB_n uses the vbv_delay method as defined in 2.4.2.4.

2.6.26 Maximum bitrate descriptor

maximum_bitrate – The maximum bitrate is coded as a 22-bit positive integer in this field. The value indicates an upper bound of the bitrate, including transport overhead, that will be encountered in this program element or program. The value of maximum_bitrate is expressed in units of 50 bytes/second. The maximum_bitrate_descriptor is included in the Program Map Table (PMT). Its presence as extended program information indicates applicability to the entire program. Its presence as ES information indicates applicability to the associated program element.
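Since 50 bytes/second is 400 bits/second, converting a bitrate into the value of this field is a simple division. The sketch below is my own illustration, not something taken from tsMuxer: the function name is hypothetical, and rounding up is a design choice I made because the field is described as an upper bound.

```python
MAX_22_BIT = (1 << 22) - 1  # largest value a 22-bit field can hold


def maximum_bitrate_field(bitrate_bps: int) -> int:
    """Value for the 22-bit maximum_bitrate field, in units of 50 bytes/second."""
    if bitrate_bps <= 0:
        raise ValueError("bitrate must be positive")
    units = -(-bitrate_bps // 400)  # ceiling division: 50 bytes/s == 400 bits/s
    if units > MAX_22_BIT:
        raise ValueError("bitrate does not fit in a 22-bit field")
    return units


print(maximum_bitrate_field(20_000_000))  # 50000
```

The largest bitrate that can be expressed this way is 4194303 × 400 bits/second, a little under 1.68 Gbit/s.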

2.6.52 MultiplexBuffer descriptor

The MultiplexBuffer descriptor (see Table 2-81) conveys the size of the multiplex buffer MB_n, as well as the leak rate Rx_n at which data is transferred from transport buffer TB_n into buffer MB_n for a specific Rec. ITU-T H.222.0 | ISO/IEC 13818-1 program element referenced by an elementary_PID value in the Program Map Table. One MultiplexBuffer descriptor shall be associated with each elementary_PID that contains an ISO/IEC 14496 FlexMux stream or SL-packetized stream, including those containing ISO_IEC_14496_sections. See 2.11.3.9 for the definition of buffers and rates in the T-STD model for decoding of ISO/IEC 14496 content. The MultiplexBuffer descriptor shall be conveyed in the descriptor loop immediately following the ES_info_length field in the Program Map Table.

2.6.53 Semantic definition of fields in MultiplexBuffer descriptor

MB_buffer_size – This 24-bit field shall specify the size in byte of buffer MB_n of the elementary stream n that is associated with this descriptor.

TB_leak_rate – This 24-bit field shall specify in units of 400 bits per second the rate at which data is transferred from transport buffer TB_n to multiplex buffer MB_n for the elementary stream n that is associated with this descriptor.
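The two fields could be packed as in the sketch below. It only covers the two 24-bit fields described here, in the order they are listed, and deliberately leaves out the descriptor tag and length bytes. The function name is hypothetical and rounding the leak rate up to the next 400 bit/s unit is my assumption.

```python
MAX_24_BIT = (1 << 24) - 1  # largest value a 24-bit field can hold


def multiplex_buffer_fields(mb_buffer_size_bytes: int, tb_leak_rate_bps: int) -> bytes:
    """Pack MB_buffer_size and TB_leak_rate as two big-endian 24-bit fields."""
    leak_units = -(-tb_leak_rate_bps // 400)  # units of 400 bits/second, rounded up
    for name, value in (("MB_buffer_size", mb_buffer_size_bytes),
                        ("TB_leak_rate", leak_units)):
        if not 0 <= value <= MAX_24_BIT:
            raise ValueError(f"{name} does not fit in 24 bits")
    return mb_buffer_size_bytes.to_bytes(3, "big") + leak_units.to_bytes(3, "big")


# Example values only: a 9781-byte MB_n and an 18 Mbit/s leak rate.
print(multiplex_buffer_fields(9781, 18_000_000).hex())
```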

It might be that I'm mixing the buffer and rate information for the elementary streams with the equivalent information for the transport stream itself, but it's not clear to me if there's a clear line between these, and if so, where the line is. As far as I can tell, the information for the transport stream is supposed to be calculated based on the information from the elementary streams using certain rules. Elementary streams that don't have such information, for example audio streams, have fixed values defined that should be used.

BS means buffer size (in bytes) and for MP3 and AAC we have these definitions:

For ISO/IEC 11172-3 or ISO/IEC 13818-3 audio: BS_n = 2848 bytes

For ISO/IEC 13818-7 ADTS audio:

BS_n = 2848 bytes if 1-2 channels

BS_n = 7200 bytes if 3-8 channels

BS_n = 10800 bytes if 9-12 channels

BS_n = 43200 bytes if 13-48 channels
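The fixed audio values listed above amount to a simple lookup, which could be sketched like this (the names are hypothetical, and only the values quoted here are covered):

```python
MPEG_AUDIO_BS_N = 2848  # ISO/IEC 11172-3 or ISO/IEC 13818-3 audio, in bytes


def adts_bs_n(channels: int) -> int:
    """Main buffer size BS_n in bytes for ISO/IEC 13818-7 ADTS audio."""
    if 1 <= channels <= 2:
        return 2848
    if 3 <= channels <= 8:
        return 7200
    if 9 <= channels <= 12:
        return 10800
    if 13 <= channels <= 48:
        return 43200
    raise ValueError("channel count outside the ranges listed (1-48)")


print(adts_bs_n(2))  # 2848
print(adts_bs_n(6))  # 7200
```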

As far as I can understand, at least some of this must already be a part of tsMuxer, or it's hard to imagine that the streams would be playable at all. But it's very confusing for me to understand exactly what we're specifying using these parameters and how to use them to get the correct results.

When going back and looking at #108, it seems that many of my questions are at least partly answered. It's still not quite clear to me exactly what these parameters do, but it seems clear that at this point, making a compliant stream with tsMuxer isn't really possible.

@Nadahar yes, there is still substantial work to be done on the buffer management of the elementary streams...
