ietf-wg-cellar / matroska-specification

Matroska specification.
http://ietf-wg-cellar.github.io/matroska-specification

Consider providing a facility for integer-fraction timescales #422

Open rcombs opened 4 years ago

rcombs commented 4 years ago

It's pretty well-established that Matroska's poor timebase support is one of the format's worst properties. While it supports very precise timestamps (down to the nanosecond), it's very inefficient to do so (and the resulting values still aren't exact for most input rates), so muxers tend to default to 1ms timestamps, which can lead to a variety of subtle issues, especially with high-packet-rate streams (e.g. audio) and VFR video content. Muxers can choose rates that are closer to the time base of their inputs (or the packet rate of the content), but exactly how best to do so has always been unclear, and some of the possible options would lead to either worse player behavior or timestamp drift. I'm proposing a format addition to remedy this.

The only actual normative change I propose is this: in addition to the classic nanosecond-denominator time scale, muxers could provide two additional integers, serving as the numerator and denominator of a time base, which is required to round to the existing nanosecond-scaled value.

This should be paired with some advice for muxer implementations on how to make use of this feature. This depends on the properties of the input. For reference, here are some examples of the error produced by rounding a variety of common time bases to the nearest nanosecond, scaled by 3 hours (a reasonable target for the duration of a film):

nearest_ns(x) = round(x * 1,000,000,000) / 1,000,000,000
ceil_ns(x) = ceil(x * 1,000,000,000) / 1,000,000,000
floor_ns(x) = floor(x * 1,000,000,000) / 1,000,000,000
nearest_error(x) = 1 - (x / nearest_ns(x))
ceil_error(x) = 1 - (x / ceil_ns(x))
floor_error(x) = 1 - (x / floor_ns(x))
nearest_error_3h(x) = nearest_error(x) * 60 * 60 * 3
ceil_error_3h(x) = ceil_error(x) * 60 * 60 * 3
floor_error_3h(x) = floor_error(x) * 60 * 60 * 3
e(x) = nearest_error_3h(1 / x)
ce(x) = ceil_error_3h(1 / x)
fe(x) = floor_error_3h(1 / x)

# Integer video frame rates
e(24)        => 8.64e-5
e(25)        => 0
e(30)        => -0.0001
e(48)        => -0.0002
e(50)        => 0
e(60)        => 0.0002
e(120)       => -0.0004

# NTSC video frame rates
e(24/1.001)  => -8.6314e-5
e(30/1.001)  => 0.0001
e(48/1.001)  => 0.0002
e(60/1.001)  => -0.0002
e(120/1.001) => 0.0004

# TrueHD frame rates
e(44100/40)   => -0.0057
e(48000/40)   => -0.0043
e(88200/40)   => 0.0062
e(96000/40)   => 0.0086

# AAC frame rates
e(44100/960)  => -0.0002
e(48000/960)  => 0
e(88200/960)  => 0.0003
e(96000/960)  => 0
e(44100/1024) => 0.0002
e(48000/1024) => -0.0002
e(88200/1024) => -0.0003
e(96000/1024) => 0.0003

# MP3 frame rates
e(44100/1152)  => 8.4375e-6
e(48000/1152)  => 0
e(88200/1152)  => -0.0004
e(96000/1152)  => 0

# Other audio frame rates
e(44100/128)   => -0.0012
e(48000/128)   => 0.0013
e(88200/128)   => -0.0012
e(96000/128)   => -0.0027
e(44100/2880)  => -7.425e-5
e(48000/2880)  => 2.3981e-12
e(88200/2880)  => -7.425e-5
e(96000/2880)  => 2.3981e-12

# GCF of common short-first audio frame sizes
e(44100/64)   => -0.0012
e(48000/64)   => -0.0027
e(88200/64)   => 0.0062
e(96000/64)   => 0.0054

# Raw audio sample rates
e(44100)     => 0.1253
e(48000)     => -0.1728
e(88200)     => 0.1253
e(96000)     => 0.3456

fe(44100)    => -0.351
ce(48000)    => 0.3456
fe(88200)    => -0.8273
fe(96000)    => -0.6912

# MPEGTS time base
e(90000)     => -0.108

ce(90000)    => 0.8639

# Common multiples
e(30000)     => -0.108
e(60000)     => 0.216
e(120000)    => -0.432
e(240000)    => 0.8639
e(480000)    => -1.7283

ce(30000)    => 0.216
fe(60000)    => -0.432
ce(120000)   => 0.8639
fe(240000)   => -1.7283
ce(480000)   => 3.4549
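These figures can be reproduced with a short Python sketch of the formulas above (exact rationals avoid floating-point noise; the helper names mirror the pseudocode, and `ceil_`/`floor_` variants follow the same pattern):

```python
from fractions import Fraction

NS = 1_000_000_000  # nanoseconds per second

def nearest_ns(x):
    # Round a duration x (in seconds) to the nearest nanosecond.
    return Fraction(round(Fraction(x) * NS), NS)

def nearest_error_3h(x):
    # Relative error of the rounded duration, scaled to 3 hours of runtime.
    return float((1 - Fraction(x) / nearest_ns(x)) * 3 * 60 * 60)

def e(rate):
    # Drift over 3 hours for a tick duration of 1/rate seconds.
    return nearest_error_3h(1 / Fraction(rate))
```

For example, `e(25)` is exactly 0 and `e(44100)` comes out around 0.1253 seconds, matching the listing above.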

As we can see, rounding common video and audio frame rates (including e.g. the least common multiple of 24 and 60 for that VFR case) produces a negligible amount of error over a reasonable duration. This means that for content where all timestamps can reasonably be expressed in integer values of those rates, there would be no significant error over common file durations, even if different streams were muxed with different time bases.

There are a few real-world time bases that would produce significant rounding error (upwards of 100ms) over the course of 3 hours when used in existing players: MPEGTS's 90000Hz, all common raw audio sample rates, and least-common-multiples between integer and NTSC video frame rates. This essentially means that mixing these rates with others would produce significant desync over a reasonable duration for static on-disk content; the same issue could occur when muxing very lengthy content (e.g. streaming).

All of these issues can be addressed in one of the following ways:

- Using a lower rate (e.g. 90,000Hz isn't usually the real content rate but instead an artifact of its previous container; expressing timestamps in samples rather than frames is usually unnecessary)
- Choosing the highest of the input rates for all streams (e.g. 48000 is a multiple of many common frame rates, including 24/1.001)
- Choosing a more precise common-multiple rate that may create a larger total drift, but does so equally for all streams (see the "Common multiples" section; 1/30000 is suitable for mixing 24fps and 30/1.001fps content alongside most common framed audio rates, while the later listed bases are suitable for increasingly large sets)
- Rounding some tracks' nanosecond timescales in the opposite direction, creating a larger drift, but potentially one with the same sign (and thus a closer value) as the drift in other tracks (this is probably too complex and niche to have substantial use)
- Falling back to classic rounded nanosecond-based timestamps (and not writing an integer-fraction time base at all)
- Using the extension, resulting in significant sync drift in older players that haven't implemented the change

This last option is usually unacceptable, but may be fine for files that use codecs that become available after the change is made (and thus are unavoidably non-backwards-compatible anyway).
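For the common-multiple approach (see the "Common multiples" section above), the tick duration can be derived as the greatest common divisor of the input tick durations: the gcd of the numerators over the lcm of the denominators. A minimal Python sketch, with the helper name being my own:

```python
from fractions import Fraction
from math import gcd, lcm  # math.lcm requires Python 3.9+

def common_timebase(tick_durations):
    # GCD of a set of rationals: gcd of numerators over lcm of denominators.
    num, den = 0, 1
    for d in tick_durations:
        num = gcd(num, d.numerator)
        den = lcm(den, d.denominator)
    return Fraction(num, den)

# 24 fps video mixed with 30000/1001 fps video -> a 1/30000 s tick.
common_timebase([Fraction(1, 24), Fraction(1001, 30000)])
# Adding 48 kHz audio (1/48000 s per sample) pushes it to 1/240000 s.
common_timebase([Fraction(1, 24), Fraction(1001, 30000), Fraction(1, 48000)])
```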

If combined with clear advice in the spec on how muxers SHOULD (or MAY) decide on time bases for various possible input cases, I think this extension could get actual adoption in muxers and solve one of the format's longest-standing problems.

rcombs commented 4 years ago

…I just realized I'd misremembered where timescales are specified when writing this (they're on the segment, not the track). Still, the same concept applies, just requiring common-multiple rates (though the TrackTimestampScale element could be used to account for this to some extent; it's deprecated, but all existing players other than MPlayer-derived ones seem to support it).

dericed commented 4 years ago

Hi, I see this was brought up as an issue in the GitHub repository and am cross-posting to the cellar working group.

On Sep 7, 2020, at 12:02 AM, rcombs notifications@github.com wrote: […]

This has been discussed on the list before, though I don't remember clear consensus on how to address this. Steve even compiled a list of discussions on this at https://mailarchive.ietf.org/arch/msg/cellar/ZpZxhG1gML9xVx_ir1Jf6_gcI8U/.

I proposed an option in https://mailarchive.ietf.org/arch/msg/cellar/mTprgjNqVbe20e6hyYxns8ZnVwY/ where one of the existing reserved bits of the Block Header (in the byte that contains the keyframe, invisible, and lacing flags) would be used as a flag for Timescale Alignment.

With this approach, new elements could be added to the track header with a numerator and denominator of a rational time scale, and if Timescale Alignment were set to true, then the nearest increment of the rational time scale would be used. Example:

Thus, if the frame rate in the track header is 120000/1001:

If the Matroska timecode is 4 and Enable TimeScale Alignment is 0, then it is at 4 / (1,000,000,000 / TimecodeScale). If the Matroska timecode is 4 and Enable TimeScale Alignment is 1, then it is at 0 / 120000 (the nearest increment of the rational frame rate).

If the Matroska timecode is 17 and Enable TimeScale Alignment is 0, then it is at 17 / (1,000,000,000 / TimecodeScale). If the Matroska timecode is 17 and Enable TimeScale Alignment is 1, then it is at 2002 / 120000 (the nearest increment of the rational frame rate).
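Both examples amount to snapping the millisecond timecode to the nearest tick of the rational frame rate; a hypothetical sketch of that computation (the function name is my own):

```python
from fractions import Fraction

def align(timecode_ms, rate):
    # Snap a millisecond timestamp to the nearest increment of a rational
    # frame rate, returning the aligned timestamp in seconds.
    n = round(Fraction(timecode_ms, 1000) * rate)  # nearest frame index
    return n / rate

rate = Fraction(120000, 1001)
align(4, rate)   # timecode 4 ms  -> frame 0 -> 0 / 120000
align(17, rate)  # timecode 17 ms -> frame 2 -> 2002 / 120000 s
```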

If a Matroska demuxer doesn't understand the new num/denom elements or the Alignment flag, then it would simply use the existing nanosecond timestamp system.

In that thread there were other proposals; for example, Steve discussed using a float to depict a point in time. Dave

ghost commented 4 years ago

Did anyone suggest storing the rounding error as a fraction? (With denominator stored in the track header, this is only 3 bytes per packet in the best case.)

Of course all of these ideas are terrible hacks compared to just storing it in the correct way.

dericed commented 4 years ago

On Sep 8, 2020, at 11:52 AM, wm4 notifications@github.com wrote:

Did anyone suggest storing the rounding error as a fraction? (With denominator stored in the track header, this is only 3 bytes per packet in the best case.)

That sounds interesting: to have the rounding error numerator in each block and the rounding error denominator in the track header. Perhaps a rounding error denominator could also be in the block but defaults to the one within the track header. Of course all of these ideas are terrible hacks compared to just storing it in the correct way.

Yes, it is a challenge to fix this and maintain reverse compatibility. Dave

mbunkus commented 4 years ago

Did anyone suggest storing the rounding error as a fraction? (With denominator stored in the track header, this is only 3 bytes per packet in the best case.)

I don't like this as storing rounding errors is imprecise as well (unless the global timestamp scaling factor is a multiple of the rounding error's denominator). I'm also quite unsure which denominator a multiplexer should choose. In order to express a rounding error precisely it must have a much higher resolution than the usual 1ms resolution of Matroska timestamps. For example, with 1001/30000 FPS content the rounding error will always be below one frame duration, therefore you'll have to make the denominator much larger.

Something else that came to mind when reading our previous discussion that Dave linked to: please keep in mind that any solution that sets values for the whole track in the track header will inevitably fail with mixed frame rate content or content with different interlacing, e.g. when multiplexing from an MPEG transport stream recorded from a DVB broadcast. Those bloody streams change frame rates all the time when the program changes, e.g. when transitioning to and from commercials (or just from an announcement to the movie). With our new and shiny precise timestamp calculation we'll either have to forbid such changes (unrealistic) or provide facilities to signal such changes in the form of some type of global index similar to cues. Unlike cues, though, such an index would have to be mandatory (a file without cues can be played just fine, even seeking works similar to seeking in Ogg files — meaning some kind of binary search).

File types whose timestamps are based solely on a stream's regular sampling frequency (MP4 usually is, but doesn't have to be; Ogg is, too) all share those issues. MPEG TS on the other hand uses a 90 kHz-based clock, which is fine for most video stuff but doesn't have enough resolution for sample-precision timing of audio tracks with high sampling frequencies.

… in the correct way.

Due to what I've written above I'm pretty sure that there is no one correct way to store timestamps for a general purpose container that allows its content to change its time base in the middle.

In theory Matroska's timestamps can have sample-precision already (just make global timestamp scale small enough to match all of the tracks' time bases). The problem is with the waste of space that follows due to the bloody 16-bit integer offset in Block & SimpleBlock.

So if we're thinking about breaking compatibility anyway, why not think about a whole new SimpleBlock V2 that allows for much larger relative timestamps? Would make all existing players incompatible, though.

Another idea that only wastes space but doesn't destroy existing players' ability to play the file: adding a new child to Block called PreciseRelativeTimestamp or whatever that contains the difference between the timestamp-scaling-based timestamp & the actual, precise one, in nanoseconds. Cannot be used with SimpleBlocks, of course. Will take several bytes per BlockGroup.

ghost commented 4 years ago

I don't like this as storing rounding errors is imprecise as well (unless the global timestamp scaling factor is a multiple of the rounding error's denominator).

It can be 100% exact. It's the rounding error after all - the number that needs to be added to the "classic" ms timestamp to get the fractional timestamp.

I'm also quite unsure which denominator a multiplexer should choose. In order to express a rounding error precisely it must have a much higher resolution than the usual 1ms resolution of Matroska timestamps. For example, with 1001/30000 FPS content the rounding error will always be below one frame duration, therefore you'll have to make the denominator much larger.

It seems the denominator of the rounding error is simply the denominator of the original timestamp. E.g. in this case, the rounding error would have denominator 30000 and numerator (n*1001/30000 - int(n*1001/30000*1000)/1000) * 30000 for frame n, or something like this. This is probably wrong, just typing this out casually. Actually it probably also needs a constant numerator part (to be stored in the track header) of 1001.

Something else that came to mind when reading our previous discussion that Dave linked to: please keep in mind that any solution that sets values for the whole track in the track header will inevitably fail with mixed frame rate content or content with different interlacing, e.g. when multiplexing from an MPEG transport stream recorded from a DVB broadcast. Those bloody streams change frame rates all the time when the program changes, e.g. when transitioning to and from commercials (or just from an announcement to the movie). With our new and shiny precise timestamp calculation we'll either have to forbid such changes (unrealistic) or provide facilities to signal such changes in the form of some type of global index similar to cues. Unlike cues, though, such an index would have to be mandatory (a file without cues can be played just fine, even seeking works similar to seeking in Ogg files — meaning some kind of binary search).

What does Matroska do if the codec changes? Transport streams can do that, Matroska can't do that. I feel like bringing up such cases just complicates the whole discussion. You can't fix everything at the same time. But you can stall any progress by wanting to consider every possible future feature and requirement.

Besides, as was suggested in a previous post, the denominator part could be overridden per packet. This would cause some bytes of overhead in such obscure cases as mixing multiple framerates that are not known in advance.

In theory Matroska's timestamps can have sample-precision already (just make global timestamp scale small enough to match all of the tracks' time bases). The problem is with the waste of space that follows due to the bloody 16-bit integer offset in Block & SimpleBlock.

I guess you mean the fact that every packet will need its own cluster. But AFAIK that still doesn't give a way to get fractional timestamps? So, not an option.

So if we're thinking about breaking compatibility anyway, why not think about a whole new SimpleBlock V2 that allows for much larger relative timestamps? Would make all existing players incompatible, though.

Obviously not an option. If it were specified, it's likely everyone would disable this by default, except people who use Matroska in special setups where they control producer and consumer.

Another idea that only wastes space but doesn't destroy existing players' ability to play the file: adding a new child to Block called PreciseRelativeTimestamp or whatever that contains the difference between the timestamp-scaling-based timestamp & the actual, precise one, in nanoseconds. Cannot be used with SimpleBlocks, of course. Will take several bytes per BlockGroup.

I thought that was what I proposed here (except I wanted to use fractional numbers).

ghost commented 4 years ago

PS: I think obsessing about a few bytes per packet isn't useful. Having precise timestamps, even if it introduces overhead, is much more important. Nobody will discard Matroska as an option because it doesn't go to the edge of the theoretically possible for saving overhead.

mbunkus commented 4 years ago

What does Matroska do if the codec changes? Transport streams can do that, Matroska can't do that.

True. The difference is that having multiple time bases in the same track is something that exists & works today.

I'm really not trying to prevent progress here, and I'm not talking about each and every possible situation. I am talking about one specific situation that is in widespread use today.

What I am trying to prevent is implementing a scheme that's supposed to improve one aspect that simultaneously makes another aspect worse. Hence me talking about ways to signal a change in time base mid-stream. We'd also have to signal a precise timestamp at the point of change in time base so that the player can reconstruct the whole timeline properly without having to read blocks at each change in time base.

gbooker commented 4 years ago

It seems like there are a few ways discussed to correct this:

  1. Express the time-base in the track so the demuxer can adjust the timestamps in the file to the closest increment of the time-base
  2. Express a fractional error value using a denominator in the track and numerator in the packet so the demuxer can give more precise timestamps
    1. Potentially allow overriding the denominator on a per-packet basis
  3. Express a second timestamp using a fractional time-base stored in the track
    1. Potentially allow just expressing the timestamp in a numerator/denominator so as to ignore/override the track's time-base

All of these would still require the current timestamp to exist and thus would be compatible with current demuxers, but newer demuxers would be able to read/derive more precise timestamps.

It seems the denominator of the rounding error is simply the denominator of the original timestamp.

Close. When I saw this first suggested, I did some math and figured out what it would be for the case of 44.1kHz AAC audio (this is what really sparked this conversation; see below). In this case the samples are 1024/44100 seconds long, with the MKA using 1ms precision on the timestamps, and so the error can be expressed as m/1000 - n*1024/44100 where m is the timestamp in the MKA and n is the packet number. To express the error exactly in integers, the denominator is lcm(1000, 44100) (your basic fractions with common denominator), which in this case is 441000. Using some quick examples:

| Packet number | MKA timestamp | Error (using 441,000 as the denominator) |
|--------------:|--------------:|-----------------------------------------:|
| 0 | 0 | 0 |
| 1 | 23 | 97 |
| 2 | 46 | 194 |
| 3 | 70 | -150 |
| 354 | 8220 | -60 |
| 355 | 8243 | 37 |
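These rows can be reproduced with exact rational arithmetic; a small Python sketch (helper names mine; the sign convention is chosen to match the table's values, i.e. true minus stored):

```python
from fractions import Fraction

SAMPLES, RATE = 1024, 44100   # AAC frame size and sample rate
DEN = 441000                  # lcm(1000, 44100)

def mka_timestamp_ms(n):
    # 1 ms-rounded timestamp actually stored in the MKA for packet n.
    return round(Fraction(n * SAMPLES * 1000, RATE))

def error_numerator(n):
    # Exact error (true minus stored) over the common denominator 441000.
    err = (Fraction(n * SAMPLES, RATE) - Fraction(mka_timestamp_ms(n), 1000)) * DEN
    return err.numerator  # err.denominator is always 1 here
```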

Also worth noting: the duration would likely need the same treatment.

Aside: I sparked this conversation in an internal discussion with @rcombs about AAC 44.1kHz audio in an MKA format. I was remuxing this to MPEG-TS and the MKA had only 1ms precision timestamps. Well, a simple remux would be multiplying these timestamps by 90 to match MPEG-TS. This simple remux resulted in packets at times 0, 2070, 4140, 6300, 8370 … which gave them effective durations of 0, 2070, 2070, 2160, 2070 and this inconsistency would cause stuttering audio in Apple's HLS demuxer. So this meant that remuxing MKA -> MPEG-TS required opening a codec in lavc to get more precise durations and thus derive timestamps without error.
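The failure mode described above is easy to reproduce on paper; a sketch of the naive ms-to-90 kHz remux (variable names mine):

```python
RATE, SAMPLES = 44100, 1024

# 1 ms-rounded MKA timestamps of the first five AAC packets.
mka_ms = [round(n * SAMPLES * 1000 / RATE) for n in range(5)]
# Naive remux to the MPEG-TS 90 kHz clock: multiply by 90.
ts_90k = [m * 90 for m in mka_ms]
# The inter-packet gaps jitter between 2070 and 2160 ticks instead of
# the true ~2089.8 ticks (1024 * 90000 / 44100) per packet.
gaps = [b - a for a, b in zip(ts_90k, ts_90k[1:])]
```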

P.S. These imprecise timestamps were one of the more annoying things we had to deal with in Perian's MKV demuxer, and that was over a decade ago.

robUx4 commented 4 years ago

The reason not to introduce a SimpleBlock v2 is that any hardware/software player that doesn't know it won't be able to play the files. It can be done in Matroska v5. Such files, being unreadable by pre-v5 parsers, will also be marked as such. We might as well call it Matroska2 or something, just like WebM shares a lot with Matroska.

The practical question is whether there is a convenient way to have precise timestamps in v4 and make it work in existing players (and WebM, I know that's something they want as well).

The question about VFR (Variable Frame Rate) is not really an issue IMO. In the end you only have 1 or 2 frame rates mixed, and maybe with the same denominator. All you need is a fraction that handles both. Facebook even created a timebase that covers most common timebases for video. As long as you know the timebases you'll have to deal with before muxing, you should be fine.

An important thing to note is that floating point should not be used at all (we want precision). All we have is the Matroska timebase x/1,000,000,000 s (x=TimestampScale) and the source material timebase(s) (a/44100, b/48000, c/24, d*1001/30000, etc.). They are all fractions. So we should be able to find something that works with just fractions, using common denominators, fraction reduction, etc. It can get to large numbers very quickly as there are multiple tracks with different timebases (or odd fractions when a track uses VFR, see above).

What we have now is a timestamp for each Block as a fraction of TimestampScale/1,000,000,000.

What we want is a timestamp for each Block as a fraction of the source material. The difference between the two values is still a fraction. We can store this difference as a fraction. And we must also store the source material fraction.

Now we just have to do the math to find this "difference as a fraction", in particular to minimize the storage needed to do so, if possible (if not, mandating BlockGroup for precise tracks is always an option). If we can fit it inside the 3 reserved bits of the SimpleBlock, it would be perfect.

t11s commented 4 years ago

ISO/IEC 14496-12 "ISO base media file format" uses a "timescale" (counts per second) and "media sample durations". If timescale=30000 and media sample duration is 1001, you get NTSC fractional frame rate.

Similarly, ISO/IEC 14496-10 "Advanced Video Coding" has a clock tick defined as num_units_in_tick divided by the time_scale (see equation C-1). The presence of these in VUI is indicated by the timing_info_present_flag. For NTSC, time_scale may be 30000 and num_units_in_tick may be 1001.

robUx4 commented 4 years ago

Following my "pure rational numbers" approach we can say the following, for a Track sampled at the original frequency, stored in a Matroska Segment with TimestampScale:

The real timestamp for each sample S is: real(S) = S / frequency

The Matroska timestamp for the same sample is matroska(S) = S * TimestampScale / 1,000,000,000

The Cluster timestamp is just a value to add to S to get the proper value, so we can skip it for now. As we are just checking the rational values, the rounding introduced by divisions is not taken into account.

The difference between the real timestamp and the one we get from Matroska is:

real(S) - matroska(S)
= S / frequency - S * TimestampScale / 1,000,000,000
= (S * 1,000,000,000) / (frequency * 1,000,000,000) - (S * TimestampScale * frequency) / (frequency * 1,000,000,000)
= S * (1,000,000,000 - TimestampScale * frequency) / (frequency * 1,000,000,000)
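The algebra can be sanity-checked with exact rationals; a small sketch (function name mine):

```python
from fractions import Fraction

NS = 1_000_000_000

def drift(S, frequency, ts_scale):
    # real(S) - matroska(S), computed both directly and via the factored form.
    direct = Fraction(S, frequency) - Fraction(S * ts_scale, NS)
    factored = Fraction(S * (NS - ts_scale * frequency), frequency * NS)
    assert direct == factored  # the derivation above
    return direct

# 48 kHz audio with TimestampScale = round(1e9 / 48000) = 20833:
drift(1, 48000, 20833)  # 16000 / (48000 * 1e9) seconds of error per sample
```

Note that the factored form makes the zero-error condition obvious: the drift vanishes exactly when TimestampScale * frequency equals 1,000,000,000.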

We can already deduce a few things from this:

That gives some sampling frequencies where it's possible to achieve 0 error per sample:

That leaves out a lot of common ones:

The other way to reduce the error is to reduce the value of S. We already effectively reduce the value we store to a 16-bit integer, so the value is always between -32,768 and 32,767. If we were to store the error in the remaining 3 bits of a SimpleBlock, that's still 13 bits too many.

By limiting the possible values of S in a Cluster to [-4,3] (3 bits), in other words 8 frames, it is possible to store each frame with the Matroska timestamp and the error based on TimestampScale * frequency. This is also feasible because audio is usually not stored as single samples, but as chunks of samples in one Frame. Sometimes all chunks have the same amount of samples, sometimes not, but each amount of samples is based on the same multiple (the worst-case scenario is many unrelated chunk sizes). For video that means at most 8 frames per Cluster; for a 29.97 fps file that's 267ms. This is very small.

A Block has one extra free bit, so we could double these values. That's still very small IMO. And that's the case where the TimestampScale is precisely adjusted for one track. When you have 2 or more, finding a value of TimestampScale that works well with all frequencies becomes even harder.

I think the scope where it works, even with the proper muxing guidelines, is too narrow to be worth using all the reserved bits. In particular because common frequencies like 44100 Hz or 30000/1001 fps will introduce errors no matter what and will need to use this system.

There could be other clever ways to do this. We could use a bit in the Block that says the timestamp "shift" is stored after/before the Block data, but that would be incompatible with all existing readers. That would be equivalent to using a new lacing format.

Another way would be to force using a BlockGroup to have precise timing and store the "shift" in a new element. It might only need 16 bits of storage, so that would translate into 3 extra octets per BlockGroup.

robUx4 commented 4 years ago

It seems one of the aspects of this not discussed is how the rounding of the current system works and how it could be adapted. We assume that we start with the current system and try to fit the correct fraction in there. We could do it the other way around, i.e. have the fraction and use that to set the Block/Cluster timestamp value. The rounding error is then on older parsers assuming a timestamp value when in fact it's another value. But the old system is already known to be imprecise/inaccurate. It's not assumed to be sample precise. So a little more, a little less rounding error should not be a big deal.

What we cannot really do is add some information per-track to modify how the Block/SimpleBlock values are interpreted. That would break backward compatibility. For that we would need BlockV2 and SimpleBlockV2.

So we could store the TimestampScale and a fraction that is the actual fraction it's based on.

Let's see what happens for 29.97fps video, or 30000/1001 Hz. The most accurate TimestampScale is 33,366,667 (nanosecond per frame/lace, rounded). We also store the Segment timestamp fraction as {30000, 1001}:

| Frame Number | New Block Value | Old Parser timestamp | Real timestamp | Difference |
|---|---|---|---|---|
| 0 | 0 | 0 ns | 0 ns | 0 ns |
| 1 | 1 | 33366667 ns | 33366666 ns | 1 ns |
| 2 | 2 | 66733334 ns | 66733333 ns | 1 ns |
| 3 | 3 | 100100001 ns | 100100000 ns | 1 ns |
| 4 | 4 | 133466668 ns | 133466666 ns | 2 ns |
| 5 | 5 | 166833335 ns | 166833333 ns | 2 ns |
| 6 | 6 | 200200002 ns | 200200000 ns | 2 ns |
| 7 | 7 | 233566669 ns | 233566666 ns | 3 ns |
| 8 | 8 | 266933336 ns | 266933333 ns | 3 ns |
| 9 | 9 | 300300003 ns | 300300000 ns | 3 ns |
| 10 | 10 | 333666670 ns | 333666666 ns | 4 ns |
| .. | .. | .. | .. | .. |
| 65532 | 65532 | 2186584421844 ns | 2186584400000 ns | 21844 ns |
| 65533 | 65533 | 2186617788511 ns | 2186617766666 ns | 21845 ns |
| 65534 | 65534 | 2186651155178 ns | 2186651133333 ns | 21845 ns |
| 65535 | 65535 | 2186684521845 ns | 2186684500000 ns | 21845 ns |

The Old Parser timestamp is the timestamp older parsers would see: Block Value × TimestampScale. The Real timestamp is the one using the fraction: Block Value × 1001 / 30000.
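The two interpretations can be sketched in a few lines of Python (an illustrative sketch, not from the thread; function names are mine):

```python
# Sketch of the 30000/1001 fps case: an old parser multiplies the Block
# value by the rounded TimestampScale; a new parser would use the fraction.
SCALE_NS = 33_366_667  # round(1001/30000 * 1e9), ns per tick

def old_parser_ns(ticks: int) -> int:
    """Timestamp an old parser computes from the Block value."""
    return ticks * SCALE_NS

def real_ns(ticks: int) -> int:
    """Exact timestamp from the {30000, 1001} fraction, truncated to ns."""
    return ticks * 1001 * 10**9 // 30000

for n in (1, 10, 65535):
    print(n, old_parser_ns(n), real_ns(n), old_parser_ns(n) - real_ns(n))
```

The drift grows by roughly 1 ns every 3 frames, reaching 21845 ns after 65535 ticks.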

For 44100 Hz audio we get the following, with a TimestampScale of 22,676 (nanoseconds per sample/lace, rounded).

| Frame Number | New Block Value | Old Parser timestamp | Real timestamp | Difference |
|---|---|---|---|---|
| 0 | 0 | 0 ns | 0 ns | 0 ns |
| 1 | 1 | 22676 ns | 22675 ns | 1 ns |
| 2 | 2 | 45352 ns | 45351 ns | 1 ns |
| 3 | 3 | 68028 ns | 68027 ns | 1 ns |
| 4 | 4 | 90704 ns | 90702 ns | 2 ns |
| 5 | 5 | 113380 ns | 113378 ns | 2 ns |
| 6 | 6 | 136056 ns | 136054 ns | 2 ns |
| 7 | 7 | 158732 ns | 158730 ns | 2 ns |
| 8 | 8 | 181408 ns | 181405 ns | 3 ns |
| 9 | 9 | 204084 ns | 204081 ns | 3 ns |
| 10 | 10 | 226760 ns | 226757 ns | 3 ns |
| .. | .. | .. | .. | .. |
| 1636 | 1636 | 37097936 ns | 37097505 ns | 431 ns |
| 1637 | 1637 | 37120612 ns | 37120181 ns | 431 ns |
| 1638 | 1638 | 37143288 ns | 37142857 ns | 431 ns |
| 1639 | 1639 | 37165964 ns | 37165532 ns | 432 ns |
| 1640 | 1640 | 37188640 ns | 37188208 ns | 432 ns |
| .. | .. | .. | .. | .. |
| 65532 | 65532 | 1486003632 ns | 1485986394 ns | 17238 ns |
| 65533 | 65533 | 1486026308 ns | 1486009070 ns | 17238 ns |
| 65534 | 65534 | 1486048984 ns | 1486031746 ns | 17238 ns |
| 65535 | 65535 | 1486071660 ns | 1486054421 ns | 17239 ns |

The difference is less than one sample. When packed at 40 samples per frame (the shortest packing in @rcombs's example), we would use a fraction of {40, 44100} and a TimestampScale of 907,029:

| Frame Number | New Block Value | Old Parser timestamp | Real timestamp | Difference |
|---|---|---|---|---|
| 0 | 0 | 0 ns | 0 ns | 0 ns |
| 1 | 1 | 907029 ns | 907029 ns | 0 ns |
| 2 | 2 | 1814058 ns | 1814058 ns | 0 ns |
| 3 | 3 | 2721087 ns | 2721088 ns | -1 ns |
| 4 | 4 | 3628116 ns | 3628117 ns | -1 ns |
| 5 | 5 | 4535145 ns | 4535147 ns | -2 ns |
| .. | .. | .. | .. | .. |
| 47392 | 47392 | 42985918368 ns | 42985941043 ns | -22675 ns |
| 47393 | 47393 | 42986825397 ns | 42986848072 ns | -22675 ns |
| 47394 | 47394 | 42987732426 ns | 42987755102 ns | -22676 ns |
| 47395 | 47395 | 42988639455 ns | 42988662131 ns | -22676 ns |
| 47396 | 47396 | 42989546484 ns | 42989569160 ns | -22676 ns |
| 47397 | 47397 | 42990453513 ns | 42990476190 ns | -22677 ns |
| .. | .. | .. | .. | .. |
| 65533 | 65533 | 59440331457 ns | 59440362811 ns | -31354 ns |
| 65534 | 65534 | 59441238486 ns | 59441269841 ns | -31355 ns |
| 65535 | 65535 | 59442145515 ns | 59442176870 ns | -31355 ns |

We get less than one sample of error with 47393 frames stored, or 42 s worth of samples in a Cluster.
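The packed-40 numbers can be checked the same way (a sketch of mine, not from the thread):

```python
# 44100 Hz audio packed at 40 samples per frame: fraction {40, 44100},
# rounded TimestampScale of 907029 ns per tick.
PACKED_SCALE_NS = 907_029  # round(40/44100 * 1e9)

def old_packed_ns(ticks: int) -> int:
    """What an old parser computes from the Block value."""
    return ticks * PACKED_SCALE_NS

def real_packed_ns(ticks: int) -> int:
    """Exact timestamp from the {40, 44100} fraction, truncated to ns."""
    return ticks * 40 * 10**9 // 44100

# One sample at 44100 Hz lasts ~22676 ns; the error stays below one
# sample period up to tick 47393 (~42 s of audio).
print(old_packed_ns(47393) - real_packed_ns(47393))
```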

The worst-case scenario is the highest, not easily divisible, frequency: 352800 Hz. It gives:

| Frame Number | New Block Value | Old Parser timestamp | Real timestamp | Difference |
|---|---|---|---|---|
| 0 | 0 | 0 ns | 0 ns | 0 ns |
| 1 | 1 | 2834 ns | 2834 ns | 0 ns |
| 2 | 2 | 5668 ns | 5668 ns | 0 ns |
| 3 | 3 | 8502 ns | 8503 ns | -1 ns |
| 4 | 4 | 11336 ns | 11337 ns | -1 ns |
| 5 | 5 | 14170 ns | 14172 ns | -2 ns |
| .. | .. | .. | .. | .. |
| 6066 | 6066 | 17191044 ns | 17193877 ns | -2833 ns |
| 6067 | 6067 | 17193878 ns | 17196712 ns | -2834 ns |
| 6068 | 6068 | 17196712 ns | 17199546 ns | -2834 ns |
| 6069 | 6069 | 17199546 ns | 17202380 ns | -2834 ns |
| .. | .. | .. | .. | .. |
| 65533 | 65533 | 185720522 ns | 185751133 ns | -30611 ns |
| 65534 | 65534 | 185723356 ns | 185753968 ns | -30612 ns |
| 65535 | 65535 | 185726190 ns | 185756802 ns | -30612 ns |

Here we achieve less than one sample of error when there are fewer than 6067 samples in a Cluster. This can be doubled by using signed values for the Block timestamp value: the range to get less than one sample of error becomes [-6067, 6066]. And by packing samples by at least 11, we always get less than one sample of error. With 22 samples we get less than half a sample duration of error, which should be enough with rounding.

So with single track files we can probably achieve sample precision easily.

With mixed frequencies it becomes more complicated. For example the 29.97 fps video with the 44100 Hz audio. We have 1001/30000 and 1/44100, so the fraction to use would be 1001/reduced(30000, 44100), where reduced(A, B) is the two numbers multiplied together and divided by their greatest common divisor. In this case (30000 × 44100)/100 = 13,230,000. That gives a rounded TimestampScale of 75,661 ns/tick.

That gives these Blocks:

| Block Value | Old Parser timestamp | Real timestamp | Difference |
|---|---|---|---|
| 0 | 0 ns | 0 ns | 0 ns |
| 1 | 75661 ns | 75661 ns | 0 ns |
| 2 | 151322 ns | 151322 ns | 0 ns |
| 3 | 226983 ns | 226984 ns | -1 ns |
| 4 | 302644 ns | 302645 ns | -1 ns |
| 5 | 378305 ns | 378306 ns | -1 ns |
| 6 | 453966 ns | 453968 ns | -2 ns |
| 7 | 529627 ns | 529629 ns | -2 ns |
| 8 | 605288 ns | 605291 ns | -3 ns |
| 9 | 680949 ns | 680952 ns | -3 ns |
| 10 | 756610 ns | 756613 ns | -3 ns |
| 11 | 832271 ns | 832275 ns | -4 ns |
| 12 | 907932 ns | 907936 ns | -4 ns |
| .. | .. | .. | .. |
| 441 | 33366501 ns | 33366666 ns | -165 ns |
| .. | .. | .. | .. |
| 882 | 66733002 ns | 66733333 ns | -331 ns |
| .. | .. | .. | .. |
| 1323 | 100099503 ns | 100100000 ns | -497 ns |
| .. | .. | .. | .. |
| 32634 | 2469121074 ns | 2469133333 ns | -12259 ns |
For the video track we would get something like this:

| Frame Number | New Block Value | Old Parser timestamp | Real timestamp | Difference |
|---|---|---|---|---|
| 0 | 0 | 0 ns | 0 ns | 0 ns |
| 1 | 441 | 33366501 ns | 33366666 ns | -165 ns |
| 2 | 882 | 66733002 ns | 66733333 ns | -331 ns |
| 3 | 1323 | 100099503 ns | 100100000 ns | -497 ns |
| .. | .. | .. | .. | .. |
| 74 | 32634 | 2469121074 ns | 2469133333 ns | -12259 ns |
| .. | .. | .. | .. | .. |
| 148 | 65268 | 4938242148 ns | 4938266666 ns | -24518 ns |

We can store almost 5 s in a Cluster.

For the audio track, on the other hand, we cannot recover each sample easily:

| Sample Number | Real timestamp | Block Value |
|---|---|---|
| 0 | 0 ns | 0 |
| 1 | 22675 ns | ~0 |
| 2 | 45351 ns | ~1 |
| 3 | 68027 ns | ~1 |
| 4 | 90702 ns | ~1 |
| 5 | 113378 ns | ~1 |
| 6 | 136054 ns | ~2 |
| 7 | 158730 ns | ~2 |
| .. | .. | .. |
| 300 | 6802721 ns | ~90 |

The Block Value doesn't map to an exact sample timestamp (and vice versa).

It seems that if we apply a factor of 3 we may get better results. So we could have a Segment fraction of 1001/(3 × 13230000), with a rounded TimestampScale of 25220 ns/tick.

| Sample Number | Block Value | Old Parser timestamp | Real timestamp | Difference |
|---|---|---|---|---|
| 0 | 0 | 0 ns | 0 ns | 0 ns |
| 1 | 1 | 25220 ns | 22675 ns | 2545 ns |
| 2 | 2 | 50440 ns | 45351 ns | 5089 ns |
| 3 | 3 | 75660 ns | 68027 ns | 7633 ns |
| 4 | 4 | 100880 ns | 90702 ns | 10178 ns |
| 5 | 5 | 126100 ns | 113378 ns | 12722 ns |
| 6 | 6 | 151320 ns | 136054 ns | 15266 ns |
| 7 | 7 | 176540 ns | 158730 ns | 17810 ns |
| 8 | 8 | 201760 ns | 181405 ns | 20355 ns |
| 9 | 9 | 226980 ns | 204081 ns | 22899 ns |
| 10 | 10 | 252200 ns | 226757 ns | 25443 ns |
| .. | .. | .. | .. | .. |
| 65534 | 65534 | 1652767480 ns | 1486031746 ns | 166735734 ns |
| 65535 | 65535 | 1652792700 ns | 1486054421 ns | 166738279 ns |

We lose about one sample of precision every 10 samples, or 10%. For a full Block that's about a 166 ms shift (or rather half of that when using signed 16 bits). That's a lot. Even packed at 40 samples per frame that's still about 20 ms, when such a frame lasts about 1 ms.

If we use the full fraction {1001, 30000*44100} we cannot store more than one video frame per Cluster.

There doesn't seem to be a system where storing the Block value as a real fraction value works, at least when mixing "heterogeneous" frequencies. It works with single tracks or frequencies that are easily divisible. And not if we want to keep backward compatibility (Block/SimpleBlock).

robUx4 commented 4 years ago

A little background on this: for adaptive streaming it's important, when you switch from one "quality" (representation) to another, to switch on exactly the frame and audio sample you want. I don't know if they are sample exact for audio, especially as each codec (or different encoding parameters) may pack a different amount of samples per frame, so the boundaries don't totally overlap. Maybe there's an offset that tells on which sample to start. Or an exact clock gives the exact timestamp for each sample in each representation anyway.

Given that, the important phrase here is

So with single track files we can probably achieve sample precision easily.

In adaptive streaming you don't (usually) use muxed tracks. So you can pick each channel independently with the best possible choice at any given time. So in these conditions we can be sample precise. All we need is to tell the original clock (numerator/denominator) of the Track. A new parser would use that value with the Block timestamp value. Older parsers would not see it and would use the Block timestamp value with the global TimestampScale. As described above, the difference is minimal, as long as the TimestampScale is matching the fraction.

I'll send a proposal for new elements to store this fraction and the necessary changes on how to interpret the timestamps.

robUx4 commented 4 years ago

The larger problem is that we want a rational number that works for all the tracks (theoretically possible) and at the same time has a sensible value that will not require huge values of the numerator for each timestamp in a Block. We only have 16 bits there. As seen above, in most cases it doesn't work. And that's because we have one global "clock" defining all Block (and Cluster and more) "ticks".

We could however alter the interpretation of each Block value to adjust to a better "clock" that works for that track, so that we end up with a better range of values for the numerator. And luckily we already have TrackTimestampScale! It's a float number to apply to each Block tick value to get the proper timestamp for that Block (or Track in general). It is currently marked as deprecated because its usage was limited, as it's a float, and it was supposed to allow changing timestamps without remuxing a track. But that's not convenient at all.

But just like we introduced a rational number to use instead of the TimestampScale, we can use a rational number instead of TrackTimestampScale, with TrackTimestampScale holding the rounded value of this rational number. Despite being deprecated, TrackTimestampScale is supported in at least the libavformat (FFmpeg) and VLC demuxers. It's possible it's not supported in a lot of demuxers, especially since it was marked as deprecated anyway. For example it's not supported in WebM. But that's less of a problem as they tend to add new elements when they need them.

So a Block timestamp would be `( ( Block tick * TrackTimestampScale ) + Cluster tick ) * TimestampScale`. The formula on the old website (and in the current RFC draft) is incorrect, as it applied the TrackTimestampScale to the Cluster tick as well. The VLC code seems to use it incorrectly (I can fix that) but the libavformat code seems to be correct. In both cases, adding support for sample-accurate timestamps would mean fixing those as well.

In a new parser TimestampScale and TrackTimestampScale would both be rational numbers. In an old parser the TimestampScale would be the rounded nanosecond-based value and TrackTimestampScale the floating point value of TrackTimestampScale. They would be less precise, but they were never meant to be precise anyway.

robUx4 commented 4 years ago

So let's take the previous example that didn't work: 29.97 fps video with 44100 Hz audio. Now we can have TimestampScale × TrackTimestampScale = 1001/30000 for video and TimestampScale × TrackTimestampScale = 1/44100 for audio (or 40/44100 if samples are always packed by 40, but we don't even need that). We can represent 65536 ticks for each Track in a Cluster.

Now the critical part is the Cluster tick value. To have sample-accurate values on each Block it also has to provide ticks that are sample accurate for both tracks. In this case a (rational) TimestampScale of 1/(30000 × 441) should do it. All ticks of the 1/44100 clock are represented (0, 300, 600, 900, 1200, etc.) on this clock. All ticks of the 1001/30000 clock are also represented (0, 1001 × 441, 2 × 1001 × 441, 3 × 1001 × 441, etc.) on this clock. In a 24 h movie that's 24 × 60 × 60 × 30000 × 441 ticks, which is still a small value (0x10A24668800 in hexadecimal) compared to the 64 bits of room we have for each Cluster Timestamp.

There is a slight problem though. The rounded TimestampScale would be 76 ns. Over 24 h the "old clock" tick would be 24 × 60 × 60 × 30000 × 441 × 76 ns, or 86,873.472 s, or 24.13 h. That's a 0.548% error. In general that system is used with a 1 ms precision, resulting in even more inaccurate values for the 33.366 ms video frame durations. So it shouldn't have any impact.

Now what is the magic formula to get the proper rational TimestampScale (TimestampNumerator and TimestampDenominator)? It looks like TimestampDenominator = SamplingFreqADenominator × SamplingFreqVDenominator / GCD(SamplingFreqADenominator, SamplingFreqVDenominator), where the GCD() function gives the greatest common divisor of A and V. But that's not the value we used. Both 30000 and 441 are divisible by 3, so it should be 10000 × 441. That gives a legacy TimestampScale of 227 ns, which should give a smaller difference between the 2 systems.
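The corrected denominator and legacy scale can be checked with `math.gcd` (a sketch under the comment's numbers; `math.gcd` returns the greatest common divisor):

```python
import math

a_den, v_den = 44100, 30000  # audio and video tick denominators
# Shared denominator: the least common multiple of the two denominators
den = a_den * v_den // math.gcd(a_den, v_den)
print(den)  # 4410000, i.e. 10000 * 441

# Legacy (rounded, nanosecond-based) TimestampScale for this clock
legacy_scale_ns = round(10**9 / den)
print(legacy_scale_ns)  # 227
```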

With this value the rational TrackTimestampScale values would be 100/1 for audio and (1001 × 147)/1 for video, also stored as floating point values in the legacy field. For audio the Block ticks would result in:

Block timestamp = ( ( Block tick * TrackTimestampScale ) + Cluster tick ) * TimestampScale
Block timestamp = ( ( Block tick * 100/1 ) + Cluster tick ) * 1/(10000 * 441)
Block timestamp = ( Block tick * 100/1 ) * 1/(10000 * 441) + Cluster tick * 1/(10000 * 441)
Block timestamp = Block tick * 100 / (10000 * 441) + Cluster tick / (10000 * 441)
Block timestamp = Block tick / (100 * 441) + Cluster tick / (10000 * 441)
Block timestamp = Block tick / 44100 + Cluster tick / (10000 * 441)

For video the Block ticks would result in:

Block timestamp = ( ( Block tick * TrackTimestampScale ) + Cluster tick ) * TimestampScale
Block timestamp = ( ( Block tick * 1001 * 147 ) + Cluster tick ) * 1/(10000 * 441)
Block timestamp = ( Block tick * 1001 * 147 ) * 1/(10000 * 441) + Cluster tick * 1/(10000 * 441)
Block timestamp = ( Block tick * 1001 * 147 ) / (10000 * 441) + Cluster tick / (10000 * 441)
Block timestamp = Block tick * 1001 / 30000 + Cluster tick / (10000 * 441)
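Both reductions can be verified with exact rational arithmetic (a sketch using Python's `fractions`; constant names are mine):

```python
from fractions import Fraction

TIMESTAMP_SCALE = Fraction(1, 10000 * 441)  # rational TimestampScale, in seconds
AUDIO_TTS = Fraction(100, 1)                # audio TrackTimestampScale
VIDEO_TTS = Fraction(1001 * 147, 1)         # video TrackTimestampScale

# Each track's effective timebase is TrackTimestampScale * TimestampScale
assert AUDIO_TTS * TIMESTAMP_SCALE == Fraction(1, 44100)     # one audio sample
assert VIDEO_TTS * TIMESTAMP_SCALE == Fraction(1001, 30000)  # one video frame
print("both timebases are exact")
```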

It seems we have a system that works well for two tracks. It works just as well with more tracks, as long as the GCD of all SamplingFreq Denominators is big enough, resulting in a rounded legacy TimestampScale that should be above 50 and MUST NOT be 1, let alone 0.

robUx4 commented 4 years ago

There is a small problem with the audio in the example above: we only get 65536/44100 seconds possible per Cluster. But audio samples are usually packed by a fixed number of samples, or a variable number with a common base number, or even a multiple of 4. That packing unit number can be set as the numerator of the audio TrackTimestampScale, which would then be Packing Unit × 100 / 1. That multiplies the possible amount of audio per Cluster. Even a packing unit of 4 would give 5.9 s of audio per Cluster, which is good enough.

robUx4 commented 4 years ago
So what happens when using only the legacy values to compute the timestamps? In the example above, the TimestampScale is 227 ns, the audio TrackTimestampScale is 100.0f and the video TrackTimestampScale is 147147.0f. The first audio ticks are represented like this:

| Audio Tick | Real timestamp (ns) | Block Value | Timestamp (ns) | Difference |
|---|---|---|---|---|
| 0 | 0.0 | 0 | 0 | 0.0 |
| 1 | 22675.7 | 1 | 22700 | -24.3 |
| 2 | 45351.5 | 2 | 45400 | -48.5 |
| 3 | 68027.2 | 3 | 68100 | -72.8 |
| 4 | 90702.9 | 4 | 90800 | -97.1 |
| 5 | 113378.7 | 5 | 113500 | -121.3 |
| .. | .. | .. | .. | .. |
| 65533 | 1486009088.0 | 65463 | 1486010112 | -1024.0 |
| 65534 | 1486031744.0 | 65464 | 1486032768 | -1024.0 |
| 65535 | 1486054400.0 | 65465 | 1486055552 | -1152.0 |

The Block Value is the integer stored in the Block, based on the real timestamp, the TimestampScale and the TrackTimestampScale. The second timestamp is the one a parser would deduce from the Block Value, the TimestampScale and the TrackTimestampScale. The difference between the deduced and real timestamps happens because the 227 ns TimestampScale is not an exact value.

In the end, in the whole Cluster, the difference is always less than 11392 ns. That's less than one audio tick. That means even without adding any element, just reviving the floating point TrackTimestampScale, we could store sample accurate timestamps. We just need to apply the rules above to find the proper TimestampScale and the TrackTimestampScale, as if they were handled as rational numbers.
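The bound can be checked by brute force over a whole Cluster (an illustrative sketch of mine; the exact per-tick errors may differ slightly from the table above, which shows float-rounded values):

```python
# Legacy reconstruction: Block value * TrackTimestampScale (100.0) *
# TimestampScale (227 ns). Each 44100 Hz tick is stored as the nearest
# multiple of the 22700 ns step, so the error is bounded by half a step.
STEP_NS = 227 * 100

max_diff = 0.0
for tick in range(65536):
    real = tick * 10**9 / 44100               # exact sample timestamp, ns
    approx = round(real / STEP_NS) * STEP_NS  # what a legacy parser deduces
    max_diff = max(max_diff, abs(real - approx))

print(max_diff)  # stays below half a step (11350 ns), well under one sample
```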

robUx4 commented 4 years ago

We could however add the original clock in each track to give the reader an accurate way to round the values (i.e. to recover the values of the second column when the values of the fourth column are computed). We have SamplingFrequency but it's a floating point value. We should also do the same for video tracks. So probably a rational value stored with the generic TrackEntry fields.

robUx4 commented 4 years ago

I did a test program to run different scenarios: all the audio/video sampling frequencies listed above, mixed (1 audio/1 video). The program can be found here.

The result of the run of this program is found in this dirty Markdown file.

In some cases there are some rounding errors that can't be recovered. There are also many cases where the possible duration of audio in a Cluster is way too small. So I added some examples with common packing and then the duration is much more usable. And then it also avoids the rounding errors (see "Audio 11025 Hz (128 packed)" for example).

The video errors are always negligible as they occur only after very long durations. Durations that are impossible to reach given the duration constraints on the audio.

Maybe I can add an extra layer and try the common packing sizes mentioned by @rcombs. But from a first look it seems to solve both the limited duration of audio in a Cluster and the possible rounding errors.

robUx4 commented 4 years ago

After computing the TrackTimestampScale of each track as a floating point value, rather than an integer (to match what a rational value would be), we can counter the rounding error introduced by the small TimestampScale in the most tricky cases.

In the end, the only errors (half a tick, meaning the wrong sample/tick would be assumed on the output of the demuxer) occur on video tracks, in rare cases, and after long durations in a Cluster (145 s minimum, which is a lot).

The only problem remaining is that the amount of audio samples possible in a Cluster with such small TimestampScale values is small. Sometimes only 0.19 s is possible (16-bit ticks). That can be solved by packing samples to achieve a possible duration per Cluster over 5 s (the commonly acceptable amount). For the cases where only 0.19 s is possible (352800 Hz), packing by at least 263 samples should be sufficient. In most cases even packing 10 samples is sufficient.

robUx4 commented 4 years ago

In the end the packing problem is directly related to the sampling frequency of the audio. This problem exists regardless of the sample accuracy of timestamps. High sampling frequency requires enough packing of samples to fit a useful duration in a Cluster.

This problem aside, we can always use TrackTimestampScale with a rounded TimestampScale (based on audio denominator × video denominator / GCD) to achieve sample accuracy.

Mixing more than one audio track might cause some problems if the sampling frequencies differ too much (don't fit the GCD). But for 2 tracks it's achievable all the time.

robUx4 commented 4 years ago

I made a small calculation error in my tests, as the original sample frequency numerator was not used to compute the real timestamp. With examples where the numerator was artificially inflated (1/24 = 1000/24000 = 2000/48000) to try to match audio ranges, it reported an incorrect error. In fact, in all cases there is no error on the audio or video tracks.

The TrackTimestampScale countering the rounding of the TimestampScale is so efficient, it works even with a TimestampScale of 1 ns in all tests. Which means it also works regardless of the number of tracks and their sampling frequencies. It could even be used for frequencies higher than a GHz (< 1 ns period).

robUx4 commented 4 years ago

So the real problem left is the amount of audio possible per Cluster. As said before, high sampling frequencies require enough packing of samples to fit a useful duration in a Cluster. This is not a new problem. And the TrackTimestampScale has no effect on it (I double-checked). There are 65536 ticks per Cluster per Track possible (always with the proper TrackTimestampScale, which now always has an optimal range).

Audio codecs usually pack samples in a fixed amount (or a few possible fixed values that may change within the same stream). Raw audio can do the same. In this case we can compute the TrackTimestampScale based on the "packing frequency" rather than the sampling frequency. For example the "packing frequency" of 10 samples packed at 44100 Hz would be 4410 Hz, allowing 10× more duration per Cluster. In other words, each tick is worth 10× more duration than without packing.
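The effect of packing on Cluster capacity is simple arithmetic (a sketch; the helper name is mine):

```python
def cluster_seconds(sampling_hz: int, samples_per_block: int = 1) -> float:
    """Maximum duration per Cluster with 16-bit Block tick values,
    when each tick is worth samples_per_block sample periods."""
    return 65536 * samples_per_block / sampling_hz

print(cluster_seconds(44100))      # ~1.49 s unpacked
print(cluster_seconds(44100, 10))  # ~14.9 s with 10 samples per tick
print(cluster_seconds(13107))      # ~5.0 s, the break-even frequency
```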

If we consider it should always be possible to store at least 5s of audio per Cluster, then the problem starts at frequencies higher than 13107 Hz (65536 ticks / 5 s). That's pretty much all the time.

With packing we don't get the timestamp of each individual sample, only the timestamp of the first sample of each pack. But since we know the sampling frequency of the audio (SamplingFrequency element), we can tell the exact timestamp of all the other samples.

robUx4 commented 4 years ago

For Variable Framerate of video (I suppose it's rare for audio) there is another problem. There isn't one frequency. For example there might be film source (24 fps) and NTSC video source (29.97 fps) mixed in the same Segment. Or video captures that are sometimes at 60 fps, sometimes 144 fps and sometimes values that just occur when they can (if the game is able to control the V-Sync directly).

I suppose other containers will also have a hard time giving an exact timestamp value for each frame.

In the case of 2 fixed sources mixed together, it should be possible to accommodate the TrackTimestampScale to both using the rational fractions. It will reduce the duration possible for that track. But even for 121 and 123 fps that gives a mixed frequency of 14,883 Hz (with ticks from one or the other falling on exact ticks of this clock). That's more than 13107 Hz, which means we can't store 5 s in a Cluster, but it's pretty close (4.4 s).

For too many or too heterogeneous sources there's not really a good solution. But these sources are doomed to never have accurate timestamps anyway. In that case a resolution of 0.1 ms (10000 Hz) should give a good estimate and enough duration (6.5 s) per Cluster.

robUx4 commented 4 years ago

Given all this I think #437 is a good all around solution. It may not even require to store the exact fraction of the original (although it's probably needed to remux into other containers).

TrackTimestampScale has been in the Matroska specs forever and is supposed to be used by demuxers, so extending it in newer versions of Matroska should be a no-brainer. Unfortunately there's a high chance it's not used properly. Since no one has really used it so far (AFAIK), it's usually assumed to be 1.0 and discarded, all the math on timestamps being done with integers.

The proposed solution radically changes that. Almost all the time the TrackTimestampScale will have a value very far from 1.0 (up to 10,416,667 in the frequencies I tested). For all parsers not using the TrackTimestampScale, only the first timestamp of a Cluster will be usable (tick 0); the rest will look very odd (usually way too small). It should always be possible to adjust the TimestampScale so that one track has a TrackTimestampScale of 1.0; it should be the track with the highest "packing frequency". All other tracks will be almost unusable to a non-conformant parser, but at least one track will be usable (most likely the best audio track).

robUx4 commented 4 years ago

I think libavformat and the libmatroska based demuxers (including VLC) should handle this properly. That already covers a lot of players, demuxers, muxers.

TrackTimestampScale (formerly known as TrackTimecodeScale) is not part of WebM. So parsers exclusively dealing with WebM (the Firefox one, dunno about Chromium) may have issues with this.

Most TV/streaming boxes are probably not using libavformat or libmatroska so I'm not sure they handle this properly either.

robUx4 commented 4 years ago

The fact that each TrackTimestampScale should be computed using the "packing frequency" of the track will also add some friction. Matroska has always been codec agnostic, i.e. it doesn't need to know anything about a codec to mux it (although it does store information about the codec). Now, to mux "accurately", we need to know, before writing/having any frame, how many samples will be found in each codec frame. Most codecs have a fixed value, so it won't be too hard. But modern codecs have many window sizes, so it can become tricky to know exactly what to use; in some cases there might not even be a common factor. In that case the "packing frequency" equals the "sampling frequency", and we can't store a lot of samples per Cluster, or we just give up on sample accuracy.

So I think for every (audio) codec we should mention the number of samples per frame that can be used safely (sampling frequency / that number = packing frequency). That should be done in the codec specs.

robUx4 commented 4 years ago

There is also the question of Cues and Chapters. Their timestamps are stored as "absolute" values (which means in nanoseconds). Introducing the TrackTimestampScale as an "unrounded" floating point value means the actual timestamp of a frame is no longer an exact multiple of the global TimestampScale.

So Cues/Chapters referencing a particular frame (or audio block) should use the same value that will come out of the demuxer. That value is not always exactly the value that was written, but the error is small enough not to mistake it for another sample. And since demuxers/players compare values when seeking, it's better to match exactly the value that will be read from the file.

Using packed audio with a factor on the TrackTimestampScale also means it won't be possible to reference (audio) samples individually. The granularity will depend on the factor applied to the TrackTimestampScale. For example, for samples packed by 40, the factor applied may be 2, 4, 5, 8, 10, 20, or 40, allowing more or less duration per Cluster and more or less Cue precision inside a frame of a Block.

robUx4 commented 4 years ago

Actually, once you have the timestamp of the first sample, you don't need to know the rest of the timing in the packed audio. Picking the right sample happens outside the container level.

The timestamp in Cues/Chapters can either use the real timestamp (in nanoseconds) of the sample, or shift it the same way the timestamp of the first sample in the Block is shifted. The difference is a rounding error that in the end still resolves to the referenced sample. Apart from this, the container doesn't use the values for exact comparison, so it has no impact there. I think it's better to use the real sample timestamp in that case.

robUx4 commented 3 years ago

I modified libavformat to be able to create files with the right TrackTimestampScale to get accurate timestamps. The code is in this branch: https://github.com/FFmpeg/FFmpeg/compare/master...robUx4:tracktimestampscale/0

To enable it add the -sample_accurate 1 option when muxing. It has no impact when muxing WebM. For now the packing amount is forced to 8 samples for audio.

Files created this way play in VLC after some changes in VLC and libmatroska. It turns out libmatroska is not as ready to use the TrackTimestampScale as I thought :( Even KaxTrackEntry's SetGlobalTimecodeScale() takes an unsigned integer, when it should be a float/double...

robUx4 commented 3 years ago

Related libmatroska and VLC patches

robUx4 commented 3 years ago

I think we can close this issue since now it's a matter of TrackTimestampScale support in muxers and demuxers/players.

We also need to provide the number of samples per block per codec but that's for #439.

Xaymar commented 3 years ago

Edit: This appears to already have been implemented in #425.

I'm fairly sure that the proposed fix is not a fix at all, just a workaround for the underlying issue. Approximating framerates in multiples of nanoseconds will not give you accurate results; instead a numerator/denominator structure should be adopted for Matroska v4/v5. Below is an example of how this field could be defined:

  <element name="TrackTimestampNumerator" path="\Segment\Tracks\TrackEntry\TrackTimestampNumerator" id="0x1E7" type="uinteger" minver="4" range="not 0" maxOccurs="1">
    <documentation lang="en" purpose="definition">If present, defines the numerator used for further time stamp calculations. The formula for calculating the time stamp thus becomes (n * TrackTimestampNumerator / TrackTimestampDenominator) seconds, where n is an unsigned integer defined by the Timestamp element in the cluster plus the block offset. Applications are free to convert the fractional time code to any internal format, but are recommended to keep at least 100 microseconds of precision.</documentation>
    <extension type="webmproject.org" webm="0"/>
    <extension type="libmatroska" cppname="TrackTimecodeNumerator"/>
  </element>
  <element name="TrackTimestampDenominator" path="\Segment\Tracks\TrackEntry\TrackTimestampDenominator" id="0x1E8" type="uinteger" minver="4" range="not 0" maxOccurs="1">
    <documentation lang="en" purpose="definition">If present, defines the denominator used for further time stamp calculations. The formula for calculating the time stamp thus becomes (n * TrackTimestampNumerator / TrackTimestampDenominator) seconds, where n is an unsigned integer defined by the Timestamp element in the cluster plus the block offset. Applications are free to convert the fractional time code to any internal format, but are recommended to keep at least 100 microseconds of precision.</documentation>
    <extension type="webmproject.org" webm="0"/>
    <extension type="libmatroska" cppname="TrackTimecodeDenominator"/>
  </element>

This permanently fixes the issue of poor time bases in Matroska by using simple math.

mkver commented 3 years ago

I think we can close this issue since now it's a matter of TrackTimestampScale support in muxers and demuxers/players.

We also need to provide the number of samples per block per codec but that's for #439.

This sounds like you think your proposal to be a solution to this problem. I disagree:

a) I thought that we were looking for a compatible change, not for something that won't work with older players. Relying on players supporting a deprecated feature that is intended to solve something else and hardly ever used is bad enough, but you are using it with semantics that differ from the current specifications. The latter point alone is IMO reason enough not to reuse this field.

b) You are often comparing the maximum error that can happen inside a cluster when using the new parsing method vs when using the old parsing method (by which you mean a demuxer supporting a)); yet this is IMO flawed: Muxers don't add the duration of all the frames written so far to get the timestamp of the next frame; instead they convert the timestamp of each frame to the timebase used by the muxer. And if the source timestamps are exact, then the errors won't accumulate, there will just be a bit of jitter (e.g. when using the default 1 ms timestamps, DTS will have timestamps of 11 ms, 21 ms, 32 ms instead of exact multiples of 10 2/3 ms). Yet if I am not mistaken, then making the timestamps precise for the new parser disallows this jitter and will make the errors accumulate for an old parser.

c) I furthermore don't like floats and the potential for non-portability that they bring.

I think it has already been mentioned earlier, but there is a way to add precise timestamps in a compatible way, by adding a rational, track-dependent timebase to each TrackEntry and by specifying a precise way to convert from the time as currently parsed to the exact time: Just round the inexact time to the nearest integral multiple of the track's timebase (the rounding in case it is exactly in the middle of two such integral multiples also needs to be exactly defined (e.g. "always round up in this case")).
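That rounding rule can be sketched as follows (my illustration of the proposal, with ties rounded up as suggested; names are mine):

```python
import math
from fractions import Fraction

def snap_to_timebase(t_ns: int, timebase_s: Fraction) -> Fraction:
    """Round an inexact nanosecond timestamp to the nearest integral
    multiple of the track timebase, rounding up on exact ties."""
    step_ns = timebase_s * 10**9
    n = math.floor(Fraction(t_ns) / step_ns + Fraction(1, 2))
    return n * step_ns

# A 30000/1001 fps frame parsed as the inexact 33366666 ns snaps back to
# exactly one frame period (1001/30000 s), expressed in nanoseconds.
print(snap_to_timebase(33366666, Fraction(1001, 30000)))
```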

(The above procedure has the advantage that we don't need the least common multiple of the timebases of multiple tracks; yet it shares with the other procedures the disadvantage that if a track's timebase is small, then we need to use a small global timebase (it must be smaller than the track's timebase so that each of the track's possible timestamps is attainable). Of course, if we already know that the content is cfr, then we can use this knowledge to choose a bigger timebase, thereby allowing longer clusters. This is similar to your proposal.)

Of course, we would also need to add a rational analogue for the default duration (or we use a method just like the above for it: Round the ordinary default duration to the nearest multiple of the track's timebase).

PS: Sorry for ignoring this issue for so long.

robUx4 commented 3 years ago

Correct, it's a workaround that's not purely mathematically correct. But it's compatible with existing players, at least spec wise.

I'm fine with adding a mathematically correct feature in Matroska v5. In the end it will likely be a fractional value of the TrackTimestampScale so that players that only understand that value will also get more sample accuracy.

mbunkus commented 3 years ago

To be honest, I'm not a fan of the proposed method. As we all seem to agree, it's at best a workaround. And here's why I don't like it: a workaround is a tradeoff between different concerns, e.g. the amount of time required to fix the issue properly vs. the remaining incorrectness of the workaround.

The thing is, the workaround doesn't actually work with existing players out there. While it is comparatively easy to implement support for the method in the most notable software implementations (VLC, ffmpeg/libav*, MKVToolNix), it is completely unrealistic to assume that existing hardware players will ever be updated to support this method, making all files created with this method unplayable on hardware devices.

For me this tilts the balance so far into the wrong direction that I'm not in favor of this method.

hubblec4 commented 3 years ago

I agree with @mbunkus, and I thought we would implement the RationalType.

rcombs commented 2 years ago

For the record, I'm also fond of the solution @mkver mentions (keeping the backwards-compatible imprecise timestamps in Cluster, Block, etc, but rounding them to the nearest multiple of the per-track rational timebase). This just means that when handling files with very short frames, we'd need to require that the legacy timebase be at least twice as precise as the codec rational timebase (so e.g. 453514ns for 48kHz TrueHD).

mkver commented 2 years ago

Let q and q' be timebases with q finer than q' (i.e. q < q'). Let t be exactly representable in q', i.e. t = k' q' with integral k', and write t = (k + r) q with integral k and -1/2 <= r <= 1/2. Then k q = k' q' - r q = (k' + r') q' with r' = -r q / q'. Moreover |r'| = |r| q / q' < 1/2, due to the assumption that q is finer than q'.

This implies: if q' is a timebase in which all timestamps of a given track are exactly representable, if q is a finer timebase than q', and if "round-to-nearest" is used for the transformation from q' to q, then the "round-to-nearest" transformation back from q to q' will exactly recover the original timestamps. It does not matter which value is used in case the nearest timestamp representable in q is not uniquely determined. It also follows that there is a unique nearest value for the transformation from q to q', as long as one restricts oneself to timestamps exactly representable in q that emanate from timestamps in q' by rounding to nearest. (As an example, consider q = 1/4 ms and q' = 1/3 ms. Then the timestamp 1/3 ms will be rounded to 1 × 1/4 ms and 2/3 ms will be rounded to 3 × 1/4 ms; when transformed back, the original values of 1/3 ms and 2/3 ms are recovered and there is no uniqueness problem. There is one for 2 × 1/4 ms, but no value representable in q' leads to this value.)
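The 1/4 ms vs. 1/3 ms example can be checked mechanically. A sketch with a hypothetical `nearest_multiple` helper, ties rounding up (the choice of tie-break is irrelevant here, as argued above):

```python
from fractions import Fraction

def nearest_multiple(t: Fraction, q: Fraction) -> Fraction:
    """Nearest integral multiple of q; ties round up."""
    k = t // q                # floor; Fraction // Fraction yields an int
    r = t - k * q             # remainder, 0 <= r < q
    return (k + 1) * q if 2 * r >= q else k * q

q_fine   = Fraction(1, 4)    # storage timebase q  = 1/4 ms
q_coarse = Fraction(1, 3)    # track timebase   q' = 1/3 ms (q < q')

for k in range(1, 13):
    exact  = k * q_coarse                      # original timestamp, exact in q'
    stored = nearest_multiple(exact, q_fine)   # what would be written to the file
    assert nearest_multiple(stored, q_coarse) == exact   # what is recovered
```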

It is not necessary for this that q is at least twice as precise as q'. If it is, then there will never be multiple nearest values during the transformation from q' to q. But as seen above, this is not a problem anyway.

The same reasoning as above also establishes the following variants: if the muxer always rounds up/down, then the timestamps can be exactly recovered if the demuxer always rounds down/up. This is due to r and r' having opposite signs (if nonzero).
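The round-down/round-up variant can be sketched the same way, with the same q and q' as in the example above (helper names are illustrative):

```python
from fractions import Fraction

def floor_mult(t: Fraction, q: Fraction) -> Fraction:
    return (t // q) * q          # muxer always rounds down to a multiple of q

def ceil_mult(t: Fraction, q: Fraction) -> Fraction:
    return -((-t) // q) * q      # demuxer always rounds up to a multiple of q

q_fine, q_coarse = Fraction(1, 4), Fraction(1, 3)   # q < q', as before

for k in range(1, 13):
    exact  = k * q_coarse
    stored = floor_mult(exact, q_fine)            # rounded down when written...
    assert ceil_mult(stored, q_coarse) == exact   # ...rounded up when read back
```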

From the above it follows that the common case of cfr content can be easily supported via an additional rational timebase; moreover, for common codecs, it is not even necessary to remux the files, as the common 1ms timebase is good enough: All that is left to do is add some header elements via mkvpropedit.
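A sketch of that claim for a few illustrative track timebases (frame durations assumed here: AAC at 1024 samples @ 48 kHz, AC-3 at 1536 samples @ 48 kHz, NTSC video at 30000/1001 fps), showing that the default 1 ms timebase round-trips their timestamps exactly:

```python
from fractions import Fraction

def nearest_multiple(t: Fraction, q: Fraction) -> Fraction:
    """Nearest integral multiple of q; ties round up."""
    k = t // q
    r = t - k * q
    return (k + 1) * q if 2 * r >= q else k * q

MS = Fraction(1, 1000)   # the default 1 ms timebase, in seconds

for frame_dur in (Fraction(1024, 48000),    # AAC frame duration
                  Fraction(1536, 48000),    # AC-3 frame duration
                  Fraction(1001, 30000)):   # NTSC frame duration
    assert MS < frame_dur        # 1 ms is strictly finer than the track timebase
    for n in range(1000):
        exact  = n * frame_dur
        stored = nearest_multiple(exact, MS)   # as stored in existing files
        assert nearest_multiple(stored, frame_dur) == exact
```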