sandreas / m4b-tool

m4b-tool is a command line utility to merge, split and chapterize audiobook files such as mp3, ogg, flac, m4a or m4b
MIT License

[Question] Are there any conditions where combine or merging would change audio play durations? #224

Closed kanjieater closed 1 year ago

kanjieater commented 1 year ago

As I mentioned in another issue, I'm syncing text to audiobooks. One of the fun challenges with this is combining subtitle files: the second file has to be offset by the first file's total play duration.

I'm getting the duration from ffmpeg-python's ffprobe wrapper: https://github.com/kanjieater/SubGen/blob/c54738780b5acf95ecf7799b3cd21b08271b68e5/split_run.py#L107
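The offset bookkeeping itself is just a cumulative sum of the preceding files' durations. A minimal sketch (the function name is mine):

```python
def cumulative_offsets(durations):
    """Offset (in seconds) to add to each file's subtitle timestamps
    when the files play back-to-back in the merged audiobook."""
    offsets, total = [], 0.0
    for d in durations:
        offsets.append(total)
        total += d
    return offsets

# e.g. three tracks of 10.5 s, 20.25 s and 5 s:
print(cumulative_offsets([10.5, 20.25, 5.0]))  # [0.0, 10.5, 30.75]
```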

My current use of merge: docker run -it --rm -u $(id -u):$(id -g) -v "$FOLDER":/mnt sandreas/m4b-tool:latest merge "./$(echo $NAME)_splitted/" --output-file="./$NAME.m4b"

Split: docker run -it --rm -u $(id -u):$(id -g) -v "$FOLDER":/mnt sandreas/m4b-tool:latest split ./$INPUT

My question is:

Are there any conditions where split or merge would automatically change audio play durations?

I'm not using any parameters that I think would do this, nor do I have a reason to think this is happening, but I wanted to double-check.

As an aside, if you have any other thoughts or concerns with using an approach like this to combine subtitles I'm certainly open to them.

sandreas commented 1 year ago

Are there any conditions where split or merge would automatically change audio play durations?

It's not that easy. Normally the duration won't change - and it should not change SIGNIFICANTLY at all. Unfortunately, ffmpeg and ffprobe are known to be inaccurate under specific circumstances (rounding ms to the last digit, e.g. 15342 ms becomes 15.340 s in ffmpeg).

That was one of the main reasons I wrote tone - which is currently the most accurate tool I know regarding length detection (at least for mp3 and m4b).

So, to answer your question: it is possible that the duration changes in the range of 1 to 500 ms per track, depending on how accurate the detection mechanism is. So if you are calculating durations via ffprobe, it is pretty likely that the shift times of the individual tracks sum up from < 10 ms at the beginning to more than 500 ms at the end of the final merge.
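To illustrate how those per-track errors add up (the error values are made up for the example):

```python
# Hypothetical per-track probe errors in ms (each reported duration a bit short):
errors_ms = [3, 1, 4, 2, 5, 3, 2, 4, 1, 3]

# Offsets computed by summing reported durations inherit the summed error,
# so the shift grows toward the end of the book:
shift_ms, total = [], 0
for e in errors_ms:
    total += e
    shift_ms.append(total)

print(shift_ms[0], shift_ms[-1])  # 3 28
```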

m4b-tool tries to prevent these shift times by FIRST converting the single tracks (e.g. from mp3 to m4a), THEN using the duration of the temporary m4a files to build the chapters, and FINALLY concatenating all m4a files into one single m4b, using the m4a lengths for the chapter times. That works pretty well.
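The difference between measuring before and after conversion can be simulated (all numbers below are made up; conversion jitter in reality depends on codec frame sizes):

```python
import random

random.seed(42)

# Simulated source lengths (s) and the tiny length change conversion can cause:
mp3_len = [random.uniform(200, 400) for _ in range(20)]
m4a_len = [l + random.uniform(-0.005, 0.005) for l in mp3_len]  # +/- 5 ms jitter

# Naive: chapter starts computed from the SOURCE (mp3) lengths
starts_naive = [sum(mp3_len[:i]) for i in range(len(mp3_len))]
# Convert-first: chapter starts computed from the CONVERTED (m4a) lengths
starts_converted = [sum(m4a_len[:i]) for i in range(len(m4a_len))]

# The merged m4b actually contains the converted audio, so only the
# converted-length starts line up; the naive ones drift by the summed jitter:
drift = abs(starts_naive[-1] - starts_converted[-1])
```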

kanjieater commented 1 year ago

This is extremely useful info for me, thank you. There are a few things I'd like to ask about the points you brought up.

Are there any conditions where split or merge would automatically change audio play durations?

  1. I guess to answer this question: m4b-tool does not intentionally add or remove duration from files, but does what it can to keep things as accurate as possible by default. Does that sound right?

That was one of the main reasons I wrote tone - which is currently the most accurate tool I know regarding length detection (at least for mp3 and m4b).

  2. How does tone accomplish this?

FINALLY concatenating all m4a files to one single m4b using the length from m4a for chapter times.

  3. What makes this better than just probing each file individually? Is m4a duration more accurate than mp3 duration?

  4. For the purpose of combining subs, I was thinking about extracting chapter information after m4b-tool creates it. I'm not sure how to do that just yet using m4b-tool. Is there a command that allows for that?

  5. For 4, couldn't this be inaccurate if the user provided timestamps that weren't simply from split files? Or additional chapters, for instance from musicbrainz? In which case chapters aren't as reliable for combining subs as looking at the individual mp3/split m4b files would be, right?

UPDATE: After thinking about this a bit more, it probably makes sense to have some tool go over the entire sub file one last time, matching it against the full audiobook, to actually correct the potential shift. I'll add a 6th question here:

  6. Given that ffprobe truncates at millisecond-level accuracy, the worst you could "miss" by is 1 millisecond (if a timestamp was 4366.394000 but in reality 4366.394999). Wouldn't this mean that for each file you could at most be off by ~1 millisecond? So for 5 files, 5 ms worst case, and for 50 files, 50 ms worst case?
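To sanity-check that arithmetic (a quick sketch; the helper name is mine, and it assumes pure truncation, i.e. every error has the same sign):

```python
import math

# Millisecond truncation: the reported value is always <= the true value.
t_true = 4366.394999
t_reported = math.floor(t_true * 1000) / 1000
print(t_reported)  # 4366.394

# Same-sign errors add up linearly, so the worst case scales with file count:
def worst_case_shift_ms(n_files, per_file_ms=1):
    return n_files * per_file_ms

print(worst_case_shift_ms(50))  # 50
```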
sandreas commented 1 year ago

I guess to answer this question: m4b-tool does not intentionally add or remove duration from files, but does what it can to keep things as accurate as possible by default. Does that sound right?

Mostly. Converting from mp3 to other formats may change the length by a fraction (a few ms) because of changes in sampling rate, number of frames or other technical internals.

How does tone accomplish this?

It uses atldotnet, a library from a developer with years of experience.

What makes this better than just probing each file individually? Is M4A duration more accurate than mp3 duration?

No, it DOES probe every file individually, but AFTER the conversion from mp3 to m4a. Using the detected length of the mp3, then converting and concatenating, resulted in audio shift.

For the purpose of combining subs, I was thinking about extracting chapter information after m4b creates it. I'm not sure how to do that just yet using m4b-tool. Is there a command that allows for that?

You could just use

m4b-tool meta audiobook.m4b --export-chapters=chapters.txt

but I really would recommend using tone, it's just more accurate

tone dump --format=chptfmtnative audiobook.m4b
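If you then need the chapter starts in Python, parsing the exported list is straightforward. A hedged sketch, assuming the mp4chaps-style `HH:MM:SS.mmm Title` lines these exports use (the function name is mine):

```python
import re

def parse_chapters(text):
    """Parse mp4chaps-style chapter lines like '00:15:42.500 Chapter 1'
    into (start_seconds, title) tuples."""
    chapters = []
    for line in text.strip().splitlines():
        m = re.match(r"(\d+):(\d{2}):(\d{2})\.(\d{1,3})\s+(.*)", line)
        if m:
            h, mi, s, ms, title = m.groups()
            start = int(h) * 3600 + int(mi) * 60 + int(s) + int(ms.ljust(3, "0")) / 1000
            chapters.append((start, title))
    return chapters

print(parse_chapters("00:00:00.000 Intro\n00:15:42.500 Chapter 1"))
# [(0.0, 'Intro'), (942.5, 'Chapter 1')]
```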

For 4, couldn't this be inaccurate if the user provided timestamps that weren't simply from split files? Or additional chapters, for instance from musicbrainz? In which case chapters aren't as reliable for combining subs as looking at the individual mp3/split m4b files would be, right?

The inaccuracy comes from converting between formats / bitrates, etc. Concatenating files does not change lengths. m4b-tool does very much black magic voodoo to get chapters right, even if the timestamps are not 100% accurate (e.g. musicbrainz). It does silence detection, timestamp comparison and other guessing stuff. If you would like to rebuild m4b-tool's behaviour, you have a LOT of work to do :-) And explaining it all in detail here would just be too much.

So here is what I would do:

This should be pretty accurate, because the shift does not sum up, but is readjusted on every track by comparing the original length and the converted length of a track.
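The per-track readjustment idea could be sketched like this (all names are made up; this is the concept, not m4b-tool's actual code):

```python
def readjust_track(sub_times, source_len, converted_len, track_offset):
    """Rescale one track's subtitle timestamps by the ratio of converted
    to source length, then shift by the track's start offset in the
    merged file. Because each track is corrected individually, the
    error does not accumulate across tracks."""
    scale = converted_len / source_len
    return [track_offset + t * scale for t in sub_times]

# A 600 s track that came out 600.3 s after conversion, starting at 1200 s
# into the merged audiobook:
print(readjust_track([0.0, 300.0], 600.0, 600.3, 1200.0))
```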

kanjieater commented 1 year ago

That sounds feasible. Thank you so much for all those details. I added this to my last post, but I'll post it again explicitly:

After thinking about this a bit more, it probably makes sense to have some tool go over the entire sub file one last time, matching it against the full audiobook, to actually correct the potential shift.

I'll add a 6th question here:

Given that ffprobe truncates at millisecond-level accuracy, the worst you could "miss" by is 1 millisecond (if a timestamp was 4366.394000 but in reality 4366.394999). Wouldn't this mean that for each file you could at most be off by 0.000999 truncated, so ~1 millisecond? So for 5 files, 5 ms worst case, and for 50 files, 50 ms worst case?

sandreas commented 1 year ago

Wouldn't this mean that for each file you could at most be off by 0.000999 truncated, so ~1 millisecond? So for 5 files, 5 ms worst case, and for 50 files, 50 ms worst case?

1 millisecond accuracy is enough. But ffmpeg / ffprobe are not accurate enough in DETECTING duration, in my opinion. SPLITTING works 100% accurately, but the duration printed on the command line MAY be inaccurate (not MUST).

I personally would prefer JSON output, if I were you - it's way easier to parse with Python:

ffprobe -v quiet -print_format json -show_format -show_streams "audiobook.mp4" > "audiobook.mp4.json"
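Reading the duration back out of that JSON is then trivial (the sample below is a trimmed, made-up stand-in for real ffprobe output; note ffprobe reports duration as a string):

```python
import json

# Trimmed sample of ffprobe's -print_format json output (assumption: real
# output has many more fields, but format.duration is the one we need):
sample = '{"format": {"filename": "audiobook.mp4", "duration": "4366.394000"}}'
duration = float(json.loads(sample)["format"]["duration"])
print(duration)  # 4366.394
```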

But hey, why use Python when you can use tone? :-) Just use it, write the code in C# and submit a PR?! :-) There seems to be an srt parser (https://github.com/AlexPoint/SubtitlesParser) and the chapters are already present in track.Chapters (and 100% accurate).

I would go for the tag command and write a SrtTagger (similar to https://github.com/sandreas/tone/blob/main/tone/Metadata/Taggers/ChptFmtNativeTagger.cs).

Or, if you don't wanna use C#, you could also go for a custom JavaScript tagger:

function srt(metadata, parameters) {
  // write your code here
}

// register your function name as tagger
tone.RegisterTagger("srt");

Then run it like this:

tone tag "harry-potter-1.m4b" --taggers="srt" --script="srt.js"
kanjieater commented 1 year ago

But hey, why using python, when you can use tone? :-)

The ffmpeg-python library lets you access ffmpeg like a library, so it's pretty convenient already. The other libraries I'm using for AI speech-to-text - Whisper and all of its wrappers and forks - are in Python. In addition, the fancy dynamic-programming algorithm a friend wrote for matching transcripts of vtt files to "rough" matches is also in Python. Basically everything was already in Python and working well, so it made sense. For now it's good enough, but I'll keep this in mind to do some testing in the future. I'll close for now - thanks for the help!