protyposis / AudioAlign

Audio Synchronization and Analysis Tool
GNU Affero General Public License v3.0

Trim silence? #26

Open PRiiXX opened 3 months ago

PRiiXX commented 3 months ago

Hi,

I'm currently trying to synchronize two audio tracks but am running into some issues. The two tracks are essentially the same (they are the audio extracted from two videos), with one exception: at scene/segment changes (3-4 times within 24 minutes) there's about a second of silence, and its length differs between the tracks, e.g. track 1 has 800 ms of silence while track 2 has about 1600 ms. I haven't gotten AudioAlign to work in this specific scenario; maybe I'm just too dumb, or this kind of thing isn't supported.

What I would like to achieve is to automatically trim each stretch of silence in track 2 so that it matches the length of the corresponding silence in track 1. That way both tracks would match almost perfectly. Is there any workaround for this? I've tried nearly everything so far without any results. The best I could do was set the alignment mode to "mid", but that only aligns the middle part of the audio.

Would love to know if that's just not possible or I'm indeed stupid.

MarcoRavich commented 2 months ago

Hi there, I would suggest losslessly cutting the silence out of track 2; this should solve the issue.

@mifi's LosslessCut is a great tool to do it.
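
If you'd rather script the cut than do it by hand, the idea can be sketched roughly like this in plain Python. This is only an illustration, not something AudioAlign or LosslessCut provides: `find_silences` and `match_silences` are hypothetical helper names, the samples are assumed to be already decoded into a list of floats, and the threshold and minimum length are made-up defaults that would need tuning. It also replaces each gap with digital zeros, which assumes the silence really is (near) silent.

```python
from typing import List, Sequence, Tuple


def find_silences(samples: Sequence[float], threshold: float,
                  min_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) pairs (end exclusive) of runs where
    |sample| <= threshold for at least min_len consecutive samples."""
    runs = []
    start = None
    for i, s in enumerate(samples):
        if abs(s) <= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # A silent run may extend to the very end of the track.
    if start is not None and len(samples) - start >= min_len:
        runs.append((start, len(samples)))
    return runs


def match_silences(reference: Sequence[float], target: Sequence[float],
                   threshold: float = 0.001, min_len: int = 4) -> List[float]:
    """Rebuild target so that its n-th silent gap has the same length
    as the n-th silent gap in reference (hypothetical helper)."""
    ref_runs = find_silences(reference, threshold, min_len)
    tgt_runs = find_silences(target, threshold, min_len)
    if len(ref_runs) != len(tgt_runs):
        raise ValueError("tracks have a different number of silent gaps")
    out: List[float] = []
    pos = 0
    for (ref_start, ref_end), (tgt_start, tgt_end) in zip(ref_runs, tgt_runs):
        out.extend(target[pos:tgt_start])           # audio before the gap, untouched
        out.extend([0.0] * (ref_end - ref_start))   # gap with the reference length
        pos = tgt_end
    out.extend(target[pos:])                        # audio after the last gap
    return out
```

For real audio, `min_len` would be given in samples (e.g. sample rate times 0.5 for half a second), and the gaps in both tracks have to appear in the same order for the pairing to make sense.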

Hope that helps.

protyposis commented 1 month ago

Hi! Yes, this is possible (with a limitation *). You basically need to use the alignment mode "all" and have fine-grained matching points that are as close to the start and end of each silence as possible (to keep the original audio quality around the edges). The best way to get fine-grained matching points in your case is probably to use Dynamic Time Warping to correlate your two audio tracks. See this demo video:

https://github.com/protyposis/AudioAlign/assets/189372/938b293d-20f7-4694-a693-277d0201817d

(*) If the matching points in the actual audio aren't perfectly synchronized, e.g., when generated with fingerprints, AudioAlign will resample the whole audio track, which slightly degrades the audio quality (because it's not optimized for this use case yet). Dynamic Time Warping avoids this because it yields perfectly synchronized matching points wherever the audio is similar in both tracks, so only the silence is resampled.
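
To give a rough idea of why Dynamic Time Warping yields such matching points, here is a minimal textbook sketch in plain Python. It is not AudioAlign's implementation: `dtw_path` is a hypothetical name, the inputs are assumed to be short sequences of per-frame values (for example frame energies) rather than raw samples, and the full cost table makes it impractical for long tracks without further optimization.

```python
from typing import List, Sequence, Tuple


def dtw_path(a: Sequence[float], b: Sequence[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Classic O(len(a) * len(b)) DTW. Returns the total alignment cost
    and the warping path as (index_in_a, index_in_b) pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest way to align a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance in both tracks
                                 cost[i - 1][j],      # advance only in a
                                 cost[i][j - 1])      # advance only in b
    # Walk back from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Two toy "tracks": same content, but the gap (zeros) is longer in track_b.
track_a = [1.0, 2.0, 0.0, 0.0, 3.0]
track_b = [1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 3.0]
total_cost, path = dtw_path(track_a, track_b)
```

In this toy example the non-silent frames pair up one-to-one and only the zeros are stretched, which corresponds to the behavior described above: identical audio gets exact matching points and the warping is confined to the silence.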