readbeyond / aeneas

aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
http://www.readbeyond.it/aeneas/
GNU Affero General Public License v3.0

Is it possible to detect when a correct alignment is not possible? #302

Open zxul767 opened 1 year ago

zxul767 commented 1 year ago

I'm exploring ways to gauge whether a transcription algorithm did a good job when no supervision is available (i.e., no annotated dataset).

It occurred to me that one way to do this might be to compute some kind of reconstruction score in the audio domain while doing the forced alignment:

(audio) --> [transcribe] --> (text) --> [force-align] --> (alignment score)
 |                                        ^
 |                                        |
 +----------------------------------------+

Not being too familiar with the implementation of aeneas, I tried testing what would happen if I passed a completely erroneous transcription, but I didn't see an error in the output or anything in the resulting alignment that would help me detect automatically that the transcription was really bad.

Having read how the underlying algorithm works, I suspect this is because the alignment is constrained to a small region along the diagonal of the cost matrix, so even a completely erroneous transcription would result in an alignment that appears reasonable (at least until a human has a look and realizes the transcription is totally wrong).

I was wondering if there's any simple way to modify the algorithm to detect this case? I suspect that it might be possible if we somehow quantified how often the alignment happens on the "fringe" of the diagonal's margin, but I'm not sufficiently familiar with DTW to know if this would actually be a good idea.
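To make the "fringe" idea concrete, here is a minimal sketch of what I have in mind. It is not aeneas code and does not use its API: `banded_dtw` and `fringe_fraction` are hypothetical helpers, the features are toy 1-D values rather than MFCC frames, both sequences are assumed to have the same length, and the band is a plain `|i - j| <= margin` constraint.

```python
import math


def banded_dtw(x, y, margin):
    """DTW between equal-length 1-D sequences, restricted to |i - j| <= margin.

    Returns (accumulated_cost, path), where path is a list of (i, j) cells.
    """
    n = len(x)
    assert len(y) == n, "this sketch assumes equal-length sequences"
    inf = math.inf
    acc = [[inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if abs(i - j) > margin:
                continue  # cell lies outside the band
            d = abs(x[i] - y[j])  # local cost (toy stand-in for a frame distance)
            if i == 0 and j == 0:
                acc[i][j] = d
                continue
            best = min(
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
            )
            acc[i][j] = d + best

    # Backtrack from the end, always moving to the cheapest predecessor.
    i = j = n - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][n - 1], path


def fringe_fraction(path, margin):
    """Fraction of path cells that sit exactly on the edge of the band."""
    on_edge = sum(1 for i, j in path if abs(i - j) == margin)
    return on_edge / len(path)


MARGIN = 3
reference = [0] * 3 + [1] * 17   # step near the start
matching = list(reference)       # same content
shifted = [0] * 17 + [1] * 3     # step near the end, far outside the band

for name, other in [("matching", matching), ("shifted", shifted)]:
    cost, path = banded_dtw(reference, other, MARGIN)
    print(name, "cost per step:", cost / len(path),
          "fringe fraction:", fringe_fraction(path, MARGIN))
```

In this toy case the matching pair follows the diagonal exactly, while the shifted pair is pushed against the band edge for a large part of the path and accumulates cost there. I would expect a high fringe fraction to say only that the band constraint is active, not that the text is wrong: unrelated content could still align close to the diagonal, so the per-step cost probably needs to be looked at as well.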

Your guidance and help are much appreciated.