I'm exploring possibilities on how to gauge whether a transcription algorithm did a good job when we have no supervision available (i.e., no annotated dataset).
It occurred to me that one way to do this might be to compute some kind of reconstruction score in the audio domain while doing the forced alignment.
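To make the idea concrete, here is a minimal sketch of the kind of reconstruction score I have in mind: since the aligner compares MFCCs of the real audio against MFCCs of TTS-synthesized speech, the mean per-frame distance along the alignment path could serve as a score. (This is purely illustrative; the variable names and the way the path is obtained are my assumptions, not part of aeneas' API.)

```python
import numpy as np

def alignment_cost(mfcc_audio, mfcc_tts, path):
    """Mean per-frame MFCC distance along a DTW alignment path.

    mfcc_audio : (n_frames_audio, n_coeffs) MFCCs of the real audio
    mfcc_tts   : (n_frames_tts, n_coeffs) MFCCs of the synthesized text
    path       : list of (i, j) frame-index pairs from the alignment

    A low value would suggest the transcription "reconstructs" the
    audio well; a high value might flag a bad transcription.
    """
    costs = [np.linalg.norm(mfcc_audio[i] - mfcc_tts[j]) for i, j in path]
    return float(np.mean(costs))
```

The threshold separating "good" from "bad" would presumably have to be calibrated per voice/TTS engine, which is part of what I'm unsure about.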
Not being too familiar with the implementation of aeneas, I tested what would happen if I passed a completely erroneous transcription, but I saw no error in the output, nor anything in the resulting alignment that would let me detect automatically that the transcription was bad.
After having read how the underlying algorithm works, I suspect this is because the alignment is bounded to a small region along the diagonal of the cost matrix, so even a completely erroneous transcription would result in an alignment that appears reasonable (at least until a human has a look and realizes the transcription is totally wrong).
Is there any simple way to modify the algorithm to detect this case? I suspect it might be possible by quantifying how often the alignment path lands on the "fringe" of the diagonal's margin, but I'm not sufficiently familiar with DTW to know whether this would actually work.
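Here is a rough sketch of the fringe metric I'm imagining, assuming a band centered on the main diagonal of a roughly square cost matrix (the `radius` and `tol` parameters, and how the path is obtained, are my assumptions):

```python
def fringe_fraction(path, radius, tol=1):
    """Fraction of DTW path points lying within `tol` cells of the
    band boundary. A high value would suggest the path is being forced
    along the band edge, i.e. the transcription may not match the audio.

    path   : list of (i, j) index pairs from the DTW alignment
    radius : half-width of the diagonal band used during alignment
    tol    : how close to the boundary counts as "on the fringe"
    """
    on_fringe = 0
    for i, j in path:
        # offset of this cell from the band's diagonal center
        offset = abs(i - j)
        if offset >= radius - tol:
            on_fringe += 1
    return on_fringe / len(path)

# Example: a path that hugs the upper edge of a band of radius 3
path = [(0, 0), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]
fringe_fraction(path, radius=3)  # -> 5/6, most points sit at offset 2
```

Would thresholding a metric like this be a sensible way to flag bad transcriptions, or is there a more principled approach?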
Your guidance and help are much appreciated.