Open woojayjeon opened 7 months ago
In the kaldi format transcription files, it seems that sometimes there are insertions of numbers that are not present in the audio. For example:
medium/2149/mark_wnt_mp_0808_librivox_64kb_mp3/mark_2_weymouth_64kb_39 008 028 JOHN THE BAPTIST THEY REPLIED BUT OTHERS SAY ELIJAH AND OTHERS THAT IT IS ONE OF THE PROPHETS 008 029 THEN HE ASKED THEM POINTEDLY BUT YOU YOURSELVES WHO DO YOU SAY THAT I AM
Above, 008 028 and 008 029 seem to be Bible verse numbers that are in the original text but not actually read. I have verified this by listening to the audio sample.
Hmm, that's possible. We will check whether there is a bug in the alignment tools.
They are all over the place. I think these are page numbers or something similar: sometimes they are even Roman numerals, sometimes they are in brackets, and sometimes they appear as-is.
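For anyone who wants to locate these tokens in their own copy of the transcripts, here is a rough sketch of a flagging heuristic. It is not part of the dataset tooling; the function name `suspicious_tokens` and both patterns are my own assumptions, and it only marks candidates for review rather than deciding what was actually spoken.

```python
import re

# Hypothetical patterns: digit-only tokens and Roman numerals,
# each optionally wrapped in brackets or parentheses.
DIGITS = re.compile(r"^[\[\(]?\d+[\]\)]?$")
ROMAN = re.compile(
    r"^[\[\(]?(?=[IVXLCDM])"
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
    r"[\]\)]?$"
)

def suspicious_tokens(text):
    """Return (position, token) pairs that look like verse or page numbers."""
    flagged = []
    for pos, tok in enumerate(text.split()):
        if DIGITS.match(tok):
            flagged.append((pos, tok))
        elif ROMAN.match(tok) and tok.strip("[]()") != "I":
            # The pronoun "I" is skipped; real words such as "MIX" are
            # still valid Roman numerals and will be false positives.
            flagged.append((pos, tok))
    return flagged

print(suspicious_tokens("008 028 JOHN THE BAPTIST THEY REPLIED"))
print(suspicious_tokens("CHAPTER XIV [12] I AM"))
```

Spoken numbers are normally written out as words in the transcript, so a bare digit token is a fairly strong signal. The Roman-numeral check is much weaker and should be treated as a hint only.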
Hmm, we filter the segments according to the Levenshtein distance between the original text and the transcript text. When the segment is long, this kind of insertion may not noticeably affect the overall distance; that is, the distance is still below the given threshold. We have not yet figured out how to fix this bug.
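To illustrate why a distance threshold lets these insertions through, here is a minimal sketch using the segment from this issue. It assumes a word-level Levenshtein distance normalized by the length of the original text and a threshold of 0.2; both are hypothetical, since the actual normalization and threshold used in the pipeline are not stated here.

```python
def word_edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # drop a token from ref
                            curr[j - 1] + 1,      # add a token from hyp
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def normalized_distance(ref, hyp):
    """Distance divided by the reference length (assumed normalization)."""
    return word_edit_distance(ref, hyp) / max(len(ref), 1)

THRESHOLD = 0.2  # hypothetical value, for illustration only

# Original text (with verse numbers) versus what is actually read.
book_text = ("008 028 JOHN THE BAPTIST THEY REPLIED BUT OTHERS SAY ELIJAH "
             "AND OTHERS THAT IT IS ONE OF THE PROPHETS 008 029 THEN HE "
             "ASKED THEM POINTEDLY BUT YOU YOURSELVES WHO DO YOU SAY THAT "
             "I AM").split()
spoken = [tok for tok in book_text if not tok.isdigit()]

long_score = normalized_distance(book_text, spoken)

# A made-up short segment with the same kind of insertion.
short_score = normalized_distance("008 029 WHO AM I".split(),
                                  "WHO AM I".split())

print(long_score < THRESHOLD, short_score < THRESHOLD)
```

Under these assumptions, the long segment has 4 extra tokens out of 37, a normalized distance of about 0.11, so it passes the filter. The short segment has 2 extra tokens out of 5, a distance of 0.4, so it is rejected. One possible mitigation would be to strip digit-only tokens from the original text before alignment, though that would also drop any numbers that really are read aloud as digits.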