k2-fsa / libriheavy

Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context
Apache License 2.0

Insertion of numbers in transcriptions #7

Open woojayjeon opened 7 months ago

woojayjeon commented 7 months ago

In the kaldi format transcription files, it seems that sometimes there are insertions of numbers that are not present in the audio. For example:

medium/2149/mark_wnt_mp_0808_librivox_64kb_mp3/mark_2_weymouth_64kb_39 
008 028 JOHN THE BAPTIST THEY REPLIED BUT OTHERS SAY ELIJAH AND OTHERS THAT IT IS 
ONE OF THE PROPHETS 008 029 THEN HE ASKED THEM POINTEDLY BUT YOU YOURSELVES 
WHO DO YOU SAY THAT I AM

Above, 008 028 and 008 029 seem to be Bible verse numbers that are in the original text but not actually read. I have verified this by listening to the audio sample.

pkufool commented 6 months ago

Hmm, possible. I will check whether there is a bug in the alignment tools.

ex3ndr commented 2 months ago

They are all over the place. I think these are page numbers or something similar: sometimes they are even Roman numerals, sometimes they are in brackets, and sometimes they appear as-is.
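
For anyone who wants to estimate how widespread this is, here is a rough sketch of a scan for the kinds of tokens described above. The patterns and the `suspect_tokens` helper are hypothetical and are not part of the Libriheavy tooling. They will also flag numbers that really were read aloud, so the hits are only candidates for manual review.

```python
import re

# Hypothetical patterns for illustration; not taken from the Libriheavy pipeline.
SUSPECT = re.compile(
    r"""
    [\[\(]\s*(?:\d+|[IVXLCDM]+)\s*[\]\)]   # bracketed arabic or roman numeral
    | \b\d+\b                              # bare digit token such as 008
    """,
    re.VERBOSE,
)


def suspect_tokens(text):
    """Return every substring that looks like a verse or page marker."""
    return [m.group(0) for m in SUSPECT.finditer(text)]


kaldi_line = "008 028 JOHN THE BAPTIST THEY REPLIED BUT OTHERS SAY ELIJAH"
cased_line = "He paused (IV) and then went on [12] as before."

print(suspect_tokens(kaldi_line))
print(suspect_tokens(cased_line))
```

Unbracketed Roman numerals are deliberately left out, since tokens like `I` are ordinary words and would produce too many false hits.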

pkufool commented 2 months ago

Hmm. We filter segments by the Levenshtein distance between the original text and the transcript text. When a segment is long, insertions like these may barely change the overall distance, so it stays below the given threshold. We have not yet figured out how to fix this bug.
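
To illustrate the effect described above, here is a minimal sketch. It assumes the filter uses a word-level Levenshtein distance normalized by the length of the original text; the actual normalization and threshold in the pipeline may differ. The `passes_filter` helper and the `0.2` threshold are made up for this example.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a token from ref
                            curr[j - 1] + 1,      # insert a token from hyp
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def passes_filter(book_text, spoken_text, threshold=0.2):
    """Assumed filter: keep the segment if distance / len(book) <= threshold."""
    ref = book_text.split()
    hyp = spoken_text.split()
    ratio = edit_distance(ref, hyp) / max(len(ref), 1)
    return ratio <= threshold, ratio


# Long segment from the report: four unread tokens among 37.
long_book = ("008 028 JOHN THE BAPTIST THEY REPLIED BUT OTHERS SAY ELIJAH AND "
             "OTHERS THAT IT IS ONE OF THE PROPHETS 008 029 THEN HE ASKED THEM "
             "POINTEDLY BUT YOU YOURSELVES WHO DO YOU SAY THAT I AM")
long_spoken = " ".join(t for t in long_book.split() if not t.isdigit())

# Short segment: two unread tokens among 7.
short_book = "008 029 THEN HE ASKED THEM POINTEDLY"
short_spoken = "THEN HE ASKED THEM POINTEDLY"

print(passes_filter(long_book, long_spoken))    # kept: ratio is 4/37
print(passes_filter(short_book, short_spoken))  # rejected: ratio is 2/7
```

Under these assumptions the long segment is kept even though it contains twice as many unread tokens as the short one, which is rejected.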