Closed — madhavsund closed this issue 5 years ago
This is probably due to the segmentation performed by the LIUM diarization system. It is by no means perfect; in fact, the models it uses were trained on French broadcast news, so it's a small miracle that it works for other languages :-) One thing you could try to change (I won't go as far as to say improve) segmentation results is to specify a different segmentation strategy in `Makefile.options`.
We prefer to use the `show.s.seg` strategy, which produces the largest number of smallest segments. The default setting, `show.seg`, does more work (takes longer) to cluster utterances by speaker and join together adjacent utterances.
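For reference, the strategy is selected in `Makefile.options`; the exact variable name depends on your version of the file, so treat this as an illustrative sketch rather than the literal setting:

```make
# Illustrative only -- check your own Makefile.options for the real variable name.
# Choose which LIUM segmentation output the pipeline uses downstream:
#   show.seg   - default: clusters utterances by speaker and merges adjacent ones (slower)
#   show.s.seg - many small segments, no speaker clustering (faster)
SEGMENT_STRATEGY = show.s.seg
```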
Beyond this, the best approach is to provide your own segmentation, if you have a way to do so, but this is usually unavailable. If you have segments in STM format, you can use them with the control script `run-segmented.sh`.
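For anyone unfamiliar with STM: each line carries a file ID, channel, speaker, begin/end times in seconds, an optional `<...>` category label, and the transcript. A small sketch of reading one line (the helper name and example values are mine, not part of the transcriber):

```python
# Parse one STM line: <file> <channel> <speaker> <begin> <end> [<label>] transcript
def parse_stm_line(line):
    parts = line.split()
    file_id, channel, speaker = parts[0], parts[1], parts[2]
    begin, end = float(parts[3]), float(parts[4])
    rest = parts[5:]
    label = None
    if rest and rest[0].startswith("<"):  # optional category label, e.g. <o,f0,unknown>
        label = rest[0]
        rest = rest[1:]
    return {"file": file_id, "channel": channel, "speaker": speaker,
            "begin": begin, "end": end, "label": label,
            "text": " ".join(rest)}
```

The begin/end times are what `run-segmented.sh` needs to cut the audio into utterances.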
We are working on improving segmentation with systems other than LIUM, and will announce when something becomes available that does a better job (although most likely it will be tuned for English).
My goal is just to recognize the audio content and generate subtitles for the video. So, can the speaker diarization be skipped?
The shared references seem interesting, in that the transcriptions are completely different. I'm not sure how the system is even capable of doing that, unless perhaps two files with the same name were transcribed, or one file was transcribed partially, produced errors, and was then transcribed again, leaving different partial result data files behind.
In any case, there is definitely a source of rounding error in going from the 3-decimal-place precision of the `.hyp` (hypothesis) files to 2 decimal places in the `.trs` (transcription) file, created by the Python program `scripts/hyp2trs.py`. If you find out more details, please share them back here to help figure this out.
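To put a bound on that rounding error, here is a small sketch (the times are made up): rounding a 3-decimal time to 2 decimals can shift each segment boundary by at most about 5 ms, so rounding alone should not explain a large subtitle offset.

```python
# The .trs writer reduces .hyp times from 3 decimal places to 2, so each
# segment boundary can shift by up to ~5 ms (example values are made up).
hyp_times = [0.123, 5.678, 33.999]            # as they appear in a .hyp file
trs_times = [round(t, 2) for t in hyp_times]  # as written to the .trs file
errors = [abs(h, ) if False else abs(h - t) for h, t in zip(hyp_times, trs_times)]
print(max(errors))  # worst-case boundary shift, bounded by about 0.005 s
```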
Oops, I meant to say: to generate automatic subtitles, you will still need to break the audio into utterances somehow, since you want the subtitles to appear in a time-ordered manner, with gaps where the text on screen remains for a while, then goes away and is replaced by new text. This requires segmentation, but not diarization. Otherwise you'll end up with one very long screen filled with the entire text of the video — not desirable! :)
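Concretely, segment start/end times plus the recognized text are all you need to emit SRT entries; a minimal sketch (the segment list and helper names are mine, not part of the transcriber):

```python
def srt_timestamp(seconds):
    """Format seconds as the SRT HH:MM:SS,mmm timestamp."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """segments: list of (start_seconds, end_seconds, text) tuples."""
    entries = []
    for i, (start, end, text) in enumerate(segments, 1):
        entries.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(entries)
```

Gaps between segments simply become spans with no subtitle on screen, which is exactly the behavior you want.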
I tried with my own acoustic model. When I run `vids2web.sh`, the recognized content is correct, but the generated .srt file has some timing differences and some misplaced content.
trans file content
srt file content
I've attached the files for reference:
anjali.srt.txt anjali.trs.txt anjali.txt
What could be the reason?