srvk / eesen-transcriber

EESEN based offline transcriber VM using models trained on TEDLIUM and Cantab Research
Apache License 2.0

srt file generation #20

Closed madhavsund closed 5 years ago

madhavsund commented 7 years ago

I tried this with my own acoustic model. When I run vids2web.sh, the recognized content is correct, but the generated srt file has some time differences and some content misplacement.

trans file content

```xml
<Sync time="0.01"/>
pustakam' vaang'ng'i nj'aayar'aazcha raatri jammuvile do'da udham'pur~ jillakal'il~ bhiikarar~ cheytat
</Turn>

<Turn speaker="S2" startTime="7.06" endTime="10.59">
<Sync time="7.06"/>
avan~ malayaal'i tanneyaa ning'ng'al'~
</Turn>
<Turn speaker="S3" startTime="10.59" endTime="20.65">
<Sync time="10.59"/>
po'yatil~pinne ivit'e onnum' sam'bhavichchilla avan~ ur'akke karanj'nj'u enikku kur'achchu vel'l'am' taruu enikku naal'e varaan~ kaziyilla un't'
</Turn>
```

srt file content

```
1
00:00:00.020 --> 00:00:07.070
pustakam' vaang'ng'i nj'aayar'aazcha raatri

2
00:00:07.070 --> 00:00:10.590
avan~ jammuvile

3
00:00:10.590 --> 00:00:20.660
po'yatil pinne do'da malayaal'i tanneyaa ivit'e udham'pur~ jillakal'il~ onnum' sam'bhavichchilla ning'ng'al'~ bhiikarar~ cheytat
```

Attached the files for reference:

anjali.srt.txt anjali.trs.txt anjali.txt

What could be the reason?

riebling commented 7 years ago

This is probably due to the segmentation performed by the LIUM diarization system. It is by no means perfect, and in fact the models it uses were trained on French broadcast news, so it's a small miracle that it works for other languages at all :-) One thing you could try to change (I won't go so far as to say improve) the segmentation results is to specify a different segmentation strategy in Makefile.options.

We prefer the show.s.seg strategy, which produces the largest number of smallest segments. The default setting, show.seg, does more work (takes longer) to cluster utterances by speaker and join adjacent utterances together.
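For example, the relevant line in Makefile.options might look something like the sketch below; note that the variable name is only a placeholder for illustration, so check your copy of the file for the actual setting:

```make
# Makefile.options (sketch only -- SEGMENT_STRATEGY is a placeholder name,
# not necessarily the real variable in this repository)

# default: LIUM's speaker-clustered output, slower but merges adjacent
# utterances from the same speaker
SEGMENT_STRATEGY = show.seg

# alternative: many short segments, which we tend to prefer for subtitles
#SEGMENT_STRATEGY = show.s.seg
```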

Beyond this, the best approach is to provide your own segmentation if you have a way to produce one, but this is usually not available. If you have segments in STM format, you can use them with the control script run-segmented.sh; a small example is sketched below.
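For reference, an STM file is plain text with one segment per line (waveform id, channel, speaker, start time, end time, optional label, transcript). The example below is hand-made to match the times in your .trs excerpt; the waveform id, speaker labels, and dummy transcript field are just placeholders, and run-segmented.sh may only care about a subset of these fields:

```
;; <waveform-id> <channel> <speaker> <begin-time> <end-time> [<label>] transcript
anjali 1 S1 0.01 7.06 <o,f0,unknown> IGNORE_TIME_SEGMENT_IN_SCORING
anjali 1 S2 7.06 10.59 <o,f0,unknown> IGNORE_TIME_SEGMENT_IN_SCORING
anjali 1 S3 10.59 20.65 <o,f0,unknown> IGNORE_TIME_SEGMENT_IN_SCORING
```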

We are working on improving segmentation with systems other than LIUM, and will announce here when something becomes available that does a better job (although it will most likely be tuned for English).

madhavsund commented 7 years ago

My target is just to recognize the audio content and generate subtitles for the video. So, can the speaker diarization be skipped?

riebling commented 7 years ago

The shared references are interesting, in that the two transcriptions are completely different. I'm not sure how the system is even capable of doing that, unless perhaps two files with the same name were transcribed, or one file was transcribed partially, had errors, and was then transcribed again, such that different partial result data files were left behind.

In any case, there is definitely a source of rounding error in going from the 3-decimal-place precision of the .hyp (hypothesis) files to the 2 decimal places in the .trs (transcription) file, created by the Python program scripts/hyp2trs.py. If you find out more details, please share them back here to help figure this out.
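Just to illustrate the kind of drift that conversion can introduce (this is not the actual code of scripts/hyp2trs.py, and the times below are made up), here is a minimal sketch:

```python
# Illustrative only -- not the contents of scripts/hyp2trs.py.
# Shows how writing 3-decimal hypothesis times with only 2 decimal places
# can shift segment boundaries by a few milliseconds.

hyp_times = [0.015, 7.065, 10.585, 20.655]  # made-up .hyp times in seconds

for t in hyp_times:
    trs_time = float("%.2f" % t)            # value as it would appear in the .trs file
    print("hyp %.3f -> trs %.2f (shift %+.3f s)" % (t, trs_time, trs_time - t))
```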

riebling commented 7 years ago

Oops, I meant to say: to generate automatic subtitles you will still need to break the audio into utterances somehow, since you want the subtitles to appear in a time-ordered manner, with gaps where the text on screen remains for a while, then goes away and is replaced by new text. This requires segmentation, but not diarization. Otherwise you'll end up with one very long screen filled with the entire text of the video, which is not desirable! :)
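As a rough sketch of that last step (assuming you already have time-stamped segments from some segmentation step; the segment list and output file name below are made up for illustration), turning segments into SRT cues is straightforward:

```python
# Minimal sketch: write time-stamped segments out as SRT cues.
# The segments and output file name are placeholders; in the pipeline the
# times and text would come from the segmenter and the recognizer.

def srt_timestamp(seconds):
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3600000)
    m, ms = divmod(ms, 60000)
    s, ms = divmod(ms, 1000)
    return "%02d:%02d:%02d,%03d" % (h, m, s, ms)

segments = [
    (0.01, 7.06, "first recognized utterance"),
    (7.06, 10.59, "second recognized utterance"),
    (10.59, 20.65, "third recognized utterance"),
]

with open("subtitles.srt", "w") as srt:
    for i, (start, end, text) in enumerate(segments, 1):
        srt.write("%d\n%s --> %s\n%s\n\n"
                  % (i, srt_timestamp(start), srt_timestamp(end), text))
```

(One side note: strict SRT uses a comma before the milliseconds, as above, whereas the file you attached uses a period.)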