patrickschu / textgrid-convert

textgrid-convert converts audio transcripts such as sbv or srt files to Praat and DARLA compatible TextGrids.

Error in reading multi-line subtitle #43

Closed: jhdeov closed this issue 2 years ago

jhdeov commented 2 years ago

If the text of a subtitle in an SRT file spans multiple lines, the SRT parser fails because it misinterprets a later line as the timestamp. For example, an SRT with:

139
00:07:06,467 --> 00:07:08,847
After that, I studied Armenian Studies .

140
00:07:08,847 --> 00:07:12,957
In university? 
Hamazkayin had opened

141
00:07:12,957 --> 00:07:16,827
their own Armenian Studies department.

The extra line "Hamazkayin had opened" causes the parser to read the number 141 as a timestamp:

start, end = timestamps.split(time_stamp_sep)
ValueError: not enough values to unpack (expected 2, got 1)

As a workaround, I have to manually join each multi-line subtitle into a single line.
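The error in the traceback above can be reproduced in isolation. The snippet below is only a sketch: the helper `split_timestamps` is hypothetical, and the separator value `" --> "` is an assumption based on the SRT format, not taken from the textgrid-convert source. It shows that a line without the separator, such as the cue number `141`, splits into a single element, so unpacking into `start, end` fails.

```python
# Hypothetical reproduction; the separator value is an assumption based on SRT.
time_stamp_sep = " --> "


def split_timestamps(line):
    """Split a timestamp line into (start, end), as in the traceback."""
    start, end = line.split(time_stamp_sep)
    return start, end


# A real timestamp line splits into exactly two parts.
print(split_timestamps("00:07:12,957 --> 00:07:16,827"))

# A cue number contains no separator, so split() returns one element
# and the two-name unpacking raises ValueError.
try:
    split_timestamps("141")
except ValueError as err:
    print(err)
```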

patrickschu commented 2 years ago

Thanks, @jhdeov, for filing the detailed report.

jhdeov commented 2 years ago

There's this other Python script that doesn't have the multi-line problem. I don't know whether you can reuse parts of its code to fix this error. As for the file, the excerpt was from this public SRT.

patrickschu commented 2 years ago

The issue is with the parse_transcription() function in textgrid_convert/srtParser.py:29.
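One way to make an SRT parser tolerate multi-line text is to split the file into cues on blank lines instead of assuming a fixed number of lines per cue. The sketch below illustrates that idea only; `parse_srt_blocks` and `TIME_STAMP_SEP` are hypothetical names, and this is not the code in srtParser.py or the change made in the eventual fix.

```python
import re

TIME_STAMP_SEP = " --> "  # assumed separator on SRT timestamp lines


def parse_srt_blocks(srt_text):
    """Return a list of (index, start, end, text) tuples, one per cue."""
    cues = []
    # Normalize Windows line endings, then split cues on blank lines.
    normalized = srt_text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) < 2:
            continue  # skip blocks without an index and a timestamp line
        # First line: cue number; second: timestamps; the rest: text.
        index, timestamps, *text_lines = lines
        start, end = timestamps.split(TIME_STAMP_SEP)
        # Join however many text lines the cue has into one string.
        cues.append((int(index), start, end, " ".join(text_lines)))
    return cues


# The excerpt from the report, including the two-line cue 140.
sample = "\n".join([
    "139",
    "00:07:06,467 --> 00:07:08,847",
    "After that, I studied Armenian Studies .",
    "",
    "140",
    "00:07:08,847 --> 00:07:12,957",
    "In university? ",
    "Hamazkayin had opened",
    "",
    "141",
    "00:07:12,957 --> 00:07:16,827",
    "their own Armenian Studies department.",
])

for cue in parse_srt_blocks(sample):
    print(cue)
```

Because the timestamp is always taken from the second line of a blank-line-delimited block, an extra text line can no longer shift the cue number into the timestamp position.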

patrickschu commented 2 years ago

Hi @jhdeov, I fixed this issue; see PR #46.

Can you try pip install textgrid-convert==0.4.3 and see if things work now? Thanks!

jhdeov commented 2 years ago

@patrickschu I have general pip problems, so I relied on just the GitHub source code. It looks like release 0.4.3 isn't online yet? But if the files/snippets from the above comments work on your side, then I believe it :)

patrickschu commented 2 years ago

Hi @jhdeov, ah, I see. Yes, I used your sample files in the unit tests.