SRTParseError: Expected contiguous start of match or end of input

UrosOgrizovic commented 3 years ago

Environment (please complete the following information):

OS: Windows 10
python version 3.8.3
subsync version 0.4.11

Describe the bug

I'm aware that an issue about this same problem has already been opened (and closed), but I can't seem to solve it.

I have some subtitles like so:

1
00:00:01 --> 00:00:06,250000
hello watching it yesterday and I will always different Partners and I don't know any

2
00:00:06,250000 --> 00:00:11,500000
person alive but it doesn't horrible's right I mean about the terminology when someone's greatest

...

and I'm trying to synchronize them to a video. The problem I'm getting is in srt.py -> _check_contiguity(), which is being called on line 350 of that same file, inside parse(). Essentially, expected_start != actual_start is the problem. expected_start is 0, whereas actual_start is 118. In other words, SRT_REGEX.finditer(srt) skips the first subtitle block, causing _check_contiguity() to raise an SRTParseError. Why is that?
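For readers unfamiliar with this kind of check, here is a minimal sketch of the idea, assuming a toy block pattern rather than srt's real SRT_REGEX. The names BLOCK and find_gaps are made up for illustration and are not part of srt. The point is that finditer() silently skips text it cannot match, so the parser has to compare where each match starts against where the previous one ended.

```python
import re

# Toy pattern (NOT srt's real regex): a "block" is a run of digits plus a newline.
BLOCK = re.compile(r"[0-9]+\n")


def find_gaps(text):
    """Return (expected_start, actual_start) pairs for text that finditer skipped."""
    gaps = []
    expected_start = 0
    for match in BLOCK.finditer(text):
        actual_start = match.start()
        if actual_start != expected_start:
            # finditer jumped over unmatched content between the two offsets
            gaps.append((expected_start, actual_start))
        expected_start = match.end()
    if expected_start != len(text):
        # trailing content after the last match was never consumed
        gaps.append((expected_start, len(text)))
    return gaps


print(find_gaps("12\n34\n"))    # []
print(find_gaps("oops\n34\n"))  # [(0, 5)]
```

In the second call the first line does not fit the pattern, so the first match starts at offset 5 instead of 0, which mirrors the expected_start / actual_start mismatch described above.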

To Reproduce

Copy the following code into a Python file:

import re

RGX_TIMESTAMP_MAGNITUDE_DELIM = r"[,.:，．。：]"
RGX_TIMESTAMP_FIELD = r"[0-9]+"
RGX_TIMESTAMP = RGX_TIMESTAMP_MAGNITUDE_DELIM.join([RGX_TIMESTAMP_FIELD] * 4)
RGX_TIMESTAMP_PARSEABLE = r"^{}$".format(
    RGX_TIMESTAMP_MAGNITUDE_DELIM.join(["(" + RGX_TIMESTAMP_FIELD + ")"] * 4)
)
RGX_INDEX = r"-?[0-9]+\.?[0-9]*"
RGX_PROPRIETARY = r"[^\r\n]*"
RGX_CONTENT = r".*?"
RGX_POSSIBLE_CRLF = r"\r?\n"

TS_REGEX = re.compile(RGX_TIMESTAMP_PARSEABLE)
MULTI_WS_REGEX = re.compile(r"\n\n+")
SRT_REGEX = re.compile(
    r"\s*({idx})\s*{eof}({ts}) *-[ -] *> *({ts}) ?({proprietary})(?:{eof}|\Z)({content})"
    r"(?:{eof}|\Z)(?:{eof}|\Z|(?=(?:{idx}\s*{eof}{ts})))"
    r"(?=(?:{idx}\s*{eof}{ts}|\Z))".format(
        idx=RGX_INDEX,
        ts=RGX_TIMESTAMP,
        proprietary=RGX_PROPRIETARY,
        content=RGX_CONTENT,
        eof=RGX_POSSIBLE_CRLF,
    ),
    re.DOTALL,
)

srt = '''1
00:00:01 --> 00:00:06,250000
hello watching it yesterday and I will always different Partners and I don't know any

2
00:00:06,250000 --> 00:00:11,500000
person alive but it doesn't horrible's right I mean about the terminology when someone's greatest'''

for match in SRT_REGEX.finditer(srt):
    print(match)

Expected behavior

Both subtitle blocks should be matched. Instead, the code given above prints only <re.Match object; span=(116, 253), match="\n\n2\n00:00:06,250000 --> 00:00:11,500000\nperso>, i.e. it skips the first subtitle block, which is not what I want.

Output

Here's the output of running ffs video.mp4 -i unsynchronized.srt -o synchronized.srt:

File "...\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "...\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "...\venv\Scripts\ffs.exe\__main__.py", line 9, in <module>
  File "...\venv\lib\site-packages\ffsubsync\ffsubsync.py", line 441, in main
    return run(args)['retval']
  File "...\venv\lib\site-packages\ffsubsync\ffsubsync.py", line 338, in run
    sync_was_successful = try_sync(args, reference_pipe, result)
  File "...\venv\lib\site-packages\ffsubsync\ffsubsync.py", line 167, in try_sync
    raise exc
  File "...\venv\lib\site-packages\ffsubsync\ffsubsync.py", line 118, in try_sync
    srt_pipe.fit(srtin)
  File "...\venv\lib\site-packages\ffsubsync\sklearn_shim.py", line 212, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "...\venv\lib\site-packages\ffsubsync\sklearn_shim.py", line 175, in _fit
    X, fitted_transformer = _fit_transform_one(
  File "...\venv\lib\site-packages\ffsubsync\sklearn_shim.py", line 368, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "...\venv\lib\site-packages\ffsubsync\sklearn_shim.py", line 40, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "...\venv\lib\site-packages\ffsubsync\subtitle_parser.py", line 114, in fit
    raise exc
  File "...\venv\lib\site-packages\ffsubsync\subtitle_parser.py", line 99, in fit
    _preprocess_subs(parsed_subs,
  File "...\venv\lib\site-packages\ffsubsync\subtitle_parser.py", line 47, in _preprocess_subs
    next_sub = GenericSubtitle.wrap_inner_subtitle(next(subs))
  File "...\venv\lib\site-packages\srt.py", line 350, in parse
    _check_contiguity(srt, expected_start, actual_start, ignore_errors)
  File "...\venv\lib\site-packages\srt.py", line 408, in _check_contiguity
    raise SRTParseError(expected_start, actual_start, unmatched_content)
srt.SRTParseError: Expected contiguous start of match or end of input at char 0, but started at char 118 (unmatched content: "1\r\n00:00:01 --> 00:00:06,250000\r\nhello watching it yesterday and I will always different Partners and I don't know any")

smacke commented 3 years ago

Thanks for filing an issue! This looks like it may benefit from the srt maintainer's eyes, but in my tests, the parse error comes from the timestamp 00:00:01, which has no fractional part. This is probably because the timestamp regular expression expects 4 "fields":

RGX_TIMESTAMP = RGX_TIMESTAMP_MAGNITUDE_DELIM.join([RGX_TIMESTAMP_FIELD] * 4)
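A quick way to confirm this is to test the timestamp pattern on its own. The sketch below copies the three timestamp constants from the reproduction script in the report; the helper name is_srt_timestamp is hypothetical and only exists for this illustration.

```python
import re

# Constants copied from the reproduction script in the report.
RGX_TIMESTAMP_MAGNITUDE_DELIM = r"[,.:，．。：]"
RGX_TIMESTAMP_FIELD = r"[0-9]+"
RGX_TIMESTAMP = RGX_TIMESTAMP_MAGNITUDE_DELIM.join([RGX_TIMESTAMP_FIELD] * 4)


def is_srt_timestamp(text):
    """Hypothetical helper: True if text is exactly four delimited numeric fields."""
    return re.fullmatch(RGX_TIMESTAMP, text) is not None


print(is_srt_timestamp("00:00:06,250000"))  # True: hours, minutes, seconds, fraction
print(is_srt_timestamp("00:00:01"))         # False: only three fields
```

Since the first block's start timestamp fails this pattern, the whole block cannot match, and the regex only picks up from the second block onward.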

I can work on getting a PR out later, but for now, one workaround would be to manually add those fields or to use a sed command to add them back in automatically, something like:

sed -E 's/([0-9]+:[0-9]+:[0-9]+) /\1,0/g' bad.srt > good.srt

The file parses after running it through this transformation.
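For anyone without sed (the report is from Windows), roughly the same workaround can be sketched in Python. This is an approximation under stated assumptions, not a port of the sed command: the names MISSING_FRACTION and pad_timestamps are hypothetical, only ASCII delimiters are considered, and only lines containing "-->" are touched.

```python
import re

# Hypothetical pattern: an H:M:S run that is not part of a longer timestamp,
# i.e. not preceded or followed by another digit or an ASCII delimiter.
MISSING_FRACTION = re.compile(r"(?<![0-9,.:])([0-9]+:[0-9]+:[0-9]+)(?![0-9,.:])")


def pad_timestamps(srt_text):
    """Append ",0" to H:M:S timestamps on lines that contain "-->"."""
    fixed_lines = []
    for line in srt_text.splitlines(keepends=True):
        if "-->" in line:
            line = MISSING_FRACTION.sub(r"\1,0", line)
        fixed_lines.append(line)
    return "".join(fixed_lines)


sample = "1\n00:00:01 --> 00:00:06,250000\nhello at 10:15:30 today\n"
print(pad_timestamps(sample))
```

Unlike the sed one-liner, this version keeps the space before the arrow and also pads an end timestamp that lacks a fractional part, while leaving subtitle text such as "10:15:30" alone.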

cdown commented 3 years ago

Just saw this and merged https://github.com/cdown/srt/commit/8eb45f98e8497b230d9264661ca2f3829504ee69. :-)

smacke / ffsubsync

SRTParseError: Expected contiguous start of match or end of input #123