use bytestrings - Githubissues

milahu commented 8 months ago

fix #43

simply ignore text encoding, and use raw bytestrings dealing with text encoding is deferred to the user

this allows handling "broken" files with multiple encodings

probably this is too much change, so i merged this into pysubs2bytes

import pysubs2

subfile_path = "test.srt"

ssa_event_list = pysubs2.load(subfile_path)

for ssa_event in ssa_event_list:
    text = ssa_event.text
    if type(text) == bytes:
        text = text.decode("utf8")
    print("text", repr(text))

tkarabela commented 6 months ago

I think this is too user-hostile for the use-cases where you actually want to modify the subtitle texts. I would like to get the same effect with Unicode surrogate escapes, which should improve the situation ie. prevent some hard errors when reading subtitle files that happen today, without burdening everyone with bytes all over the place.

milahu commented 6 months ago

mixed-encoding files are a rare problem for example caused by inserting utf8 advertisments into non-utf8 subtitles

handling mixed-encoding files requires to detect boundaries between different encodings and for subtitles, these boundaries are the boundaries between text blocks

returning bytes instead of str could be a fallback when the file encoding was not autodetected (#43) or when the file has multiple encodings

but pysubs could still return strings with a more fine-grained encoding detection (guess encoding per textblock, not per file)

strings would be more user-friendly but less performant for my app, im ignoring text-encoding completely with the tradeoff of more complexity in other parts of my code (bytes regex)

Unicode surrogate escapes

that just ignores the problem of mixed-encoding files

the output will also be a mixed-encoding file but maybe the user wants a single-encoding file

>>> "ö".encode("latin1")
b'\xf6'

>>> "ö".encode("latin1").decode("utf8", errors="surrogateescape")
'\udcf6'

>>> "ö".encode("latin1").decode("utf8", errors="surrogateescape").encode("utf8", errors="surrogateescape")
b'\xf6'

>>> print("ö".encode("latin1").decode("utf8", errors="surrogateescape"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 0: surrogates not allowed

tkarabela commented 6 months ago

I hope that mixed-encoding files are rare, since they sound like hell to deal with :) I'd like if it was possible to use the library to parse them, at least in principle, but I'm not sure how much support the library itself should provide. I think being able to get the raw bytes of the individual subtitle texts is enough - this is now possible with the latest version of my code, the user can .encode("utf8", errors="surrogateescape") the individual subtitles and then re-interpret them with a different encoding.

The performance aspect is something I haven't really thought about, and isn't my goal for the library. Your bytes-oriented approach will surely be more efficient. But at that point, using a C++/Rust/etc. library would surely be order of magnitude faster still. (Though, I don't know what libraries are available. I've thought about porting this library to Rust as a learning excercise, but I really don't have any use for it, so I haven't ^^;)

>>> print("ö".encode("latin1").decode("utf8", errors="surrogateescape"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 0: surrogates not allowed

Thanks for pointing this error, that's a bit nasty. I should note this in the documentation, if the next release is going to have surrogate escapes as default.

tkarabela / pysubs2

use bytestrings #84