Closed: milahu closed 6 months ago
I think this is too user-hostile for the use cases where you actually want to modify the subtitle texts. I would like to get the same effect with Unicode surrogate escapes, which should improve the situation, i.e. prevent some of the hard errors that happen today when reading subtitle files, without burdening everyone with bytes all over the place.
mixed-encoding files are a rare problem, caused for example by inserting utf8 advertisements into non-utf8 subtitles
handling mixed-encoding files requires detecting the boundaries between different encodings. for subtitles, these boundaries are the boundaries between text blocks
returning bytes instead of str could be a fallback when the file encoding was not autodetected (#43) or when the file has multiple encodings
but pysubs could still return strings with a more fine-grained encoding detection (guess the encoding per text block, not per file)
strings would be more user-friendly but less performant. for my app, I'm ignoring text encoding completely, with the tradeoff of more complexity in other parts of my code (bytes regex)
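A rough sketch of what per-block detection could look like, using plain trial decoding. This is not part of the library: `decode_block` and the candidate list are made up for illustration, and trial decoding is only a heuristic (a legacy-encoded block that happens to be valid UTF-8 will be misdetected).

```python
def decode_block(raw, candidates=("utf-8", "cp1252")):
    """Decode one subtitle text block; return (text, encoding_used).

    Hypothetical helper: tries each candidate encoding in order and
    keeps the first one that decodes without error.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep the undecodable bytes as surrogate escapes
    # so they can still be written back unchanged.
    return raw.decode("utf-8", errors="surrogateescape"), "utf-8+surrogateescape"


blocks = [
    "Grüße".encode("latin-1"),              # legacy-encoded subtitle text
    "Visit example.com ♥".encode("utf-8"),  # UTF-8 advertisement inserted later
]
decoded = [decode_block(b) for b in blocks]
```

Each block gets its own encoding here, so the result can be written out as a single-encoding file.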
Unicode surrogate escapes
that just ignores the problem of mixed-encoding files
the output will also be a mixed-encoding file but maybe the user wants a single-encoding file
>>> "ö".encode("latin1")
b'\xf6'
>>> "ö".encode("latin1").decode("utf8", errors="surrogateescape")
'\udcf6'
>>> "ö".encode("latin1").decode("utf8", errors="surrogateescape").encode("utf8", errors="surrogateescape")
b'\xf6'
>>> print("ö".encode("latin1").decode("utf8", errors="surrogateescape"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 0: surrogates not allowed
I hope that mixed-encoding files are rare, since they sound like hell to deal with :) I'd like it to be possible to use the library to parse them, at least in principle, but I'm not sure how much support the library itself should provide. I think being able to get the raw bytes of the individual subtitle texts is enough - this is now possible with the latest version of my code: the user can .encode("utf8", errors="surrogateescape") the individual subtitles and then re-interpret them with a different encoding.
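To make that concrete, here is a minimal sketch of the round trip using only the standard codecs. It assumes the text was originally decoded as UTF-8 with `errors="surrogateescape"`; `reinterpret` is a hypothetical helper, not a library function.

```python
def reinterpret(text, encoding):
    """Recover the original bytes of a surrogate-escaped string and
    decode them again with a different encoding (hypothetical helper)."""
    raw = text.encode("utf-8", errors="surrogateescape")
    return raw.decode(encoding)


original = "ö".encode("latin-1")                               # b'\xf6'
as_read = original.decode("utf-8", errors="surrogateescape")   # '\udcf6'
fixed = reinterpret(as_read, "latin-1")                        # 'ö'
```

This only works if the same codec (`utf-8` here) is used for the escape and the unescape step, otherwise the recovered bytes will not match the file.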
The performance aspect is something I haven't really thought about, and it isn't my goal for the library. Your bytes-oriented approach will surely be more efficient. But at that point, using a C++/Rust/etc. library would surely be an order of magnitude faster still. (Though I don't know what libraries are available. I've thought about porting this library to Rust as a learning exercise, but I really don't have any use for it, so I haven't ^^;)
>>> print("ö".encode("latin1").decode("utf8", errors="surrogateescape"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 0: surrogates not allowed
Thanks for pointing out this error, that's a bit nasty. I should note this in the documentation if the next release is going to have surrogate escapes as the default.
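One possible workaround for user code, sketched with the standard `backslashreplace` error handler: check for escaped bytes and turn them into visible escape sequences before printing. Both helper names are made up for this example.

```python
def has_surrogate_escapes(text):
    """True if the string contains bytes smuggled in by surrogateescape,
    which map to the lone surrogates U+DC80..U+DCFF."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)


def printable(text):
    """Replace lone surrogates with backslash escapes so the string can
    be printed or logged without raising UnicodeEncodeError."""
    return text.encode("utf-8", errors="backslashreplace").decode("utf-8")


s = "ö".encode("latin-1").decode("utf-8", errors="surrogateescape")
print(printable(s))
```

The printable form is for display only; it no longer round-trips to the original bytes.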
fix #43
simply ignore text encoding and use raw bytestrings. dealing with text encoding is deferred to the user
this allows handling "broken" files with multiple encodings
this is probably too much change, so I merged this into pysubs2bytes
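For illustration, this is roughly what working on raw bytestrings looks like with a bytes regex. It is a sketch using only the standard `re` module, not the pysubs2bytes API; the ad pattern and `strip_ads` are hypothetical.

```python
import re

# Hypothetical pattern: bytes patterns match ASCII text regardless of
# how the rest of the block is encoded.
AD_PATTERN = re.compile(rb"(?i)downloaded from \S+")


def strip_ads(raw_block):
    """Remove ad lines from one subtitle text block without decoding it."""
    return AD_PATTERN.sub(b"", raw_block).strip()


# latin-1 encoded text followed by an ASCII ad line
block = "Schön.\nDownloaded from example.com".encode("latin-1")
cleaned = strip_ads(block)
```

The non-ASCII bytes pass through untouched, so the block never has to be decoded.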