ukdtom / SRT2UTF-8.bundle

Plex Agent, that'll convert sidecar subtitle files into UTF-8, if not

Serbian subtitle means they can be mostly Latin and some in Cyrillic #45

Closed loa11 closed 5 years ago

loa11 commented 5 years ago

Serbian subtitles can be mostly Latin, with only some in Cyrillic. They are not just Cyrillic. If I add .SR. to the filename to tell Plex that the subtitle is Serbian, SRT2UTF-8 breaks the ANSI parts when converting, because it is set to treat all "SR" subtitles as Cyrillic. After the conversion the subtitle ends up mixed, with broken Cyrillic parts and the rest in Latin. I don't know why some people still use ANSI; it seems they are using some old version of Subtitle Workshop, which is ANSI-only.

Now I need to rename every downloaded subtitle from "SR" (Serbian) to "HR" (Croatian) to prevent the SRT2UTF-8 plugin from breaking all the Latin subtitles. I don't know how to work around the problem: should I remove serbian.edm, or should I copy croatian.edm to serbian.edm? But if I do that, I would lose Cyrillic support for the cases where a Serbian subtitle is in Cyrillic.

Is it possible for you to add a detector in serbian.edm that checks whether the subtitle is Cyrillic or Latin? For example, if Latin letters are detected and the srt contains more than 65% Latin letters, presume it is completely Latin.
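
The 65% idea above can be sketched roughly as follows. This is only an illustration, not code from the plugin: the function names `latin_ratio` and `guess_serbian_codepage` are hypothetical, and it assumes the file is a single-byte ANSI file in which Cyrillic letters (cp1251) sit in the high byte range while most Serbian Latin letters (cp1250) are plain ASCII.

```python
def latin_ratio(raw: bytes) -> float:
    """Share of ASCII letters among all letter-like bytes in a non-UTF-8 file."""
    # ASCII A-Z / a-z are Latin letters in every single-byte codepage.
    ascii_letters = sum(1 for b in raw if 65 <= b <= 90 or 97 <= b <= 122)
    # Bytes >= 0x80 hold Cyrillic letters in cp1251, or accented Latin in cp1250.
    high_bytes = sum(1 for b in raw if b >= 0x80)
    total = ascii_letters + high_bytes
    # A file with no letters at all is treated as Latin (assumption).
    return ascii_letters / total if total else 1.0


def guess_serbian_codepage(raw: bytes, threshold: float = 0.65) -> str:
    """Hypothetical helper: presume Latin (cp1250) above the threshold, else Cyrillic (cp1251)."""
    return "cp1250" if latin_ratio(raw) > threshold else "cp1251"


if __name__ == "__main__":
    latin = "Šta radiš? Dobro sam, hvala.".encode("cp1250")
    cyrillic = "Шта радиш? Добро сам, хвала.".encode("cp1251")
    print(guess_serbian_codepage(latin))
    print(guess_serbian_codepage(cyrillic))
```

The threshold is the figure suggested in this comment, not a tuned value; real subtitles with many formatting tags or English lines could shift the ratio.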

ukdtom commented 5 years ago

No idea what you mean by croatian.edm or serbian.edm?

But sadly, I have no way of digesting the individual letters in a sub, and leave that to the Beautiful Soup plugin

That said, I default to codepage iso-8859-2 when .sr is part of the filename, which is Latin, so it is very strange that you are seeing this

https://github.com/ukdtom/SRT2UTF-8.bundle/blob/master/Contents/Code/CP_Windows_ISO.py#L160

loa11 commented 5 years ago

Here is an example showing that, with a codepage converter tool, each of those is converted without any problems: https://ibb.co/3k4HRsJ

The main problem is the initial detection, because the encoding of the srt has to be detected first. In this case it's "Central European (Windows) 1250". Most Serbian Latin non-UTF-8 subs are ANSI "Central European (Windows) 1250", but on some rare occasions there are some in "Eastern European ISO8859-2" encoding.

I would like to have a manual mode before the auto mode, because once SRT2UTF-8 is done, the change to the srt is irreversible. Could you, for now, make the plugin presume that a non-UTF-8 "SR" file is ANSI "Central European (Windows) 1250" before conversion? I also saw that Subtitle Edit lacks proper detection too, and it also converts the SR srt subs incorrectly, because it can't auto-detect ANSI "Central European (Windows) 1250" before conversion.
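
To illustrate why telling these two encodings apart matters: in Windows-1250 and ISO 8859-2 the letters č, ć and đ share the same bytes, but Š, š, Ž and ž do not. A minimal sketch of a check built on that difference is below. It is an assumption-laden illustration, not the plugin's algorithm; `guess_latin2_variant` is a hypothetical name, and the fallback to cp1250 simply follows the claim in this comment that cp1250 is the more common case.

```python
# Bytes for Š, Ž, š, ž in Windows-1250 (C1 control codes in ISO 8859-2).
CP1250_ONLY = {0x8A, 0x8E, 0x9A, 0x9E}
# Bytes for Š, Ž, š, ž in ISO 8859-2 (©, ®, ą, ľ in Windows-1250).
ISO8859_2_ONLY = {0xA9, 0xAE, 0xB9, 0xBE}


def guess_latin2_variant(raw: bytes) -> str:
    """Hypothetical helper: pick cp1250 or iso-8859-2 from the Š/š/Ž/ž byte positions."""
    cp_hits = sum(1 for b in raw if b in CP1250_ONLY)
    iso_hits = sum(1 for b in raw if b in ISO8859_2_ONLY)
    # Fall back to cp1250 on a tie or when no telltale bytes are present (assumption).
    return "iso-8859-2" if iso_hits > cp_hits else "cp1250"


if __name__ == "__main__":
    sample = "Šta radiš, žabo?"
    print(guess_latin2_variant(sample.encode("cp1250")))
    print(guess_latin2_variant(sample.encode("iso-8859-2")))
```

A file that contains none of these four letters decodes to the same Serbian Latin text under either codepage, so the fallback is harmless in that case.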
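
As a rough sketch of what such a manual, reversible conversion could look like outside the plugin: the function below is hypothetical (`convert_with_backup` is not part of SRT2UTF-8), takes the source encoding as an explicit argument defaulting to cp1250, and keeps a `.bak` copy of the original bytes so a wrong guess can be undone.

```python
import shutil


def convert_with_backup(path: str, source_encoding: str = "cp1250") -> bool:
    """Convert a subtitle file to UTF-8 in place, keeping the original as path + '.bak'.

    Returns False and leaves the file untouched if it is already valid UTF-8.
    """
    with open(path, "rb") as fh:
        raw = fh.read()
    try:
        raw.decode("utf-8")
        return False  # already UTF-8 (plain ASCII also counts), nothing to do
    except UnicodeDecodeError:
        pass
    # Decode before touching the disk: a byte undefined in the chosen
    # codepage raises here, so a bad choice cannot damage the file.
    text = raw.decode(source_encoding)
    shutil.copy2(path, path + ".bak")  # keep the original bytes
    with open(path, "wb") as fh:
        fh.write(text.encode("utf-8"))
    return True
```

Calling `convert_with_backup("movie.sr.srt")` would presume cp1250, while passing `source_encoding="cp1251"` would handle a Cyrillic file; the filename here is only an example.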

ukdtom commented 5 years ago

Sorry, but nope....

Detection is a guess, and since so many people depend on it now, I simply don't dare change the algorithm

loa11 commented 5 years ago

Could you make a beta version just to try? It would not be the main version, and I could test it. As far as I know, everyone from the Balkans on the forums says that SRT2UTF-8 is broken, and very rarely is anyone using it on Plex. Could you please do me a favor? Thanks in advance.

ukdtom commented 5 years ago

No promises here, and I need a zip with two srt files in Cyrillic, as well as two srt files in Latin, and a txt file in English telling which is which... And all in one single zip

Also note that it's kind of uphill for me here, since I haven't touched this plugin for a year, and don't know the Serbian language ;)

loa11 commented 5 years ago

Thanks. Could you give me a few days to get the very rare "Eastern European ISO8859-2" subs?

ukdtom commented 5 years ago

Ping

loa11 commented 5 years ago

I will go one by one. I still couldn't get the "Eastern European ISO8859-2" encoded subs. To begin with, I will add the most common ones, Ansi Windows 1252 encoded in Central European Windows encoding. I will also show you with pictures what the wrong encoding can do to the sub. https://ufile.io/c9noarbc

Here are the images: http://img18127.imagevenue.com/img.php?image=986305883_ProperencodingtoAnsi1252CentralEuropeanWindows_122_37lo.jpg

http://img18127.imagevenue.com/img.php?image=986307280_WrongencodingtoCentralEuropeanISO_122_1176lo.jpg

http://img18116.imagevenue.com/img.php?image=986308728_WrongencodingtoCyrillicWindows_122_346lo.jpg

ukdtom commented 5 years ago

Closing this due to lack of test files. Can be reopened if needed, when files have been provided