kaegi / alass

"Automatic Language-Agnostic Subtitle Synchronization"
GNU General Public License v3.0

Feature request: allow different subtitles charsets (other than UTF8) to be processed #25

Closed: wellloaded closed this issue 3 years ago

wellloaded commented 3 years ago

I'm having some issues and have to use iconv to force a UTF-8 conversion of my .srt files, as alass will not process them otherwise.

Could alass handle other charsets itself, at the code level?

Thanks

kaegi commented 3 years ago

Have you tried --encoding-inc and --encoding-ref?

wellloaded commented 3 years ago

Have you tried --encoding-inc and --encoding-ref?

Actually I hadn't, but it seems the user is expected to know the charset of the incorrect subtitle file (or the reference one).

It is nice to see this was "considered", but it still needs manual interaction, which is what I was trying to avoid. I'm wondering whether alass could work out the charset automatically (e.g. like the file command does).

Thanks!


Edit:

# Pick the first video and the first .srt file in the current directory.
Video=$(ls -1 | grep -Ei '\.(avi|mkv|asf|wmv|mp4|mpg|mpeg|divx|m4v)$' | head -1)
SubName=$(ls -1 | grep -Ei '\.srt$' | head -1)
# file -bi prints e.g. "text/plain; charset=iso-8859-1"; keep the part after "=".
CharSet=$(file -bi "./$SubName" | cut -f2 -d "=")
SubNameResync=".alass_$SubName"
alass --encoding-inc "$CharSet" "$Video" "$SubName" "$SubNameResync"

Running the above script recursively on my video folders does the job OKish; I still think this could be better handled by alass internally with a charset autodetection routine.

P.S. file occasionally returns an "unknown-8bit" which alass doesn't understand as CharSet input.
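
One way to cope with the "unknown-8bit" case is to treat it as "no usable charset" and leave out --encoding-inc for that file. This is only a sketch: it assumes file -bi prints its usual "type; charset=name" form, and the helper name charset_from_mime is made up for illustration.

```shell
#!/bin/sh
# charset_from_mime is a hypothetical helper, not part of alass.
# Input:  one line as printed by `file -bi`, e.g. "text/plain; charset=iso-8859-1".
# Output: the charset name on stdout, or nothing (exit status 1) when
#         file could not name a usable charset.
charset_from_mime() {
    case "$1" in
        *charset=*) ;;              # a charset field is present
        *) return 1 ;;              # no charset field at all
    esac
    cs=${1##*charset=}              # strip everything up to and including "charset="
    case "$cs" in
        unknown-8bit|binary|"") return 1 ;;
    esac
    printf '%s\n' "$cs"
}

# Possible use inside the script above (not run here):
#   if CharSet=$(charset_from_mime "$(file -bi "./$SubName")"); then
#       alass --encoding-inc "$CharSet" "$Video" "$SubName" "$SubNameResync"
#   else
#       alass "$Video" "$SubName" "$SubNameResync"
#   fi
```

When file cannot name a charset, the fallback branch simply lets alass use whatever its default is, which may or may not work for a given file.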

wellloaded commented 3 years ago

As a matter of fact, I have found that file -bi is NOT reliable enough. To anybody facing this problem, I would strongly suggest forcing a UTF-8 conversion of the source .srt like this (you'll need vim installed):

vim +'set nobomb | set fenc=utf8 | x' <filename>

The above opens the file in whatever charset vim detects and saves it as UTF-8 (without a BOM) transparently.

So the earlier script becomes:

# Pick the first video and the first .srt file in the current directory.
Video=$(ls -1 | grep -Ei '\.(avi|mkv|asf|wmv|mp4|mpg|mpeg|divx|m4v)$' | head -1)
SubName=$(ls -1 | grep -Ei '\.srt$' | head -1)
# Re-save the subtitle as UTF-8 without a BOM before handing it to alass.
vim +'set nobomb | set fenc=utf8 | x' "$SubName"
CharSet=$(file -bi "./$SubName" | cut -f2 -d "=")
SubNameResync=".alass_$SubName"
alass --encoding-inc "$CharSet" "$Video" "$SubName" "$SubNameResync"
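
For anyone without vim, a similar effect can be sketched with iconv, the workaround mentioned at the top of this thread. Note the assumption: iconv does not detect anything, so the source charset has to be known already. The helper names is_utf8 and to_utf8 are invented here.

```shell
#!/bin/sh
# Hypothetical helpers; neither is part of alass.

# Succeeds when the file decodes cleanly as UTF-8 (plain ASCII also passes).
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Convert file $1 from charset $2 to UTF-8 in place, via a temporary file.
# Files that are already valid UTF-8 are left untouched.
to_utf8() {
    is_utf8 "$1" && return 0
    tmp=$(mktemp) || return 1
    if iconv -f "$2" -t UTF-8 "$1" >"$tmp"; then
        # Write back through the original path so its permissions are kept.
        cat "$tmp" >"$1"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Possible use (not run here):
#   to_utf8 "$SubName" ISO-8859-1
```

The validity check only tells UTF-8 apart from everything else; it cannot say which legacy charset a file uses, so the second argument remains a guess the user has to supply.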

HTH

kaegi commented 3 years ago

Auto-detection of character encoding, using https://github.com/thuleqaid/rust-chardet, is implemented in https://github.com/kaegi/alass/commit/874f02d9577182752a0f969b6d6b98fd65bdf1fc.