pymedusa / Medusa

Automatic Video Library Manager for TV Shows. It watches for new episodes of your favorite shows, and when they are posted it does its magic.
https://pymedusa.com
GNU General Public License v3.0

Subtitles languages not always recognised in ISO 639-2 3 character #8502

Closed Rouzax closed 4 years ago

Rouzax commented 4 years ago

Describe the bug: Most episodes these days have embedded subtitles, which I extract as srt files, but the language codes for these are most of the time the ISO 639-2 three-character codes. For English this works perfectly. Files: [image]

Medusa: [image]

But as you can see, the Dutch subtitle is not recognized, while files that are formatted with the two-character language code are picked up. Files: [image]

Medusa: [image]

Medusa (please complete the following information):

Medusa Info: Branch: master Commit: b352bb6924afcdfafce176a540d53ce405ca1312 Version: 0.4.3 Database: 44.16
Python Version: 3.7.6 (tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)]
SSL Version: OpenSSL 1.1.1d 10 Sep 2019
OS: Windows-10-10.0.17763-SP0
Locale: nl_NL.cp1252
BenjV commented 4 years ago

ISO 639-2 defines two different codes for the Dutch language. The formal (terminological) code is "nld", and "dut" is the alternative (bibliographic) code, which is not recognized.

Rouzax commented 4 years ago

I understand, but to my knowledge all embedded subs use the dut format.

BenjV commented 4 years ago

To be precise, these are not embedded. An embedded subtitle is one that is stored within the container, so within the .mkv or .mp4.

Rouzax commented 4 years ago

Sorry for the confusion, in my case they are the embedded subs but I extract and process them with Subtitle Edit before Medusa imports the episode. So they end up as external but the naming comes from the embedded subs.

Hope I make sense 😀

BenjV commented 4 years ago

No, embedded subs are only a stream. The name is created by the software you use to extract it. Why do you extract it? That seems useless to me.

Rouzax commented 4 years ago

Reason for doing so is that I parse them with https://github.com/SubtitleEdit/subtitleedit to remove all Hearing Impaired entries

BenjV commented 4 years ago

So if I understand correctly, extracting the subs is not the goal, but removing the Hearing Impaired subs from the source is? Then removing them should be sufficient, without extracting all subs, or am I missing something?

And the standard for subtitle filename extensions is not the three-letter ISO code (e.g. .eng and .dut) but the two-letter ISO code (e.g. .nl and .en). I don't know Subtitle Edit, but maybe you can change it to the two-letter code.

If not, I can make a small Python script for you that extracts the subtitles with a two-letter code as extension and deletes the hearing impaired ones.

Rouzax commented 4 years ago

My flow is as follows.

  1. Download finishes
  2. Script is kicked off
  3. Files are copied to a staging area and unpacked
  4. I then run https://github.com/willforde/mkvstrip to strip out all embedded subs that are not EN or NL (I hate 200 different subs in a file 😄)
  5. I run Subtitle Edit on the MKV which will extract the embed subs still present and fix subtitles by removing Hearing Impaired, fix common errors, etc
  6. Subtitle Edit will indeed extract using the 3-letter language code.
  7. When all is finished I call Medusa API to start the import.

My goal is to have Medusa recognize the Dutch subs when importing. So if you have a Python script that will convert 3-letter language codes to 2-letter language codes, or have Medusa understand dut, I'm a happy camper 😄
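The conversion asked for here can be sketched in a few lines of Python. This is only an illustration: the `THREE_TO_TWO` table and the `two_letter_name` helper are hypothetical names, and the table covers just the codes mentioned in this thread. Only the part directly in front of the final extension is inspected, so the other dots in a release name are left alone.

```python
# Hypothetical lookup table for illustration; extend it with the codes you need.
THREE_TO_TWO = {
    "dut": "nl",  # ISO 639-2 bibliographic code for Dutch
    "nld": "nl",  # ISO 639-2 terminological code for Dutch
    "eng": "en",
}


def two_letter_name(filename: str) -> str:
    """Return filename with a 3-letter language tag replaced by its 2-letter form.

    'show.dut.srt' becomes 'show.nl.srt'. Names without a known language tag
    are returned unchanged.
    """
    parts = filename.rsplit(".", 2)  # [base, language tag, extension]
    if len(parts) != 3:
        return filename
    base, tag, extension = parts
    short = THREE_TO_TWO.get(tag.lower())
    if short is None:
        return filename
    return f"{base}.{short}.{extension}"
```

For example, `two_letter_name("test.dut.srt")` gives `test.nl.srt`, while `test.nl.srt` passes through untouched.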

BenjV commented 4 years ago

Not only does Medusa not support the three-letter code, all media players also expect the two-letter code as the extension of subtitle files.

Still not sure why you don't keep the subs in the .mkv. I can make you a script that does this:

Input: Video file with all kinds of subs in there
Output: Video file with only Dutch and English subs in there (without hearing impaired subs)

or:

Input: Video file with all kinds of subs in there
Output: Dutch and English subtitle files with a two-letter extension, skipping the hearing impaired subs.

or:

Input: Video file with all kinds of subs in there
Output: Dutch and English subtitle files with a two-letter extension, skipping the hearing impaired subs, and a new video file without subtitles.

or:

A script that renames subtitle files with the three-letter code to subtitle files with the two-letter code.

The last is the simplest option to make; you can even use a .bat script to do that. Something like:

rename *dut.srt *nl.srt
rename *eng.srt *en.srt
p0psicles commented 4 years ago

Medusa is not going to support that. We use libs that parse the language code, so we would need to make exceptions in Python libs and JS libs?

Rouzax commented 4 years ago

Batch doesn't like that 😉. I've tried it, and I end up with test.dut.nl.srt.

I extract them because I want to edit the subs and strip out unwanted HI and other things like song lyrics, etc.

BenjV commented 4 years ago

You must not use a . in front of the dut. The point (.) is greedy.

So this is not working:
rename *.dut.srt *.nl.srt

but this works:
rename *dut.srt *nl.srt

Rouzax commented 4 years ago

You must not use . in front of the dut. The point (.) is greedy.

So this is not working. rename *.dut.srt *.nl.srt

but this works rename *dut.srt *nl.srt

For me, on Windows 10, rename *dut.srt *nl.srt gives me test.dut.srtnl.srt

BenjV commented 4 years ago

OK try this command:

rename ???????????????????????????????????????????????.dut.srt ???????????????????????????????????????????????.nl.srt

Make sure to use enough question marks (?) to cover even the longest name. Too many question marks do not matter; with too few, you miss files with longer names.

Rouzax commented 4 years ago

This is some high-tech scripting and it works 😃

BenjV commented 4 years ago

Not high tech, just working around Microsoft's stupid rename wildcard implementation.

Rouzax commented 4 years ago

To comment on

Not only Medusa does not support the 3 letter code, also all media players expect the two letter code as extension of subtitles files.

Kodi works perfectly with the .dut.srt files but I'll add in your rename command to fix it. Thanks for your help.

Rouzax commented 4 years ago

@BenjV It did work in my test, but it fails in production. It looks to be because there are multiple dots. It appears that Batch is very picky: https://superuser.com/questions/475874/how-does-the-windows-rename-command-interpret-wildcards

C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES>dir
 Volume in drive C has no label.
 Volume Serial Number is D88D-6860

 Directory of C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES

09-10-2020  09:30    <DIR>          .
09-10-2020  09:30    <DIR>          ..
09-10-2020  09:30    <DIR>          Sample
09-10-2020  09:30            70.862 the.boys.s02e08.1080p.web.h264-cakes.#4.eng.srt
09-10-2020  09:30            61.729 the.boys.s02e08.1080p.web.h264-cakes.dut.srt
09-10-2020  09:30            70.862 the.boys.s02e08.1080p.web.h264-cakes.eng.srt
09-10-2020  07:12     4.325.977.430 the.boys.s02e08.1080p.web.h264-cakes.mkv
09-10-2020  07:12               254 the.boys.s02e08.1080p.web.h264-cakes.nfo
09-10-2020  07:12             2.922 the.boys.s02e08.1080p.web.h264-cakes.srr
               6 File(s)  4.326.184.059 bytes
               3 Dir(s)  4.795.503.951.872 bytes free

C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES>ren ???????????????????????????????????????????????.dut.srt ???????????????????????????????????????????????.nl.srt
The system cannot find the file specified.

C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES>dir
 Volume in drive C has no label.
 Volume Serial Number is D88D-6860

 Directory of C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES

09-10-2020  09:30    <DIR>          .
09-10-2020  09:30    <DIR>          ..
09-10-2020  09:30    <DIR>          Sample
09-10-2020  09:30            70.862 the.boys.s02e08.1080p.web.h264-cakes.#4.eng.srt
09-10-2020  09:30            61.729 the.boys.s02e08.1080p.web.h264-cakes.dut.srt
09-10-2020  09:30            70.862 the.boys.s02e08.1080p.web.h264-cakes.eng.srt
09-10-2020  07:12     4.325.977.430 the.boys.s02e08.1080p.web.h264-cakes.mkv
09-10-2020  07:12               254 the.boys.s02e08.1080p.web.h264-cakes.nfo
09-10-2020  07:12             2.922 the.boys.s02e08.1080p.web.h264-cakes.srr
               6 File(s)  4.326.184.059 bytes
               3 Dir(s)  4.795.627.683.840 bytes free

C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES>
BenjV commented 4 years ago

OK, I can make a small python script that does the renaming for you. How do you want it to function?

  1. Rename all files in the current directory.
  2. Rename all files and get the directory via a commandline parameter
  3. Rename a specific file via a parameter on the commandline
  4. Something else
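Options 1 to 3 above could be covered by one small script, sketched here under stated assumptions. The names `LANGUAGE_MAP`, `target_for`, `rename_subtitles` and `main` are hypothetical, the mapping only holds the codes from this thread, and an existing target file is never overwritten. The location argument may be a directory or a single file and defaults to the current directory.

```python
import argparse
from pathlib import Path

# Hypothetical mapping for illustration; only the codes from this thread.
LANGUAGE_MAP = {"dut": "nl", "nld": "nl", "eng": "en"}


def target_for(path):
    """Return the renamed path for a subtitle file, or None if no rename applies."""
    parts = path.name.rsplit(".", 2)  # [base, language tag, extension]
    if len(parts) != 3 or parts[2].lower() != "srt":
        return None
    short = LANGUAGE_MAP.get(parts[1].lower())
    if short is None:
        return None
    return path.with_name(f"{parts[0]}.{short}.{parts[2]}")


def rename_subtitles(location, dry_run=False):
    """Rename one file, or every matching .srt file directly inside a directory.

    Returns a list of (old name, new name) pairs. With dry_run=True nothing
    on disk is changed.
    """
    location = Path(location)
    candidates = [location] if location.is_file() else sorted(location.glob("*.srt"))
    renamed = []
    for source in candidates:
        target = target_for(source)
        if target is None or target.exists():
            continue  # nothing to do, or refuse to overwrite an existing file
        if not dry_run:
            source.rename(target)
        renamed.append((source.name, target.name))
    return renamed


def main(argv=None):
    """Command-line entry point: main(['some/dir', '--dry-run'])."""
    parser = argparse.ArgumentParser(description="Rename .dut/.eng subtitle files")
    parser.add_argument("location", nargs="?", default=".",
                        help="directory or single file (default: current directory)")
    parser.add_argument("--dry-run", action="store_true",
                        help="only print what would be renamed")
    args = parser.parse_args(argv)
    for old, new in rename_subtitles(args.location, args.dry_run):
        print(f"{old} -> {new}")
```

Calling `main()` at the bottom of the file turns this into a script. Because the name is split on dots in Python rather than matched with `ren` wildcards, release names with many dots are not a problem.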
Rouzax commented 4 years ago

Thank you very much for that offer but I figured it out by using Bulk Rename CLI https://www.bulkrenameutility.co.uk/Download.php#DownloadBulkRenameCommand

C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES>%brc64% /DIR:"C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES" /PATTERN:"*.srt" /REPLACECI:.dut:.nl /REPLACECI:.eng:.en

Processing Folder C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES\
Filename the.boys.s02e08.1080p.web.h264-cakes.#4.eng.srt would be renamed to the.boys.s02e08.1080p.web.h264-cakes.#4.en.srt
Filename the.boys.s02e08.1080p.web.h264-cakes.dut.srt would be renamed to the.boys.s02e08.1080p.web.h264-cakes.nl.srt
Filename the.boys.s02e08.1080p.web.h264-cakes.eng.srt would be renamed to the.boys.s02e08.1080p.web.h264-cakes.en.srt
BenjV commented 4 years ago

Ok, glad to be of a little assistance.

Rouzax commented 4 years ago

Really appreciate the offer!

Rouzax commented 4 years ago

@BenjV since you offered, would you be willing to take a look at https://github.com/jobrien2001/mkvstrip ? The python script uses mkvmerge to remove unwanted subtitle and audio languages that might be part of the mkv but the script is crashing more and more and the original author does not respond.

It seems to be related to character encoding in the subtitle names (I think). Here are 2 JSON outputs of files that are crashing or not working: 1.txt 2.txt

p0psicles commented 4 years ago

And errors? Or trace back?

Rouzax commented 4 years ago

On the 1.json it just does nothing; even with debug on, the script will just stop.

Some of the errors I managed to "fix" by changing line 223 to

        process = subprocess.Popen(command, stdout=subprocess.PIPE, universal_newlines=True, encoding="utf8", errors='ignore')
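The effect of that change can be tried in isolation. The sketch below is not mkvstrip's code: `run_text` is a hypothetical helper, it uses `subprocess.run` instead of `Popen`, and it uses `errors="replace"` rather than `errors="ignore"` so that undecodable bytes stay visible as U+FFFD instead of silently disappearing. A child Python process stands in for a tool that emits a byte that is not valid UTF-8.

```python
import subprocess
import sys


def run_text(command):
    """Run command and return its stdout as text without raising on bad bytes."""
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        encoding="utf-8",   # decode explicitly instead of using the locale (cp1252 here)
        errors="replace",   # undecodable bytes become U+FFFD instead of raising
        check=True,
    )
    return completed.stdout


# Stand-in for an external tool: writes 'Espa', one Latin-1 byte (0xF1), then 'a'.
demo = [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'Espa\\xf1a')"]

# ascii() keeps the output printable on any console encoding.
print(ascii(run_text(demo)))
```

With the default strict error handling the same call would raise `UnicodeDecodeError`, which may be what made the script stop on track names such as "España".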

For the second mkv

C:\TEMP\Torrent\PROCD>"C:\Python37\python.exe" %mkvstrip% -b %MKVMergeLocation% -v -l eng,dut -s eng,dut -r Forced C:\TEMP\Torrent\PROCD\TV\1
Searching for MKV files to process.
Warning: This may take some time...
Checking C:\TEMP\Torrent\PROCD\TV\1\Tehran.S01E01.Emergency.Landing.in.Tehran.1080p.ATVP.WEB-DL.DDP5.1.H.264-NTb.mkv

C:\TEMP\Torrent\PROCD>

Did a trace with Python (first time for everything 😄) and it seems that on the first file the subtitle languages are not recognised

 --- modulename: mkvstrip, funcname: __init__
mkvstrip.py(204):         self.lang = track_data["properties"].get("language", "und")
mkvstrip.py(205):         self.codec = track_data["codec"]
mkvstrip.py(206):         self.type = track_data["type"]
mkvstrip.py(207):         self.id = track_data["id"]
mkvstrip.py(208):         self.name = track_data["properties"].get("track_name")
mkvstrip.py(209):         self.forced = track_data["properties"].get("forced_track")
mkvstrip.py(243):             track_map[track_obj.type].append(track_obj)
mkvstrip.py(241):         for track_data in json_data["tracks"]:
mkvstrip.py(242):             track_obj = Track(track_data)
BenjV commented 4 years ago

I can write a python script for you that uses ffmpeg to extract the subtitles from the video. Not that complicated at all.

Something like:
Input: Video file
Output: Video file without subs, plus Dutch and English sub files
And of course skipping the Hearing Impaired subs.

Or I could mux those subs also into the output video.

Rouzax commented 4 years ago

What I want is to remove all embedded audio streams and subtitles that do not match the languages I set (for me EN and NL), while keeping the HI track. Some shows only have the full English subtitles in the HI track, because the normal English track might only cover the Spanish-speaking parts; Narcos, for instance.

I strip out the HI and other crap with SubtitleEdit, so I always end up with a clean English and Dutch subtitle

Input: Video file
Embedded audio: eng, dut, ger
Embedded subs: eng, dut, ger

Output: Video file
Embedded audio: eng, dut
Embedded subs: (eng, dut) or (none)
Extracted SRT: eng, dut
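As a rough sketch of the remux half of this, the function below only builds an mkvmerge command line from track IDs that were chosen elsewhere; it does not run anything. `build_remux_command` is a hypothetical helper name. The options used (`--output`, `--audio-tracks`, `--subtitle-tracks`, `--no-subtitles`) are mkvmerge's own and have to come before the source file they apply to.

```python
def build_remux_command(source, target, audio_ids, subtitle_ids, mkvmerge="mkvmerge"):
    """Build an mkvmerge command that copies only the listed track IDs.

    An empty audio_ids list leaves all audio in place (mkvmerge's default).
    An empty subtitle_ids list drops every subtitle track.
    """
    command = [mkvmerge, "--output", str(target)]
    if audio_ids:
        command += ["--audio-tracks", ",".join(str(i) for i in audio_ids)]
    if subtitle_ids:
        command += ["--subtitle-tracks", ",".join(str(i) for i in subtitle_ids)]
    else:
        command.append("--no-subtitles")
    command.append(str(source))  # per-file options must precede the file name
    return command
```

The resulting list can be passed to `subprocess.run` as is, which avoids quoting problems with spaces or special characters in file names.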

BenjV commented 4 years ago

OK, I can do that. Do you want to keep the original input file, for example renamed with an .old extension, or shall I delete it?

Rouzax commented 4 years ago

Delete it

Rouzax commented 4 years ago

I think I know why the python script does nothing on the Tehran episode. The only Audio Track is Hebrew so it will skip removing the subs.

C:\TEMP\Torrent\PROCD>"C:\Python37\python.exe" %mkvstrip% -b %MKVMergeLocation% -v -l eng,dut -s eng,dut -r Forced -t C:\TEMP\Torrent\PROCD\TV\1
Searching for MKV files to process.
Warning: This may take some time...
Checking C:\TEMP\Torrent\PROCD\TV\1\Tehran.S01E01.Emergency.Landing.in.Tehran.1080p.ATVP.WEB-DL.DDP5.1.H.264-NTb.mkv
REMOVE:  Track #1: heb - E-AC-3 - Name:None - Forced:False
REMOVE:  Track #2: heb - SubRip/SRT - Name:Forced - Forced:True
REMOVE:  Track #3: ara - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #4: bul - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #5: chi - SubRip/SRT - Name:Simplified Mandarin - Forced:False
REMOVE:  Track #6: chi - SubRip/SRT - Name:Traditional Mandarin - Forced:False
REMOVE:  Track #7: cze - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #8: dan - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #9: ger - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #10: gre - SubRip/SRT - Name:None - Forced:False
KEEP:  Track #11: eng - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #12: spa - SubRip/SRT - Name:Latin America - Forced:False
REMOVE:  Track #13: spa - SubRip/SRT - Name:Spain - Forced:False
REMOVE:  Track #14: est - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #15: fin - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #16: fre - SubRip/SRT - Name:Canada - Forced:False
REMOVE:  Track #17: fre - SubRip/SRT - Name:France - Forced:False
REMOVE:  Track #18: heb - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #19: heb - SubRip/SRT - Name:SDH - Forced:False
REMOVE:  Track #20: hin - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #21: hun - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #22: ind - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #23: ita - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #24: jpn - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #25: kor - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #26: lit - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #27: lav - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #28: may - SubRip/SRT - Name:None - Forced:False
KEEP:  Track #29: dut - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #30: nor - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #31: pol - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #32: por - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #33: por - SubRip/SRT - Name:Brazil - Forced:False
REMOVE:  Track #34: rus - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #35: slo - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #36: slv - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #37: swe - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #38: tam - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #39: tel - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #40: tha - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #41: tur - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #42: ukr - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #43: vie - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #44: chi - SubRip/SRT - Name:Cantonese - Forced:False
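The keep/remove decision in a listing like this can be sketched as a plain function over the track entries of mkvmerge's JSON output, using the same keys that appear in the trace earlier in the thread (`type`, `id`, `properties.language`). `plan_tracks` is a hypothetical name, and the safeguard at the end is an assumption made for this sketch, not confirmed mkvstrip behaviour: if no audio track matches the wanted languages, all audio is kept rather than producing a silent file, while the subtitle filter is still applied.

```python
def plan_tracks(tracks, audio_langs, subtitle_langs):
    """Split mkvmerge-style track dicts into (keep, remove) lists by language."""
    wanted = {"audio": set(audio_langs), "subtitles": set(subtitle_langs)}
    keep, remove = [], []
    for track in tracks:
        language = track["properties"].get("language", "und")
        allowed = wanted.get(track["type"])
        if allowed is None or language in allowed:
            keep.append(track)  # video and other track types are always kept
        else:
            remove.append(track)

    # Assumed safeguard: never strip every audio track from the file.
    if not any(track["type"] == "audio" for track in keep):
        keep += [track for track in remove if track["type"] == "audio"]
        remove = [track for track in remove if track["type"] != "audio"]
    return keep, remove
```

For a file like the Tehran episode, with Hebrew as the only audio language, this keeps the Hebrew audio and still removes every subtitle track that is not eng or dut.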
Rouzax commented 4 years ago

This is an example where the Forced EN subtitles are only for the non English parts and I actually need the SDH ones https://partnerhelp.netflixstudios.com/hc/en-us/articles/224198488-What-is-a-Forced-Narrative-Subtitle-

Remuxing: The.Boys.S02E08.What.I.Know.2160p.AMZN.WEBRip.DDP5.1.x265-NTb.mkv
Title: None
============================
Retaining subtitle track(s):
    Track #3: eng - SubRip/SRT - Name:SDH - Forced:False
    Track #20: dut - SubRip/SRT - Name:None - Forced:False
Removing subtitle track(s):
    Track #2: eng - SubRip/SRT - Name:Forced - Forced:True
    Track #4: ara - SubRip/SRT - Name:None - Forced:False
    Track #5: dan - SubRip/SRT - Name:None - Forced:False
    Track #6: ger - SubRip/SRT - Name:None - Forced:False
    Track #7: spa - SubRip/SRT - Name:Latinoamérica - Forced:False
    Track #8: spa - SubRip/SRT - Name:España - Forced:False
    Track #9: fin - SubRip/SRT - Name:None - Forced:False
    Track #10: fil - SubRip/SRT - Name:None - Forced:False
    Track #11: fre - SubRip/SRT - Name:None - Forced:False
    Track #12: heb - SubRip/SRT - Name:None - Forced:False
    Track #13: hin - SubRip/SRT - Name:None - Forced:False
    Track #14: ind - SubRip/SRT - Name:None - Forced:False
    Track #15: ita - SubRip/SRT - Name:None - Forced:False
    Track #16: jpn - SubRip/SRT - Name:None - Forced:False
    Track #17: kor - SubRip/SRT - Name:None - Forced:False
    Track #18: may - SubRip/SRT - Name:None - Forced:False
    Track #19: nor - SubRip/SRT - Name:Norsk Bokmål - Forced:False
    Track #21: pol - SubRip/SRT - Name:None - Forced:False
    Track #22: por - SubRip/SRT - Name:Brasil - Forced:False
    Track #23: por - SubRip/SRT - Name:Portugal - Forced:False
    Track #24: rus - SubRip/SRT - Name:None - Forced:False
    Track #25: swe - SubRip/SRT - Name:None - Forced:False
    Track #26: tam - SubRip/SRT - Name:None - Forced:False
    Track #27: tel - SubRip/SRT - Name:None - Forced:False
    Track #28: tha - SubRip/SRT - Name:None - Forced:False
    Track #29: tur - SubRip/SRT - Name:None - Forced:False
BenjV commented 4 years ago

OK, I will extract Dutch, German and English subtitles. If there are no normal English subs, then I will extract the SDH subtitles. Extract nothing if none of the above is present. Create a video file with a video stream, an English, Dutch or German audio stream, and no subtitle stream in the container.

Be aware that the stream identifiers are set by the creator of the video file, and that they are sometimes sloppy or just use names other than the ISO identifiers.

Also, that example you gave is very strange: it has an English sub that covers just a small part of the movie, and an SDH English sub for the whole movie, but that SDH is actually a normal subtitle. There is no way a script can anticipate such strange configurations.

Rouzax commented 4 years ago

I don't want German 😀. That is indeed sometimes a bit irritating. That is why I throw away the subs that have the name Forced, as that is 99% of the time only the English translation for foreign speech. When using the SDH track I can strip out all the HI and other stuff with Subtitle Edit and have two clean and perfect subtitles to my liking 😀.

BenjV commented 4 years ago

Forced subtitles are used for situations where you watch a video without subtitles, for example in English, and somebody speaks a few lines in French. For that French part they use a forced English subtitle, so only that part is subtitled.

Rouzax commented 4 years ago

Correct, that is why I strip out the Forced sub, we like to have English subs for all the spoken parts.
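The preference discussed here (drop Forced tracks, prefer a normal track, fall back to SDH) could be written down roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, tracks are dicts shaped like mkvmerge's JSON track entries, and a track counts as forced or SDH based on its `forced_track` flag or on its track name, which, as noted above, the creator of the file may have filled in sloppily.

```python
def is_forced(track):
    """True if the track is flagged as forced or has 'forced' in its name."""
    properties = track["properties"]
    name = (properties.get("track_name") or "").lower()
    return bool(properties.get("forced_track")) or "forced" in name


def is_sdh(track):
    """True if the track name suggests a hearing impaired (SDH) subtitle."""
    name = (track["properties"].get("track_name") or "").lower()
    return "sdh" in name or "hearing" in name


def pick_subtitle(tracks, language):
    """Return the preferred subtitle track for language, or None if there is none."""
    candidates = [
        track for track in tracks
        if track["type"] == "subtitles"
        and track["properties"].get("language", "und") == language
        and not is_forced(track)
    ]
    # False sorts before True, so normal tracks come first and SDH is the fallback.
    candidates.sort(key=is_sdh)
    return candidates[0] if candidates else None
```

For the The Boys listing above, this would pick track #3 (SDH) for eng, because the only other English track is the Forced one, and track #20 for dut.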

medariox commented 4 years ago

Request to be added to Babelfish: https://github.com/Diaoul/babelfish