morpheus65535 / bazarr

Bazarr is a companion application to Sonarr and Radarr. It manages and downloads subtitles based on your requirements. You define your preferences by TV show or movie and Bazarr takes care of everything for you.
https://www.bazarr.media
GNU General Public License v3.0

Detecting Language of embedded subtitles when language metadata is not set #2007

Closed Dnkhatri closed 1 year ago

Dnkhatri commented 1 year ago

Describe the bug: I have come across files that have multiple internal subtitles, but the language metadata is not set and they show up as unknown in most players. The embedded subtitles downloader is also unable to extract such subs.

To Reproduce: mediainfo output for such a file:

`General
Complete name : test.mp4
Format : MPEG-4
Format profile : Base Media
Codec ID : isom (isom/iso2/avc1/mp41)
File size : 645 MiB
Duration : 41 min 20 s
Overall bit rate mode : Variable
Overall bit rate : 2 180 kb/s
Writing application : Lavf58.22.100

Video
ID : 1
Format : AVC
Format/Info : Advanced Video Codec
Format profile : High@L4
Format settings : CABAC / 3 Ref Frames
Format settings, CABAC : Yes
Format settings, Reference frames : 3 frames
Codec ID : avc1
Codec ID/Info : Advanced Video Coding
Duration : 41 min 19 s
Bit rate : 1 983 kb/s
Width : 1 920 pixels
Height : 1 080 pixels
Display aspect ratio : 16:9
Frame rate mode : Constant
Frame rate : 25.000 FPS
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 8 bits
Scan type : Progressive
Bits/(Pixel*Frame) : 0.038
Stream size : 586 MiB (91%)
Color range : Limited
Color primaries : BT.709
Transfer characteristics : BT.709
Matrix coefficients : BT.709
Codec configuration box : avcC

Audio
ID : 2
Format : AAC LC
Format/Info : Advanced Audio Codec Low Complexity
Codec ID : mp4a-40-2
Duration : 41 min 20 s
Bit rate mode : Variable
Bit rate : 192 kb/s
Maximum bit rate : 205 kb/s
Channel(s) : 2 channels
Channel layout : L R
Sampling rate : 44.1 kHz
Frame rate : 43.066 FPS (1024 SPF)
Compression mode : Lossy
Stream size : 56.1 MiB (9%)
Default : Yes
Alternate group : 1

Text #1
ID : 3
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 37 min 30 s
Bit rate mode : Variable
Bit rate : 65 b/s
Stream size : 17.8 KiB (0%)
Title : 繁體中文
Default : Yes
Forced : No
Alternate group : 3

Text #2
ID : 4
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 37 min 30 s
Bit rate mode : Variable
Bit rate : 76 b/s
Stream size : 20.8 KiB (0%)
Title : 馬來語
Default : No
Forced : No
Alternate group : 3

Text #3
ID : 5
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 41 min 12 s
Bit rate mode : Variable
Bit rate : 69 b/s
Stream size : 20.7 KiB (0%)
Title : 英語
Default : No
Forced : No
Alternate group : 3

Text #4
ID : 6
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 41 min 12 s
Bit rate mode : Variable
Bit rate : 188 b/s
Stream size : 56.8 KiB (0%)
Title : 泰語
Default : No
Forced : No
Alternate group : 3

Text #5
ID : 7
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 37 min 30 s
Bit rate mode : Variable
Bit rate : 65 b/s
Stream size : 17.8 KiB (0%)
Title : 簡體中文
Default : No
Forced : No
Alternate group : 3

`

Expected behavior: Bazarr should detect the language of these subtitles and extract them. If you are wondering how: I came across the Python script linked below, and when I run it manually against the extracted subtitles, it is able to detect their language.

https://github.com/mdcollins05/srt-lang-detect
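To illustrate what content-based detection involves, here is a minimal sketch. It is not what srt-lang-detect does; it swaps in a much cruder heuristic that only reports the dominant Unicode script of the subtitle text, and the function name `dominant_script` is made up for this example. It can tell Thai from Latin-script text, but it cannot tell English from Spanish or Malay, and it still has to read every cue of an already extracted file.

```python
import re
import unicodedata
from collections import Counter

# SRT cue numbers and timestamp lines carry no language information.
_SRT_NOISE = re.compile(r"^\d+$|^\d{2}:\d{2}:\d{2},\d{3} --> ")


def dominant_script(srt_text):
    """Return the most common script prefix among letters (e.g. 'THAI'), or None."""
    counts = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        if not line or _SRT_NOISE.match(line):
            continue
        for ch in line:
            if ch.isalpha():
                # Character names start with the script, e.g. 'THAI CHARACTER ...'
                # or 'LATIN SMALL LETTER ...'; keep only that first word.
                name = unicodedata.name(ch, "")
                if name:
                    counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None
```

A real detector needs a statistical language model on top of this, which is where the CPU cost comes from.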

TRaSH- commented 1 year ago

Personally, I would set up a regex to block release groups (rlsgrp) that release stuff like that.

But it might still be a nice addition: when the language isn't set by the lame rlsgrp, Bazarr would try to detect it. The reason I call them lame is that any proper rlsgrp would name and tag their stuff properly.

Dnkhatri commented 1 year ago

Yeah, but when it comes to Chinese/Asian shows it is hard to get files without hardsubs, so a group that releases softsubs is a rarity and you have to be happy with what you get. Some groups don't even set the language tag for the audio, or it is set incorrectly to English.

morpheus65535 commented 1 year ago

I have no plan to compensate for botched releases with CPU-intensive patches on Bazarr's side. This is the expected behavior, not a bug. If you want a new feature, file a feature request here: https://bazarr.featureupvote.com/

Dnkhatri commented 1 year ago

Turns out I was wrong: these are not broken releases but actually an older subtitle format, MPEG-4 Part 17 (https://en.wikipedia.org/wiki/MPEG-4_Part_17). 英語 means English. I am mentioning it here because it probably means a non-CPU-intensive way to detect the subtitle language can be found before the subtitles need to be extracted.
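As a rough illustration of how cheap a title-based lookup is compared with analysing subtitle text, the sketch below maps the five Chinese-language track titles from the first mediainfo dump to ISO 639-1 codes. The table and the function name are hypothetical and are not part of Bazarr; both Chinese variants are mapped to plain `zh` here, so telling Traditional from Simplified is left out.

```python
# Hypothetical lookup table: the track titles seen in the first mediainfo dump
# of this thread, mapped to ISO 639-1 codes.
CHINESE_TITLE_TO_LANG = {
    "繁體中文": "zh",  # Traditional Chinese
    "簡體中文": "zh",  # Simplified Chinese
    "英語": "en",      # English
    "馬來語": "ms",    # Malay
    "泰語": "th",      # Thai
}


def lang_from_chinese_title(title):
    """Return an ISO 639-1 code for a known Chinese-language title, else None."""
    return CHINESE_TITLE_TO_LANG.get(title.strip())
```

The lookup is a single dictionary access per track, so no subtitle needs to be extracted or read.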

Dnkhatri commented 1 year ago

@morpheus65535 @TRaSH- Hi, I know you disagree with my first suggestion because the method is CPU intensive, so I have come up with another one. After days of trying, I was not able to get the subtitle title metadata using ffprobe; it seems ffprobe does not parse it. So, just as enzyme is called when ffprobe is unable to parse the subtitle language of an MKV file, I think we should call mediainfo to check whether the title of each subtitle track matches a language name in the ISO 639 list. If it does, Bazarr can detect the subtitle and its language, which would make this far less CPU intensive.

I know that won't work with the mediainfo output in my original post, but it will still work with files like the one below, where the titles are in English. The ideal scenario would be to add the Chinese language names to the ISO 639 table that Bazarr uses so that Chinese names are compared as well, but I would be happy with the English names alone. So far I have come across files in these 2 languages only. Sadly, I can't program for shit to submit a patch myself.

`General
Complete name : test.mp4
Format : MPEG-4
Format profile : Base Media
Codec ID : isom (isom/iso2/avc1/mp41)
File size : 612 MiB
Duration : 47 min 45 s
Overall bit rate mode : Variable
Overall bit rate : 1 792 kb/s
Writing application : Lavf58.22.100

Video
ID : 1
Format : AVC
Format/Info : Advanced Video Codec
Format profile : High@L4
Format settings : CABAC / 3 Ref Frames
Format settings, CABAC : Yes
Format settings, Reference frames : 3 frames
Codec ID : avc1
Codec ID/Info : Advanced Video Coding
Duration : 47 min 45 s
Bit rate : 1 591 kb/s
Width : 1 920 pixels
Height : 800 pixels
Display aspect ratio : 2.40:1
Frame rate mode : Constant
Frame rate : 25.000 FPS
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 8 bits
Scan type : Progressive
Bits/(Pixel*Frame) : 0.041
Stream size : 544 MiB (89%)
Color range : Limited
Color primaries : BT.709
Transfer characteristics : BT.709
Matrix coefficients : BT.709
Codec configuration box : avcC

Audio
ID : 2
Format : AAC LC
Format/Info : Advanced Audio Codec Low Complexity
Codec ID : mp4a-40-2
Duration : 47 min 45 s
Bit rate mode : Variable
Bit rate : 192 kb/s
Maximum bit rate : 259 kb/s
Channel(s) : 2 channels
Channel layout : L R
Sampling rate : 44.1 kHz
Frame rate : 43.066 FPS (1024 SPF)
Compression mode : Lossy
Stream size : 65.6 MiB (11%)
Default : Yes
Alternate group : 1

Text #1
ID : 3
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 45 min 10 s
Bit rate mode : Variable
Bit rate : 99 b/s
Frame rate : 0.759 FPS
Stream size : 32.9 KiB (0%)
Title : English
Default : Yes
Forced : No
Alternate group : 3

Text #2
ID : 4
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 45 min 10 s
Bit rate mode : Variable
Bit rate : 100 b/s
Frame rate : 0.674 FPS
Stream size : 33.0 KiB (0%)
Title : Arabic
Default : No
Forced : No
Alternate group : 3

Text #3
ID : 5
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 45 min 10 s
Bit rate mode : Variable
Bit rate : 113 b/s
Frame rate : 0.759 FPS
Stream size : 37.4 KiB (0%)
Title : Korean
Default : No
Forced : No
Alternate group : 3

Text #4
ID : 6
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 45 min 10 s
Bit rate mode : Variable
Bit rate : 67 b/s
Frame rate : 0.674 FPS
Stream size : 22.1 KiB (0%)
Title : Traditional Chinese
Default : No
Forced : No
Alternate group : 3

Text #5
ID : 7
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 45 min 10 s
Bit rate mode : Variable
Bit rate : 102 b/s
Frame rate : 0.759 FPS
Stream size : 33.6 KiB (0%)
Title : Bahasa Malaysia
Default : No
Forced : No
Alternate group : 3

Text #6
ID : 8
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 45 min 10 s
Bit rate mode : Variable
Bit rate : 67 b/s
Frame rate : 0.674 FPS
Stream size : 22.1 KiB (0%)
Title : Simplified Chinese
Default : No
Forced : No
Alternate group : 3

Text #7
ID : 9
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 45 min 10 s
Bit rate mode : Variable
Bit rate : 123 b/s
Frame rate : 0.753 FPS
Stream size : 40.8 KiB (0%)
Title : Vietnamese
Default : No
Forced : No
Alternate group : 3

Text #8
ID : 10
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 45 min 10 s
Bit rate mode : Variable
Bit rate : 77 b/s
Frame rate : 0.674 FPS
Stream size : 25.4 KiB (0%)
Title : Spanish
Default : No
Forced : No
Alternate group : 3

Text #9
ID : 11
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 45 min 10 s
Bit rate mode : Variable
Bit rate : 106 b/s
Frame rate : 0.758 FPS
Stream size : 35.1 KiB (0%)
Title : Bahasa Indonesia
Default : No
Forced : No
Alternate group : 3

Text #10
ID : 12
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 45 min 10 s
Bit rate mode : Variable
Bit rate : 228 b/s
Frame rate : 0.760 FPS
Stream size : 75.3 KiB (0%)
Title : Thai
Default : No
Forced : No
Alternate group : 3

Text #11
ID : 13
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 45 min 10 s
Bit rate mode : Variable
Bit rate : 101 b/s
Frame rate : 0.674 FPS
Stream size : 33.4 KiB (0%)
Title : Japanese
Default : No
Forced : No
Alternate group : 3

`
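For anyone who wants to try the mediainfo route by hand, here is a small sketch of pulling the Text track titles out of mediainfo's JSON report. The helper names are made up, and the JSON layout (`media` → `track` list, `@type`, `Title`) is an assumption based on how mediainfo's JSON output is generally structured, so check it against the version you have installed.

```python
import json
import subprocess


def read_mediainfo_json(path):
    """Run the mediainfo CLI (must be installed) and return its JSON report."""
    result = subprocess.run(
        ["mediainfo", "--Output=JSON", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def text_track_titles(report_json):
    """Return (position, title) pairs for Text tracks, in report order.

    position is 0-based; title is None when the track has no Title field.
    """
    media = json.loads(report_json).get("media") or {}
    text_tracks = [t for t in media.get("track", []) if t.get("@type") == "Text"]
    return [(i, t.get("Title")) for i, t in enumerate(text_tracks)]
```

Usage would be `text_track_titles(read_mediainfo_json("test.mp4"))`; the titles can then be compared against a language-name table.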

morpheus65535 commented 1 year ago

Can you send me a sample file by PM on Discord so I can take a look at it? I have no plan to include another video file parser with Bazarr, but I'll see what I can do with ffprobe.

Dnkhatri commented 1 year ago

Thanks for looking into this. I have sent you a link on Discord. I was not able to get ffprobe to read the data that mediainfo was showing.

morpheus65535 commented 1 year ago

The main issue here was that embedded subtitles without language code in metadata were ignored by Bazarr.

I've decided to add an option to use mediainfo (if desired AND already installed) as the embedded subtitles parser. It will try to match the subtitle track title to an existing language name in plain English text only. It won't be able to match language names spelled in a foreign language (e.g. Chinese characters).

With the media file you provided, it indexed 7 additional embedded subtitles. Give it a try and let me know what you think about this.

You'll see the option in the upcoming beta under Settings --> Subtitles.
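A minimal sketch of English-only title matching, to show what this kind of lookup can and cannot catch. This is not Bazarr's actual code; the table is a small hand-picked subset of English language names and the function name is hypothetical. It is run against the eleven track titles from the second mediainfo dump in this thread.

```python
# Illustrative subset of English language names mapped to ISO 639-1 codes.
ENGLISH_NAME_TO_LANG = {
    "arabic": "ar", "chinese": "zh", "english": "en", "indonesian": "id",
    "japanese": "ja", "korean": "ko", "malay": "ms", "spanish": "es",
    "thai": "th", "vietnamese": "vi",
}


def lang_from_title(title):
    """Exact, case-insensitive match of a track title against English names."""
    if not title:
        return None
    return ENGLISH_NAME_TO_LANG.get(title.strip().casefold())


titles = [
    "English", "Arabic", "Korean", "Traditional Chinese", "Bahasa Malaysia",
    "Simplified Chinese", "Vietnamese", "Spanish", "Bahasa Indonesia",
    "Thai", "Japanese",
]
matched = [t for t in titles if lang_from_title(t)]
unmatched = [t for t in titles if not lang_from_title(t)]
```

With exact matching, seven of the eleven titles resolve, and "Traditional Chinese", "Simplified Chinese", "Bahasa Malaysia" and "Bahasa Indonesia" do not. That may be why 7 additional tracks were indexed, though the thread does not show the real matching logic.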

Dnkhatri commented 1 year ago

I am getting this error:

BAZARR Error ('mediainfo') trying to get video information for this file: /home/daniyal/Media/Asian Dramas/First Love/S01E01 - Episode 1.mp4

Traceback (most recent call last):
  File "/opt/bazarr/bazarr/subtitles/utils.py", line 47, in get_video
    refiner(original_path, video)
  File "/opt/bazarr/bazarr/subtitles/refiners/ffprobe.py", line 35, in refine_from_ffprobe
    if not data['ffprobe'] or data['mediainfo']:
KeyError: 'mediainfo'

morpheus65535 commented 1 year ago

You must run a disk scan for this series or movie first. I'm working on adding a failsafe that refreshes the cache if the desired embedded subtitles parser isn't already present in the database.
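To show why the error appears only for files cached before the new option existed, here is a small sketch. The cache shape and the helper name `needs_refresh` are assumptions for illustration and are not the actual fix: indexing `data['mediainfo']` raises `KeyError` when the cached entry only has an `ffprobe` key, while `dict.get` returns `None` and lets the caller decide to rescan.

```python
def needs_refresh(cached, parser):
    """True when the cache holds no usable result for the chosen parser.

    parser is 'ffprobe' or 'mediainfo'; cached is the per-file cache dict.
    """
    return not cached.get(parser)


# Hypothetical cache entry written before the mediainfo option was enabled.
cache_before_rescan = {"ffprobe": {"subtitle": [{"language": None}]}}

# Hypothetical entry after a disk scan with mediainfo selected.
cache_after_rescan = {"ffprobe": {}, "mediainfo": {"subtitle": [{"name": "English"}]}}
```

`needs_refresh(cache_before_rescan, "mediainfo")` is true, which is the case a failsafe would have to handle by refreshing the cache instead of raising.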

morpheus65535 commented 1 year ago

Should be fixed in upcoming beta.

morpheus65535 commented 1 year ago

@Dnkhatri working as expected?

Dnkhatri commented 1 year ago

Sorry, I was busy with year-end stuff. Yes, they are being detected, though I hope the option will be moved out from under the "Use embedded subtitles in media files when determining missing ones" setting, as I would like the embedded provider to extract them.

morpheus65535 commented 1 year ago

@Dnkhatri as far as I know, this is not possible: the embedded subtitles provider uses ffmpeg to extract the track, whereas mediainfo, as its name clearly states, is only able to show information about tracks and not extract them. @vitiko98 am I wrong?

Dnkhatri commented 1 year ago

ffmpeg can extract them; it just does not know the language. So if mediainfo gives the track info (i.e. track number and language) to Bazarr, it should be possible to extract with ffmpeg as normal. That's what I am doing manually at the moment: using ffmpeg to extract, and then using the Python script I mentioned in the opening post.

morpheus65535 commented 1 year ago

@Dnkhatri how do you match the subtitle track numbers between both tools?

Dnkhatri commented 1 year ago

@morpheus65535 The subtitles are not in random order; they are numbered. For example, in the mediainfo output below, Text #1 corresponds to the first subtitle stream, which is 0 in ffmpeg. So if you extract subtitle streams 0, 1, 2, 3 with ffmpeg, they will be: 0 Traditional Chinese, 1 Thai, 2 Bahasa Malaysia, etc.

Text #1
ID : 3
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 40 min 43 s
Bit rate mode : Variable
Bit rate : 46 b/s
Frame rate : 0.450 FPS
Stream size : 13.6 KiB (0%)
Title : Traditional Chinese
Default : Yes
Forced : No
Alternate group : 3

Text #2
ID : 4
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 40 min 43 s
Bit rate mode : Variable
Bit rate : 157 b/s
Frame rate : 0.489 FPS
Stream size : 46.9 KiB (0%)
Title : Thai
Default : No
Forced : No
Alternate group : 3

Text #3
ID : 5
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 40 min 43 s
Bit rate mode : Variable
Bit rate : 70 b/s
Frame rate : 0.489 FPS
Stream size : 21.0 KiB (0%)
Title : Bahasa Malaysia
Default : No
Forced : No
Alternate group : 3

Text #4
ID : 6
Format : Timed Text
Muxing mode : sbtl
Codec ID : tx3g
Duration : 40 min 43 s
Bit rate mode : Variable
Bit rate : 70 b/s
Frame rate : 0.450 FPS
Stream size : 20.8 KiB (0%)
Title : Japanese
Default : No
Forced : No
Alternate group
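The numbering rule described above can be sketched as a tiny helper that turns a mediainfo "Text #N" number into an ffmpeg command. The function name is made up, and the sketch assumes both tools list subtitle tracks in the same order, which is what the manual workflow in this thread relies on but is not verified here for every container.

```python
def extraction_command(video_path, text_number, out_path):
    """Build an ffmpeg command for mediainfo's 'Text #<text_number>' track."""
    if text_number < 1:
        raise ValueError("mediainfo numbers Text tracks from 1")
    stream_index = text_number - 1  # ffmpeg counts subtitle streams from 0
    # '-map 0:s:N' selects the N-th subtitle stream of the first input file.
    return ["ffmpeg", "-i", video_path, "-map", f"0:s:{stream_index}", out_path]
```

For the listing above, `extraction_command("test.mp4", 2, "thai.srt")` selects stream `0:s:1`, the Thai track. The list can be passed to `subprocess.run` as-is.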

morpheus65535 commented 1 year ago

@Dnkhatri ok thanks for this. I've never really tried to understand how it worked since we were only using one tool.