mdcollins05 / srt-lang-detect

A tool to detect the language used and rename SRT subtitle files

VTT support? #6

Open LostOnTheLine opened 2 years ago

LostOnTheLine commented 2 years ago

I haven't tried this yet because most of my subtitles are in VTT format (a much superior format to anything else), but I suspect it should work fairly easily, as VTT is essentially SRT with a bit of HTML markup for formatting.

LostOnTheLine commented 2 years ago

If possible, when handling VTT it would be nice to add the language tag to the file itself, so long as that isn't a lot of work. Some subtitle sources do this, but most don't. Every VTT file starts with "WEBVTT" on the first line, but you can add a space after that (usually " - ") and then a note. This is sometimes used for "WEBVTT - English", "WEBVTT - English CC", "WEBVTT - ENG-CC", "WEBVTT - ENG SDH", or even "WEBVTT - This file is the Closed Captioning from the US DVD". I assume there isn't an easy way to identify a file as CC/SDH, though that would be cool, but having the tool label the language inside the file as well would be nice. My personal preference would be the ISO 639-2/B codes, as in "WEBVTT - ENG", and some programs will actually read them to identify the language. I don't think there's an official format for those tags, though: they are usually at the top, but I've seen them as a second line and even at the bottom of the file.
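A minimal sketch of what writing the tag into the header could look like. Nothing here comes from srtlangdetect.py: the function name tag_vtt_header is hypothetical, the "WEBVTT - CODE" layout is just the convention described above, and the sketch assumes the text was read without a byte-order mark. Note that it overwrites any note already on the first line.

```python
def tag_vtt_header(text, code):
    """Return `text` with its first line rewritten to 'WEBVTT - <code>'.

    Hypothetical helper: any existing note after WEBVTT is replaced.
    Assumes `text` has no leading byte-order mark.
    """
    lines = text.splitlines(keepends=True)
    if not lines or not lines[0].startswith("WEBVTT"):
        raise ValueError("not a WebVTT file: first line must start with WEBVTT")
    # Preserve the file's own line-ending style on the rewritten header.
    newline = "\r\n" if lines[0].endswith("\r\n") else "\n"
    lines[0] = f"WEBVTT - {code}{newline}"
    return "".join(lines)
```

The caller would pass in whatever code the detector produced, for example tag_vtt_header(contents, "ENG"), and write the result back to the file.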

mdcollins05 commented 2 years ago

Is there a naming scheme that is different for a vtt file or is it simply to support a different file extension?

LostOnTheLine commented 2 years ago

The naming is the same; VTT is essentially advanced SRT with HTML formatting. The first line of a VTT file is always "WEBVTT". The subtitles do not need to be numbered as they are in SRT, but numbering is supported if present. Sometimes there will be a line or two under the "WEBVTT" line, such as "Kind: captions" and "Language: en", or the first line will read "WEBVTT - English".
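As a rough illustration of the header variants just described, the sketch below pulls a language hint out of either a "Language: en" line or a "WEBVTT - English" first line. The function name header_language_hint is made up for this example, the returned value is free text that is not validated against any language list, and the input is assumed to have no byte-order mark.

```python
import re


def header_language_hint(text):
    """Return a free-text language hint from a VTT header, or None.

    Hypothetical helper: only looks at lines before the first blank line.
    """
    for index, raw in enumerate(text.splitlines()):
        line = raw.strip()
        if not line:
            break  # a blank line ends the header block
        if index == 0:
            # e.g. "WEBVTT - English" -> "English"
            match = re.match(r"WEBVTT[ \t]+(.*)", line)
            note = match.group(1).lstrip("- \t").strip() if match else ""
            if note:
                return note
        else:
            # e.g. "Language: en" -> "en"
            match = re.match(r"Language:\s*(\S+)", line, re.IGNORECASE)
            if match:
                return match.group(1)
    return None
```

A hint found this way could be compared with the detected language, but since these tags are not standardized it is probably safer to treat it as advisory only.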

I assume that if it works with SRT it should be able to identify the language in the same way. When trying to identify a language I sometimes Ctrl-A, Ctrl-C and then paste the whole text into Google Translate, if that tells you anything. In general the files are named the same way as SRT and SSA, as Title.ENG.vtt or Title.EN.vtt.

WebVTT is actually built into HTML5 as its native subtitle format, so anything that uses HTML5 can inherently understand it.

That's all the information I can think of that might help, but I assume that, at least in simple-mode, if VTT were treated the same as SRT it should work the same as SRT does. I just don't want to have to rename all my files to SRT to detect the language and then rename them all back.

LostOnTheLine commented 2 years ago

If adding that isn't doable because the tool can only handle a single extension, could you tell me where SRT would need to be changed to VTT? I'm willing to edit it myself and just have two separate tools, one for SRT and one for VTT. I just don't know what I'm looking for.

LostOnTheLine commented 2 years ago

So I've edited srtlangdetect.py and replaced every instance of "SRT" with "VTT", which seems to work fine. But as there are a lot of instances, and I'm sure not all of them need to change, I was hoping for a way to support it natively. As it stands now, if anyone else is looking for this, opening the file in a text editor and doing a find-and-replace seems to work.

LostOnTheLine commented 2 years ago

So it looks like copying lines 28-38 and changing the condition in the second copy to if file.endswith(".vtt"): makes it work with both. I haven't tested it extensively, but the HTML formatting may be causing some issues, as I get a lot of files labeled ENG that are not English. I don't know if that's a result of the VTT HTML or if it just finds enough random English words or phrases. Does that happen often with SRT on the default settings?
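Instead of duplicating the block, one extension check can cover both formats, because str.endswith accepts a tuple of suffixes. This is only a sketch of the idea: find_subtitle_files and SUBTITLE_EXTENSIONS are hypothetical names, and the real script's file-walking code may be organized differently.

```python
import os

# Hypothetical constant; add further extensions here if needed.
SUBTITLE_EXTENSIONS = (".srt", ".vtt")


def find_subtitle_files(root):
    """Return sorted paths under `root` that end in a known subtitle extension."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # endswith() with a tuple tests every suffix in a single call;
            # lower() makes the match case-insensitive (e.g. "MOVIE.VTT").
            if name.lower().endswith(SUBTITLE_EXTENSIONS):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```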

mdcollins05 commented 2 years ago

I'm glad you were able to get something working.

As you've discovered, the script just looks at all the contents in the file to determine the language. It no longer parses SRT files specifically. This means we could support multiple file extensions but it will take all the words into account, which could skew the language detection results as you've seen.

I'm torn on whether this tool should support multiple formats. I wrote it for SRTs specifically and I only use SRTs, so I wouldn't have real-world tests. A fork specifically for VTT files may be better, but I'm not set on a fork vs. a PR to add support to this repo. I'll have to give it some thought.

LostOnTheLine commented 2 years ago

No worries. But VTT is the format of the future. Aside from the things I personally like about it, it's natively supported within HTML5. I actually converted from SRT to VTT just a year or so ago, and I haven't gone back, so I'd recommend checking it out. Almost all online services use it: YouTube uses it, even though they are trying to push their own proprietary format, and most streaming services use it. It can do everything that ASS can do except for karaoke follow formatting, and the formatting is basic HTML, not the bloated soup that is SSA.

Anyway, is there any way to have the detection ignore anything in <>? That would be most of it. I figure it's already ignoring the timing data, so it's not too far a stretch to think it could be done... I just don't know.

LostOnTheLine commented 2 years ago

Some examples of files that were detected as ENG:

00:03.513 --> 01:00.949
<font color="#CCCCCC">**** ترجَمّـــــــ ـــــة ****</font>
<font color=#FFFF00># مــ ــحـ  ــمـ ــد حـ ــمــ ــــدى #</font>
QENA-<font color=#C6423D>E</font>g<font color="#808080">y</font>

01:07.137 --> 01:13.137
<font color=#FFFF00>**** الجندي المتخفى ****</font>

01:15.000 --> 01:21.074
قم بالإعلان هنا عن منتجك أو علامتك التجارية
اليوم www.OpenSubtitles.org تواصل معنا 

<font color="#CCCCCC">**** ترجَمّـــــــ ـــــة ****</font>
<font color=#FFFF00># مــ ــحـ  ــمـ ــد حـ ــمــ ــــدى #</font>
QENA-<font color=#C6423D>E</font>g<font color="#808080">y</font>

All the rest is in the Arabic Alphabet.

With languages that use the Latin alphabet, the most common formatting would be things like <i>for italics</i> or <b>for bold</b>. Sometimes the "You're using OpenSubtitles" messages will be at the front and back of the files, but the same applies to their SRT subtitles, so I don't expect that is causing the problem.

I think if there was a way to exclude anything in angle brackets <> it should work better; I'm just not sure how to go about that. Though I suppose if it could detect that, then having it mark the SDH files would probably be possible too.

Anyway, this is the information I have. I'm sharing it in case it helps. I don't expect anything to come of it, just hoping.
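The idea of excluding anything inside <> could be sketched as a pre-processing step that runs before the text is handed to the language detector. This is an assumption about how it might be done, not how srtlangdetect.py works: extract_dialogue, TAG_RE and TIMING_RE are hypothetical names, and the sketch does not handle WebVTT NOTE blocks or "Kind:"/"Language:" header lines.

```python
import re

# Anything between angle brackets, e.g. <i>, </font>, <font color=#FFFF00>.
TAG_RE = re.compile(r"<[^>]*>")
# Cue timing lines, with optional hours and either "." (VTT) or "," (SRT).
TIMING_RE = re.compile(r"^(?:\d+:)?\d{2}:\d{2}[.,]\d{3}\s+-->\s+")


def extract_dialogue(text):
    """Return only the spoken text, one cleaned line per output line."""
    kept = []
    for index, raw in enumerate(text.splitlines()):
        line = raw.strip()
        if not line:
            continue
        if index == 0 and line.startswith("WEBVTT"):
            continue  # file header
        if TIMING_RE.match(line):
            continue  # timing data carries no language information
        if line.isdigit():
            continue  # cue number; this also drops dialogue that is only digits
        cleaned = TAG_RE.sub("", line).strip()
        if cleaned:
            kept.append(cleaned)
    return "\n".join(kept)
```

With the sample above, a line such as QENA-<font color=#C6423D>E</font>g<font color="#808080">y</font> is reduced to QENA-Egy, so the tag names and attributes no longer count as Latin-alphabet words. Short Latin fragments like that one and the www.OpenSubtitles.org advert would still remain in the text.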