isaacbernat / netflix-to-srt

Rip, extract and convert subtitles to .srt closed captions from .xml/dfxp/ttml and .vtt/WebVTT (e.g. Netflix, YouTube)
MIT License
749 stars, 72 forks

New Netflix use of span #26

Closed maresjj closed 4 years ago

maresjj commented 4 years ago

Hi!

For the past few days, Netflix has been using span elements for all lines, including non-italic ones.

This is an example:

<style tts:backgroundColor="transparent" tts:textAlign="center" xml:id="style0"/>
<style tts:color="white" tts:fontSize="100%" tts:fontWeight="normal" xml:id="style1"/>
<style tts:color="white" tts:fontSize="100%" tts:fontStyle="italic" tts:fontWeight="normal" xml:id="style2"/>
</styling>

Or another example:

<style tts:textAlign="right" xml:id="style0"/>
<style tts:color="white" xml:id="style1"/>
<style tts:textAlign="center" xml:id="style2"/>
<style tts:color="magenta" xml:id="style3"/>
<style tts:color="green" xml:id="style4"/>
<style tts:color="yellow" xml:id="style5"/>
<style tts:color="cyan" xml:id="style6"/>
</styling>

With the first example, every line ends up marked as italic; the second example produces these lines once converted to srt:

00:06:02,958 --> 00:06:04,333
<i>-No podemos por ahí.</i>
<span style="style5">¿Por?</i>

62
00:06:06,292 --> 00:06:08,667
<i>¡Es por aquí!</i>
<span style="style1">Tenemos que rodearla.</i>

63
00:06:09,708 --> 00:06:11,292
<i>¿Y cuánto se tarda?</i>
<span style="style1">Media hora.</i>

Thanks

isaacbernat commented 4 years ago

Hi @maresjj, thanks for reporting. A quick fix would be to ignore span if style attribute starts as " (and not add the closing italics tag there either). But could you add a small, unprocessed example that I could run to_srt.py on to get the same output you mention? I want to know if I am missing something (alternatively, you could add a test file and create a Pull Request).

maresjj commented 4 years ago

Hi,

This is the first example:

styling node:

<style tts:backgroundColor="transparent" tts:textAlign="center" xml:id="style0"/>
<style tts:color="white" tts:fontSize="100%" tts:fontWeight="normal" xml:id="style1"/>
<style tts:color="white" tts:fontSize="100%" tts:fontStyle="italic" tts:fontWeight="normal" xml:id="style2"/>

raw xml:

<p begin="60060000t" end="90090000t" region="region0" style="style0" tts:extent="80.00% 80.00%" tts:origin="10.00% 10.00%" xml:id="subtitle0"><span style="style1">UNA SERIE ORIGINAL DE NETFLIX</span></p>
<p begin="253586667t" end="285702084t" region="region0" style="style0" tts:extent="80.00% 80.00%" tts:origin="10.00% 10.00%" xml:id="subtitle1"><span style="style1">ANDALUCÍA, ESPAÑA, ACTUALIDAD</span></p>
<p begin="550967084t" end="573072500t" region="region0" style="style0" tts:extent="80.00% 80.00%" tts:origin="10.00% 10.00%" xml:id="subtitle2"><span style="style2">Toda la vida he soñado con la muerte.</span></p>
<p begin="581831250t" end="606022084t" region="region0" style="style0" tts:extent="80.00% 80.00%" tts:origin="10.00% 10.00%" xml:id="subtitle3"><span style="style2">Abandono mi cuerpo y me veo desde arriba.</span></p>

script output:

1
00:00:06,006 --> 00:00:09,009
<i>UNA SERIE ORIGINAL DE NETFLIX</i>

2
00:00:25,358 --> 00:00:28,570
<i>ANDALUCÍA, ESPAÑA, ACTUALIDAD</i>

3
00:00:55,096 --> 00:00:57,307
<i>Toda la vida he soñado con la muerte.</i>

4
00:00:58,183 --> 00:01:00,602
<i>Abandono mi cuerpo y me veo desde arriba.</i>
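
The timestamps in the output above are consistent with the begin/end values being ticks at 10,000,000 per second (60060000t gives 6.006 s), with the milliseconds truncated. A minimal sketch of that conversion follows. The tick rate is an assumption inferred from these examples (a TTML file declares it in its ttp:tickRate attribute, which is not shown here), and the helper names are hypothetical, not taken from to_srt.py:

```python
def parse_tick_value(value):
    """Turn a TTML tick expression such as '60060000t' into an integer."""
    return int(value.rstrip("t"))


def ticks_to_srt_time(ticks, tick_rate=10_000_000):
    """Format a tick count as an SRT timestamp (HH:MM:SS,mmm).

    tick_rate is assumed to be 10,000,000 ticks per second, which
    matches the examples in this thread; milliseconds are truncated.
    """
    total_ms = ticks * 1000 // tick_rate
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1000)
    return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, minutes, seconds, ms)


print(ticks_to_srt_time(parse_tick_value("60060000t")))     # 00:00:06,006
print(ticks_to_srt_time(parse_tick_value("22826136667t")))  # 00:38:02,613
```

Integer arithmetic is used on purpose so that long timestamps are not affected by floating-point rounding.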

line with italic and non-italic spans on the same line:

<p begin="22826136667t" end="22847407917t" region="region0" style="style0" tts:extent="80.00% 80.00%" tts:origin="10.00% 10.00%" xml:id="subtitle435"><span style="style1">Hay fiestón. Una </span><span style="style2">rave</span><span style="style1"> en la cárcel.</span></p>
436
00:38:02,613 --> 00:38:04,740
<i>Hay fiestón. Una </i><span style="style2">rave</i><i> en la cárcel.</i>

This is the second example:

styling node:

<style tts:textAlign="right" xml:id="style0"/>
<style tts:color="white" xml:id="style1"/>
<style tts:textAlign="center" xml:id="style2"/>
<style tts:color="magenta" xml:id="style3"/>
<style tts:color="green" xml:id="style4"/>
<style tts:color="yellow" xml:id="style5"/>
<style tts:color="cyan" xml:id="style6"/>

raw xml:

<p begin="1362920000t" end="1376670000t" region="region1" style="style2" tts:extent="80.00% 5.33%" tts:origin="10.00% 84.67%" xml:id="subtitle10"><span style="style3">¡Dejad de moverle!</span></p>
<p begin="1397920000t" end="1413750000t" region="region1" style="style2" tts:extent="80.00% 5.33%" tts:origin="10.00% 84.67%" xml:id="subtitle11"><span style="style4">¡No mováis la valla!</span></p>
<p begin="1416670000t" end="1436250000t" region="region1" style="style2" tts:extent="80.00% 10.66%" tts:origin="10.00% 79.34%" xml:id="subtitle12"><span style="style1">(EN INGLÉS) ¡Soy refugiado!</span><br/><span style="style1">¡Soy refugiado!</span></p>

script output:

11
00:02:16,292 --> 00:02:17,667
<i>¡Dejad de moverle!</i>

12
00:02:19,792 --> 00:02:21,375
<i>¡No mováis la valla!</i>

13
00:02:21,667 --> 00:02:23,625
<i>(EN INGLÉS) ¡Soy refugiado!</i>
<i>¡Soy refugiado!</i>

I tried to fix it by checking which styles are italic (via "tts:fontStyle") and building a string with all the italic styles, to use in a regex that decides which lines get <i> tags and which do not. The problem I found is with lines that contain both kinds of span, italic and non-italic, which come out wrong:

code:

styles = collection.getElementsByTagName("style")

# Keep the last character of each style id (e.g. "2" for "style2"),
# split by whether the style sets tts:fontStyle="italic".
italics = []
non_italics = []

for style in styles:
    style_digit = style.getAttribute("xml:id")[-1]
    if style.getAttribute("tts:fontStyle") == "italic":
        italics.append(style_digit)
    else:
        non_italics.append(style_digit)

# Strings of digits used to build the regexes span_start_re / span_start_re_not
italics_str = "".join(italics)
non_italics_str = "".join(non_italics)

# Opening spans with an italic style become <i>; the others are removed
span_start_tags = re.search(span_start_re, s)
span_start_no_tags = re.search(span_start_re_not, s)
if span_start_tags:
    s = u"<i>".join(s.split(span_start_tags.group()))
if span_start_no_tags:
    s = u"".join(s.split(span_start_no_tags.group()))

The <i> tag, and the closing </i> at the end, are added only if the line matches the regex for italic styles.

But for lines with both kinds of span, I get a wrong result:

Lines with only italic or only non-italic spans (same lines as in the first example):

1
00:00:06,006 --> 00:00:09,009
UNA SERIE ORIGINAL DE NETFLIX

2
00:00:25,358 --> 00:00:28,570
ANDALUCÍA, ESPAÑA, ACTUALIDAD

3
00:00:55,096 --> 00:00:57,307
<i>Toda la vida he soñado con la muerte.</i>

4
00:00:58,183 --> 00:01:00,602
<i>Abandono mi cuerpo y me veo desde arriba.</i>

Line with both italic and non-italic spans (same line as in the first example):

436
00:38:02,613 --> 00:38:04,740
Hay fiestón. Una </i><i>rave</i> en la cárcel.</i>
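
The stray </i> tags above appear because closing tags are converted without looking at which opening span they belong to. One way to avoid this is to match each opening span together with its own closing tag and decide per span. A minimal sketch, assuming spans are not nested and that the set of italic style ids is already known (convert_spans and SPAN_RE are hypothetical names, not part of to_srt.py):

```python
import re

# One opening span, its text, and the closest closing tag after it.
SPAN_RE = re.compile(r'<span style="([^"]+)">(.*?)</span>')


def convert_spans(line, italic_styles):
    """Wrap italic spans in <i>...</i> and unwrap all other spans."""
    def replace(match):
        style, text = match.group(1), match.group(2)
        if style in italic_styles:
            return "<i>{}</i>".format(text)
        return text  # non-italic span: keep only the text

    return SPAN_RE.sub(replace, line)


line = ('<span style="style1">Hay fiestón. Una </span>'
        '<span style="style2">rave</span>'
        '<span style="style1"> en la cárcel.</span>')
print(convert_spans(line, {"style2"}))
# Hay fiestón. Una <i>rave</i> en la cárcel.
```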

Thanks!

isaacbernat commented 4 years ago

Thanks for the comments @maresjj, I see you put effort into it. I have 2 suggestions.

Quick and dirty: Add a postprocessing step

This could be done in xml_to_srt after content value is "ready": https://github.com/isaacbernat/netflix-to-srt/blob/596491c3701f3b0d0d34c05e532703bdf636eac6/to_srt.py#L114-L116

if span_end_tags:
    content = u"</i>".join(content.split(span_end_tags.group()))
    content = cleanup_excess_closing_italics(content)

Or more generally on append_subs e.g.: https://github.com/isaacbernat/netflix-to-srt/blob/596491c3701f3b0d0d34c05e532703bdf636eac6/to_srt.py#L78-L83

def append_subs(start, end, prev_content, format_time):
    subs.append({
        "start_time": convert_time(start) if format_time else start,
        "end_time": convert_time(end) if format_time else end,
        "content": u"\n".join(postprocess_line(prev_content)),
    })
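
cleanup_excess_closing_italics is only named in the snippet above, not defined. As a rough sketch of what such a postprocessing step could look like, the function below drops every </i> that has no matching <i> before it. It is one possible reading of the suggestion, not the code that was eventually merged:

```python
import re

ITALIC_TAG_RE = re.compile(r"</?i>")


def cleanup_excess_closing_italics(content):
    """Remove closing </i> tags that do not close an open <i>."""
    pieces = []
    depth = 0  # number of <i> tags currently open
    pos = 0
    for match in ITALIC_TAG_RE.finditer(content):
        pieces.append(content[pos:match.start()])
        pos = match.end()
        if match.group() == "<i>":
            depth += 1
            pieces.append("<i>")
        elif depth > 0:
            depth -= 1
            pieces.append("</i>")
        # otherwise: an unmatched </i>, which is dropped
    pieces.append(content[pos:])
    return "".join(pieces)


broken = "Hay fiestón. Una </i><i>rave</i> en la cárcel.</i>"
print(cleanup_excess_closing_italics(broken))
# Hay fiestón. Una <i>rave</i> en la cárcel.
```

This only repairs the closing tags; leftover opening span tags would still need to be removed separately.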

Replace only spans that should have italics and remove all the others

One way to do it would be here: https://github.com/isaacbernat/netflix-to-srt/blob/596491c3701f3b0d0d34c05e532703bdf636eac6/to_srt.py#L96-L99 My regexes are a bit rusty, so this is just a sketch of a possible implementation:

import re

# get_italics_style_names is a helper still to be written, returning e.g. ["style2"]
style_names = get_italics_style_names(text)

# Replace every span whose style is italic with <i>...</i>.
# Joining the names with "|" handles all of them without a loop.
if style_names:
    italic_span_re = re.compile(
        u'<span style="(?:{})">(.*?)</span>'.format(
            u"|".join(re.escape(name) for name in style_names)))
    # group 1 is the text between the opening span tag and the
    # closest closing span tag after it
    content = italic_span_re.sub(r"<i>\1</i>", content)

# delete all remaining spans that do not require any additional stylings. E.g. similar to existing code
span_start_re = re.compile(u'<span style="[a-zA-Z0-9_.]+">')
span_end_re = re.compile(u'</span>')

content = span_start_re.sub(u"", content)
content = span_end_re.sub(u"", content)
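
get_italics_style_names is referenced above but not defined. A minimal sketch of one way to write it with xml.dom.minidom, assuming the input is the full subtitle document with the tts namespace declared (the sample document and its namespace URIs are illustrative, not copied from a Netflix file):

```python
from xml.dom import minidom


def get_italics_style_names(text):
    """Return the xml:id of every <style> with tts:fontStyle="italic"."""
    dom = minidom.parseString(text)
    return [
        style.getAttribute("xml:id")
        for style in dom.getElementsByTagName("style")
        # getAttribute returns "" when the attribute is absent
        if style.getAttribute("tts:fontStyle") == "italic"
    ]


SAMPLE = """<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling">
<head><styling>
<style tts:backgroundColor="transparent" tts:textAlign="center" xml:id="style0"/>
<style tts:color="white" tts:fontWeight="normal" xml:id="style1"/>
<style tts:color="white" tts:fontStyle="italic" xml:id="style2"/>
</styling></head>
<body/>
</tt>"""

print(get_italics_style_names(SAMPLE))  # ['style2']
```

Returning full ids such as "style2", rather than only the last character, keeps files with ten or more styles unambiguous.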

Would you like to write a PR using one of them?

Also, if you like the repo consider starring it in GitHub ;)

isaacbernat commented 4 years ago

@maresjj do you need more time/help with the Pull Request, or would you rather I fix it myself?

isaacbernat commented 4 years ago

hi @maresjj I created a branch (issue-26) and a PR (https://github.com/isaacbernat/netflix-to-srt/pull/30/files) which I think fixes the italics you noticed. Want to clone it and try it before I merge it to master?

isaacbernat commented 4 years ago

@maresjj I merged the fix to master, I am closing this issue.