Closed maresjj closed 4 years ago
Hi @maresjj, thanks for reporting. A quick fix would be to ignore the span if its style attribute starts as "
(and not add the closing italics tag there either). But could you add a small, unprocessed example that I could run to_srt.py
on to get the same output you mention? I want to know if I am missing something (otherwise you may add a test file and create a Pull Request).
Hi,
This is the first example:
styling node:
<style tts:backgroundColor="transparent" tts:textAlign="center" xml:id="style0"/>
<style tts:color="white" tts:fontSize="100%" tts:fontWeight="normal" xml:id="style1"/>
<style tts:color="white" tts:fontSize="100%" tts:fontStyle="italic" tts:fontWeight="normal" xml:id="style2"/>
raw xml:
<p begin="60060000t" end="90090000t" region="region0" style="style0" tts:extent="80.00% 80.00%" tts:origin="10.00% 10.00%" xml:id="subtitle0"><span style="style1">UNA SERIE ORIGINAL DE NETFLIX</span></p>
<p begin="253586667t" end="285702084t" region="region0" style="style0" tts:extent="80.00% 80.00%" tts:origin="10.00% 10.00%" xml:id="subtitle1"><span style="style1">ANDALUCÍA, ESPAÑA, ACTUALIDAD</span></p>
<p begin="550967084t" end="573072500t" region="region0" style="style0" tts:extent="80.00% 80.00%" tts:origin="10.00% 10.00%" xml:id="subtitle2"><span style="style2">Toda la vida he soñado con la muerte.</span></p>
<p begin="581831250t" end="606022084t" region="region0" style="style0" tts:extent="80.00% 80.00%" tts:origin="10.00% 10.00%" xml:id="subtitle3"><span style="style2">Abandono mi cuerpo y me veo desde arriba.</span></p>
script output:
1
00:00:06,006 --> 00:00:09,009
<i>UNA SERIE ORIGINAL DE NETFLIX</i>
2
00:00:25,358 --> 00:00:28,570
<i>ANDALUCÍA, ESPAÑA, ACTUALIDAD</i>
3
00:00:55,096 --> 00:00:57,307
<i>Toda la vida he soñado con la muerte.</i>
4
00:00:58,183 --> 00:01:00,602
<i>Abandono mi cuerpo y me veo desde arriba.</i>
line with italic and non-italic spans on the same line:
<p begin="22826136667t" end="22847407917t" region="region0" style="style0" tts:extent="80.00% 80.00%" tts:origin="10.00% 10.00%" xml:id="subtitle435"><span style="style1">Hay fiestón. Una </span><span style="style2">rave</span><span style="style1"> en la cárcel.</span></p>
436
00:38:02,613 --> 00:38:04,740
<i>Hay fiestón. Una </i><span style="style2">rave</i><i> en la cárcel.</i>
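For reference, the begin/end values in the raw XML appear to be ticks at 10,000,000 per second, truncated to milliseconds. That rate is an assumption inferred from the outputs above (60060000t becomes 00:00:06,006), not something stated in this thread, and `ticks_to_srt_time` is a hypothetical name, not the script's own function. A minimal sketch of the conversion:

```python
def ticks_to_srt_time(value, tick_rate=10000000):
    """Convert a tick offset such as "60060000t" to an SRT "HH:MM:SS,mmm" string.

    tick_rate is an assumption inferred from the example outputs.
    """
    ticks = int(value.rstrip("t"))
    # Integer division truncates to whole milliseconds, matching the outputs shown.
    total_ms = ticks * 1000 // tick_rate
    hours, rest = divmod(total_ms, 3600000)
    minutes, rest = divmod(rest, 60000)
    seconds, millis = divmod(rest, 1000)
    return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, minutes, seconds, millis)


print(ticks_to_srt_time("60060000t"))
print(ticks_to_srt_time("22826136667t"))
```

Integer arithmetic is used here so that long offsets do not pick up floating-point rounding.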
This is the second example:
styling node:
<style tts:textAlign="right" xml:id="style0"/>
<style tts:color="white" xml:id="style1"/>
<style tts:textAlign="center" xml:id="style2"/>
<style tts:color="magenta" xml:id="style3"/>
<style tts:color="green" xml:id="style4"/>
<style tts:color="yellow" xml:id="style5"/>
<style tts:color="cyan" xml:id="style6"/>
raw xml:
<p begin="1362920000t" end="1376670000t" region="region1" style="style2" tts:extent="80.00% 5.33%" tts:origin="10.00% 84.67%" xml:id="subtitle10"><span style="style3">¡Dejad de moverle!</span></p>
<p begin="1397920000t" end="1413750000t" region="region1" style="style2" tts:extent="80.00% 5.33%" tts:origin="10.00% 84.67%" xml:id="subtitle11"><span style="style4">¡No mováis la valla!</span></p>
<p begin="1416670000t" end="1436250000t" region="region1" style="style2" tts:extent="80.00% 10.66%" tts:origin="10.00% 79.34%" xml:id="subtitle12"><span style="style1">(EN INGLÉS) ¡Soy refugiado!</span><br/><span style="style1">¡Soy refugiado!</span></p>
script output:
11
00:02:16,292 --> 00:02:17,667
<i>¡Dejad de moverle!</i>
12
00:02:19,792 --> 00:02:21,375
<i>¡No mováis la valla!</i>
13
00:02:21,667 --> 00:02:23,625
<i>(EN INGLÉS) ¡Soy refugiado!</i>
<i>¡Soy refugiado!</i>
I tried to fix it by checking which styles contain italic as "tts:fontStyle" and building a string of all the italic styles, to use in a regex that decides which lines get <i> tags and which do not.
The problem I found is with lines that have both kinds of span, with and without italics; those come out wrong:
code:
styles = collection.getElementsByTagName("style")
italics = []
non_italics = []
for style in styles:
    # keep the last character of each style id, split into italic / non-italic
    if style.hasAttribute("tts:fontStyle"):
        if style.getAttribute("tts:fontStyle") == "italic":
            italics.append(style.getAttribute("xml:id")[-1])
    else:
        non_italics.append(style.getAttribute("xml:id")[-1])
italics_str = "".join(italics)
non_italics_str = "".join(non_italics)
# span_start_re and span_start_re_not (not shown) are built from the two strings above
span_start_tags = re.search(span_start_re, s)
span_start_no_tags = re.search(span_start_re_not, s)
if span_start_tags:
    s = u"<i>".join(s.split(span_start_tags.group()))
if span_start_no_tags:
    s = u"".join(s.split(span_start_no_tags.group()))
Only if the line matches the regex WITH italics, add <i> and </i> at the end.
But with lines that contain both kinds of span, I get this result:
Lines with only italic OR only non-italic spans (same lines as in the first example):
1
00:00:06,006 --> 00:00:09,009
UNA SERIE ORIGINAL DE NETFLIX
2
00:00:25,358 --> 00:00:28,570
ANDALUCÍA, ESPAÑA, ACTUALIDAD
3
00:00:55,096 --> 00:00:57,307
<i>Toda la vida he soñado con la muerte.</i>
4
00:00:58,183 --> 00:01:00,602
<i>Abandono mi cuerpo y me veo desde arriba.</i>
Lines with both italic AND non-italic spans (same line as in the first example):
436
00:38:02,613 --> 00:38:04,740
Hay fiestón. Una </i><i>rave</i> en la cárcel.</i>
Thanks!
Thanks for the comments @maresjj, I see you put effort into it. I have 2 suggestions.
Clean up the closing </i> tags that don't have opening tags. Finditer might be useful for this: https://docs.python.org/2/library/re.html#finding-all-adverbs-and-their-positions
This could be done in xml_to_srt after the content value is "ready":
https://github.com/isaacbernat/netflix-to-srt/blob/596491c3701f3b0d0d34c05e532703bdf636eac6/to_srt.py#L114-L116
if span_end_tags:
    content = u"</i>".join(content.split(span_end_tags.group()))
content = cleanup_excess_closing_italics(content)
Or more generally in append_subs, e.g.:
https://github.com/isaacbernat/netflix-to-srt/blob/596491c3701f3b0d0d34c05e532703bdf636eac6/to_srt.py#L78-L83
def append_subs(start, end, prev_content, format_time):
    subs.append({
        "start_time": convert_time(start) if format_time else start,
        "end_time": convert_time(end) if format_time else end,
        "content": u"\n".join(postprocess_line(prev_content)),
    })
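The cleanup_excess_closing_italics helper named above is not defined anywhere in this thread. A minimal sketch of what it might look like, using re.finditer as suggested; the body is an assumption, not the code that was eventually merged:

```python
import re

ITALIC_TAG_RE = re.compile(u"</?i>")


def cleanup_excess_closing_italics(content):
    """Drop every </i> that has no unmatched <i> before it."""
    pieces = []
    open_tags = 0
    previous_end = 0
    for match in ITALIC_TAG_RE.finditer(content):
        # Copy the plain text between the previous tag and this one.
        pieces.append(content[previous_end:match.start()])
        tag = match.group()
        if tag == u"<i>":
            open_tags += 1
            pieces.append(tag)
        elif open_tags > 0:
            # A closing tag that pairs with an earlier opening tag is kept.
            open_tags -= 1
            pieces.append(tag)
        # Otherwise the closing tag has no opening tag and is skipped.
        previous_end = match.end()
    pieces.append(content[previous_end:])
    return u"".join(pieces)


broken = u"Hay fiestón. Una </i><i>rave</i> en la cárcel.</i>"
print(cleanup_excess_closing_italics(broken))
# Hay fiestón. Una <i>rave</i> en la cárcel.
```

Counting open tags instead of matching pairs with a single regex keeps the function correct when a line contains several italic spans.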
One way to do it would be here: https://github.com/isaacbernat/netflix-to-srt/blob/596491c3701f3b0d0d34c05e532703bdf636eac6/to_srt.py#L96-L99
My regexes are a bit rusty, so this is just a rough sketch of a possible implementation:
import re

style_names = get_italics_style_names(text)  # hypothetical helper, e.g. returns ["style2"]
if style_names:
    # a single regex covers every italic style, so no loop over style_names is needed
    regex_with_style_names = re.compile(
        u'<span style="(?:{})">(.*?)</span>'.format(u"|".join(re.escape(n) for n in style_names)))
    tmp_content = u""
    previous_match = 0
    for m in re.finditer(regex_with_style_names, content):
        # keep the text before the opening span tag, then wrap the span text (group 1)
        # in <i>...</i>, replacing the opening tag and the closest closing tag after it
        tmp_content += content[previous_match:m.start()] + u"<i>" + m.group(1) + u"</i>"
        previous_match = m.end()
    # keep whatever follows the last italic span
    content = tmp_content + content[previous_match:]
# delete all remaining spans that do not require any additional styling, similar to the existing code
span_start_re = re.compile(u'<span style="[a-zA-Z0-9_.]+">')
span_end_re = re.compile(u'</span>')
content = span_start_re.sub(u"", content)
content = span_end_re.sub(u"", content)
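The get_italics_style_names helper used above is not defined in the thread. The sketch below is one possible version, assuming it receives the styling node as an XML string (the wrapping <styling> element and its namespace declaration are added here only so the snippet parses on its own). It also shows a one-pass alternative, spans_to_italics (a hypothetical name), that decides per span whether to emit <i> tags:

```python
import re
from xml.dom import minidom

# Styling node from the first example, wrapped so that the tts prefix is declared.
STYLING = u"""<styling xmlns:tts="http://www.w3.org/ns/ttml#styling">
<style tts:backgroundColor="transparent" tts:textAlign="center" xml:id="style0"/>
<style tts:color="white" tts:fontSize="100%" tts:fontWeight="normal" xml:id="style1"/>
<style tts:color="white" tts:fontSize="100%" tts:fontStyle="italic" tts:fontWeight="normal" xml:id="style2"/>
</styling>"""

SPAN_RE = re.compile(u'<span style="([^"]+)">(.*?)</span>', re.DOTALL)


def get_italics_style_names(styling_xml):
    """Return the full xml:id of every <style> whose tts:fontStyle is italic."""
    doc = minidom.parseString(styling_xml)
    return [
        style.getAttribute("xml:id")
        for style in doc.getElementsByTagName("style")
        if style.getAttribute("tts:fontStyle") == "italic"
    ]


def spans_to_italics(content, italic_names):
    """Wrap italic spans in <i>...</i> and unwrap every other span."""
    def replace(match):
        name, text = match.group(1), match.group(2)
        return u"<i>{}</i>".format(text) if name in italic_names else text
    return SPAN_RE.sub(replace, content)


italic_names = get_italics_style_names(STYLING)
line = (u'<span style="style1">Hay fiestón. Una </span>'
        u'<span style="style2">rave</span>'
        u'<span style="style1"> en la cárcel.</span>')
print(spans_to_italics(line, italic_names))
# Hay fiestón. Una <i>rave</i> en la cárcel.
```

Comparing the full xml:id rather than only its last character avoids confusing, for example, style2 with style12.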
Also, if you like the repo consider starring it in GitHub ;)
@maresjj do you need more time/help with the Pull Request, or would you rather I fix it myself?
hi @maresjj I created a branch (issue-26) and a PR (https://github.com/isaacbernat/netflix-to-srt/pull/30/files) which I think fixes the italics you noticed. Want to clone it and try it before I merge it to master?
@maresjj I merged the fix to master, I am closing this issue.
Hi!
Since a few days ago, Netflix has been using span for all lines, including non-italic ones.
This is an example:
Or another example:
With the first example, all lines are marked as italic, and the second example produces these lines once converted to srt:
Thanks