isaacbernat / netflix-to-srt

Rip, extract and convert subtitles to .srt closed captions from .xml/dfxp/ttml and .vtt/WebVTT (e.g. Netflix, YouTube)
MIT License

Accent marks problem #4

Closed sup3rgiu closed 8 years ago

sup3rgiu commented 8 years ago

Hi, today I tried to convert a Netflix .xml subtitle file to a regular .srt file. However, I noticed that the script doesn't convert accent marks correctly (à, è, ì, ò, ù, ...).

isaacbernat commented 8 years ago

@sup3rgiu Can you give me a sample file so I can reproduce the error? Thanks!

sup3rgiu commented 8 years ago

Sure. Here are the files (original and converted):

test.srt.txt test.xml.txt

(I added .txt as the extension so I could upload them to GitHub, but they are really .xml and .srt files.)

isaacbernat commented 8 years ago

Hi @sup3rgiu, I looked at the sample files you gave me. AFAIK the last line, which reads "Ãˆ una lunga storia.", should be "È una lunga storia." I think the problem is that the original XML file is already like this ("Ãˆ" instead of "È"), which is the pattern you typically get when UTF-8 bytes are read as a single-byte encoding such as Windows-1252. Therefore, this is not a problem with the script that converts it to .srt. Maybe you can save the XML with a different encoding to fix the issue?
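To illustrate the kind of garbling described above, here is a minimal sketch. It assumes the XML was produced by reading UTF-8 bytes as Windows-1252 (cp1252), which is a guess based on the sample rather than something confirmed. The helper names `garble` and `repair` are hypothetical and are not part of netflix-to-srt.

```python
def garble(text: str) -> str:
    """Simulate the suspected bug: UTF-8 bytes wrongly decoded as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Reverse the wrong decode (assumption: cp1252 was the wrong codec).

    If the text cannot be round-tripped, it is probably not garbled in this
    particular way, so it is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    broken = garble("È una lunga storia.")
    print(broken)          # Ãˆ una lunga storia.
    print(repair(broken))  # È una lunga storia.
```

If this assumption holds, running `repair` over the text of the downloaded XML before converting it might be enough as a workaround. Text that is already correct is left untouched, because the round trip fails and the function falls back to the input.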

sup3rgiu commented 8 years ago

This is a screenshot of what I see in the Google Chrome dev tools: screenshot_1

So I don't know whether this is a problem with the Netflix subtitle format or with something else.

isaacbernat commented 8 years ago

I am afraid the issue is outside of my scope. I searched for a way to change the character encoding (which seems to be the problem) in Google Chrome but couldn't find a solution. I don't know whether changing the language (e.g. to English) would help, or whether the issue is in the Netflix files themselves.

If you find a workaround, please tell me so I can add it to the readme.

SuperrSonic commented 7 years ago

@sup3rgiu Still interested in finding a fix for this? I've downloaded many Spanish subtitles and haven't encountered any problem with wrong characters.

How exactly are you downloading them? Since the guide itself leaves that bit to the user, I imagine many people just copy and paste the subtitles, which already show up incorrectly in the browser, instead of getting the data directly in its untampered form.
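As a rough way to tell the two cases apart, the sketch below saves the subtitle payload byte for byte and then checks whether a saved file looks garbled. It assumes the subtitles are served as UTF-8 and that any garbling comes from a Windows-1252 misread. Both are assumptions, and `save_untouched` and `looks_garbled` are hypothetical helper names, not part of netflix-to-srt.

```python
from pathlib import Path


def save_untouched(payload: bytes, path: str) -> None:
    """Write the downloaded bytes as-is, with no text decoding in between."""
    Path(path).write_bytes(payload)


def looks_garbled(path: str) -> bool:
    """Heuristic check for UTF-8 text that was misread as Windows-1252.

    Garbled text can be turned back into bytes with cp1252 and then decoded
    as UTF-8 into something different. Correct accented text normally fails
    that round trip, and plain ASCII comes back unchanged.
    """
    text = Path(path).read_bytes().decode("utf-8", errors="replace")
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return repaired != text
```

A file saved straight from the network response should make `looks_garbled` return `False`, while a copy-pasted version that already shows wrong characters would likely return `True`. This is only a heuristic and can misjudge unusual text.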