rspeer / python-ftfy

Fixes mojibake and other glitches in Unicode text, after the fact.
http://ftfy.readthedocs.org

Unbaking Japanese #119

Open wimglenn opened 5 years ago

wimglenn commented 5 years ago
>>> import ftfy
>>> s = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png"
>>> ftfy.ftfy(s)
'Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png'
>>> s.encode('cp850').decode('shift-jis')
'時間試験観点（アニメパス）_10秒.png'

Possible? Thanks

rspeer commented 5 years ago

Neither Shift-JIS nor codepage 850 is an encoding that ftfy currently tries.

I can see the reason to want to support Shift-JIS; it's pretty much the reason for the word "mojibake", after all. It's something I want to support, but I haven't found a good heuristic for distinguishing it from EUC-JP. Shift-JIS text even coincidentally looks like UTF-8 sometimes.
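
To make the ambiguity concrete, here is a minimal brute-force sketch that uses only Python's built-in codecs. It is not how ftfy works internally. The function name `candidate_repairs` and both encoding lists are assumptions made up for illustration. It tries every (wrong decoder, intended encoding) pair and keeps the ones that round-trip without an error:

```python
from itertools import product

# Hypothetical candidate lists chosen for illustration only;
# they are not taken from ftfy's configuration.
WRONG_DECODERS = ["cp850", "cp437", "latin-1", "cp1252"]
INTENDED_ENCODINGS = ["shift_jis", "euc_jp", "utf-8"]


def candidate_repairs(garbled):
    """Return (wrong, intended, text) for each pair that round-trips cleanly."""
    repairs = []
    for wrong, intended in product(WRONG_DECODERS, INTENDED_ENCODINGS):
        try:
            # Undo the wrong decoding, then decode with the intended encoding.
            text = garbled.encode(wrong).decode(intended)
        except UnicodeError:
            # Covers both UnicodeEncodeError and UnicodeDecodeError.
            continue
        if text != garbled:
            repairs.append((wrong, intended, text))
    return repairs


# The file name from the original report.
garbled = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png"
repairs = candidate_repairs(garbled)
```

A pair that decodes without error is only a candidate. More than one pair may survive for a given input, and choosing between them (Shift-JIS versus EUC-JP, for example) is exactly the heuristic that is still missing.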

rspeer commented 2 weeks ago

This is something I want to make another attempt at. I think that understanding the nuances of Shift-JIS mojibake is going to require talking to someone who spends a lot of time on the Japanese-language Internet, though.

wimglenn commented 2 weeks ago

Good luck. I am not that person. I nominate @bobince.