remykarem / python2jupyter

Convert from Python script to Jupyter notebook and vice versa

MIT License

125 stars, 32 forks

Encoding problem #6

Closed kpym closed 4 years ago

kpym commented 4 years ago

There is an encoding problem. This UTF-8 encoded source file:

# Les deux courbes sont très proches

produces the markdown cell

Les deux courbes sont trÃ¨s proches

The resulting .ipynb is itself UTF-8 encoded, but somewhere along the way the text was decoded with the wrong encoding.

remykarem commented 4 years ago

Hi, I'm unable to reproduce this bug. Could you share with me this source file?

kpym commented 4 years ago

Strange that you can't reproduce it. I just turned my comment into a single-line UTF-8 Python file:

# Les deux courbes sont très proches

and after converting it with p2j I obtain:

{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["Les deux courbes sont tr\u00c3\u00a8s proches"]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4"}}, "nbformat": 4, "nbformat_minor": 2}

where you can see that très is converted to tr\u00c3\u00a8s.
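A minimal sketch of how this exact output can arise, assuming the UTF-8 file is decoded as cp1252 (a common Windows default; the thread does not confirm which code page was in use) and then serialized with json's default ASCII escaping:

```python
import json

original = "très"

# "è" is two bytes in UTF-8: 0xC3 0xA8.
utf8_bytes = original.encode("utf-8")

# Assumption: the reader decodes those bytes as cp1252, so each byte
# becomes its own character (0xC3 -> U+00C3, 0xA8 -> U+00A8).
misread = utf8_bytes.decode("cp1252")

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
escaped = json.dumps(misread)
print(escaped)  # "tr\u00c3\u00a8s"

# The damage is reversible as long as the wrong code page is known.
repaired = misread.encode("cp1252").decode("utf-8")
```

The escaped string matches the `tr\u00c3\u00a8s` seen in the notebook, which is consistent with a wrong decoding step at read time rather than a problem in the output file's encoding.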

remykarem commented 4 years ago

Could you please attach the file on this page so that I can try it?

kpym commented 4 years ago

The content of my file is already in my previous comment. But OK, here is the one-line .py file, available on ghostbin.

remykarem commented 4 years ago

I still can't reproduce the error. What version of Python are you on? And what version of p2j?

kpym commented 4 years ago

I'm using Windows 10 with Python 3.7.7. Finding the version of p2j is not easy: there is no -v flag, and the -h flag does not print the version.

First, I installed the latest version with pip install p2j: the behaviour of the resulting p2j.exe is what I described.

Second, I cloned your repo and ran python p2j.py test.py. Same result, with \u00c3\u00a8 in place of è.

You are probably relying on the system encoding somewhere, and it is probably not the same on Windows as on your OS.

remykarem commented 4 years ago

I see. Could you print out the following in a Python interpreter?

import sys
sys.getdefaultencoding()

and

print("è".encode("utf-8").decode())

kpym commented 4 years ago

The output (in Git Bash and in PowerShell) is:

utf-8
è

I have changed the default encoding in my terminals to utf-8.
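One likely reason this check looks fine on an affected machine: in Python 3, sys.getdefaultencoding() always returns utf-8, and it governs str.encode() and bytes.decode() rather than file I/O. When open() is called without an encoding argument, it falls back to the locale's preferred encoding instead. A short sketch of a more telling check (the variable names are illustrative only):

```python
import locale
import sys

# Always "utf-8" on Python 3; says nothing about how files are decoded.
default_str_encoding = sys.getdefaultencoding()
print(default_str_encoding)

# What open() uses when no encoding argument is given (unless Python's
# UTF-8 mode is enabled). On many Windows setups this is "cp1252".
file_default = locale.getpreferredencoding(False)
print(file_default)
```

If the second value is not UTF-8, any open() call without an explicit encoding will misread a UTF-8 source file.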

remykarem commented 4 years ago

I'm unable to find a solution to this bug. We'll leave it open for now. In the meantime, if you find a fix, do submit a PR.

kpym commented 4 years ago

Thanks for considering this. If I have time, I'll take a look at it.

remykarem commented 4 years ago

Hi @ktzanev that's a good point! @kpym I've merged these changes into master. Can you try again and see if the problem persists?

kpym commented 4 years ago

EDIT: I posted an earlier comment from the wrong account, so I am reposting it here for the record.

I've checked. You should open the source .py as utf-8 on line 32:

with open(source_filename, 'r', encoding='utf-8') as infile:

And you should dump the JSON as utf-8 on lines 161-162:

with open(target_filename, 'w', encoding='utf-8') as outfile:
    json.dump(final, outfile, indent=1, ensure_ascii=False)

The default encoding is platform dependent, which is why it worked on some systems and not on others. It is always good practice to specify the encoding explicitly when opening text files.
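The two changes can be checked end to end with a small round trip. This is only a sketch: convert() is a hypothetical stand-in that turns comment lines into markdown cells, not p2j's actual conversion logic, and the file names are made up.

```python
import json
import os
import tempfile


def convert(source_filename, target_filename):
    """Hypothetical stand-in: one markdown cell per comment line."""
    # Read the source with an explicit encoding, not the platform default.
    with open(source_filename, "r", encoding="utf-8") as infile:
        lines = [line.rstrip("\n") for line in infile]

    cells = [
        {"cell_type": "markdown", "metadata": {}, "source": [line[1:].strip()]}
        for line in lines
        if line.startswith("#")
    ]
    final = {"cells": cells, "nbformat": 4, "nbformat_minor": 2}

    # Write UTF-8 and keep non-ASCII characters unescaped.
    with open(target_filename, "w", encoding="utf-8") as outfile:
        json.dump(final, outfile, indent=1, ensure_ascii=False)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test.py")
    dst = os.path.join(tmp, "test.ipynb")
    with open(src, "w", encoding="utf-8") as f:
        f.write("# Les deux courbes sont très proches\n")

    convert(src, dst)

    with open(dst, "r", encoding="utf-8") as f:
        raw = f.read()
    notebook = json.loads(raw)
```

With both encodings pinned, the result no longer depends on the platform's locale. ensure_ascii=False is not strictly required for correctness, since escaped output would still decode to the right text, but it keeps the .ipynb readable in a text editor.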

remykarem commented 4 years ago

Great :) Closing this issue.