open-contracting / ocdskit

A suite of command-line tools for working with OCDS data
https://ocdskit.readthedocs.io
BSD 3-Clause "New" or "Revised" License

indent: Encoding issue on Windows #148

Closed duncandewhurst closed 3 years ago

duncandewhurst commented 3 years ago

Running the following commands on Linux results in a UTF-8 encoded file:

curl http://200.13.162.79/datosabiertos/HC1/HC1_datos_2020_json.zip > honduras.zip
unzip -o honduras.zip
ocdskit indent HC1_datos_2020.json

But running the equivalent commands on Windows results in an iso-8859-1 encoded file:

curl http://200.13.162.79/datosabiertos/HC1/HC1_datos_2020_json.zip > honduras.zip
tar -x -f honduras.zip
ocdskit indent HC1_datos_2020.json

On Windows, the output of more HC1_datos_2020.json differs before and after running ocdskit indent:

Before indenting, the output includes:

"name": "Secretaria de Salud P\u00fablica"

After indenting, the output includes:

"name": "Secretaria de Salud P�blica"

PYTHONIOENCODING is set to utf-8 and the terminal code page is set to 65001 (utf-8).
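A minimal sketch of how the same name can end up displayed with a replacement character. It assumes the indented file was written in a single-byte legacy encoding; cp1252 is used here as a stand-in for the Windows ANSI code page, which is an assumption and not something confirmed in this report. In both cp1252 and ISO-8859-1, ú is the single byte 0xFA, which is not valid UTF-8.

```python
import json

# The value as it appears in the original file (escaped, pure ASCII).
name = json.loads('"Secretaria de Salud P\\u00fablica"')

utf8_bytes = name.encode("utf-8")     # ú becomes the two bytes C3 BA
legacy_bytes = name.encode("cp1252")  # assumption: ú becomes the single byte FA

# A UTF-8 terminal (code page 65001) cannot decode the lone FA byte,
# so it is shown as U+FFFD, the replacement character.
garbled = legacy_bytes.decode("utf-8", errors="replace")

print(utf8_bytes)
print(legacy_bytes)
print(ascii(garbled))
```

The escaped form in the original file is plain ASCII, so it reads the same under either encoding; the difference only appears once the character is written out literally.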

jpmckinney commented 3 years ago

What precise version of Windows are you running? Apparently some things were fixed in Windows 10 October 2018 Update (build 1809).

How did you set the terminal code page? chcp 65001? Did you set LC_CTYPE=en_US.utf-8? What is the output of python -c "import sys; print(sys.stdout.encoding)"?

duncandewhurst commented 3 years ago

> What precise version of Windows are you running? Apparently some things were fixed in Windows 10 October 2018 Update (build 1809).

Windows 10 Pro version 1909 OS build 18363.900

> How did you set the terminal code page?

chcp 65001

> Did you set LC_CTYPE=en_US.utf-8?

No. I tested again after setting it and got the same result.

> What is the output of python -c "import sys; print(sys.stdout.encoding)"?

utf-8 (before and after setting LC_CTYPE)
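A possible reason the answers above look fine while the file is still wrong, offered as an assumption rather than something established in this thread: sys.stdout.encoding (and PYTHONIOENCODING) only describe the standard streams, whereas open() without an explicit encoding falls back to locale.getpreferredencoding(False), which on Windows is typically the ANSI code page. The hypothetical helper below prints both side by side.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python uses in different places."""
    return {
        # Used by print(); influenced by PYTHONIOENCODING and the console.
        "stdout": sys.stdout.encoding,
        # Default for open() when no encoding argument is given.
        "preferred": locale.getpreferredencoding(False),
        # Used for file names, not file contents.
        "filesystem": sys.getfilesystemencoding(),
        # True when Python's UTF-8 Mode is active.
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


for key, value in describe_encodings().items():
    print(f"{key}: {value}")
```

If "stdout" reports utf-8 but "preferred" reports something like cp1252, files written without an explicit encoding would not be UTF-8.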

jpmckinney commented 3 years ago

> After indenting, the output includes:

How are you reading the output? Can you upload it?

duncandewhurst commented 3 years ago

For troubleshooting purposes, I am reading the output using more. I originally came across the issue when I ran cat HC1_datos_2020.json | ocdskit compile > honduras_compiled_releases.json after following the steps in this section of the OCDS Kit Learning Lab to download, extract and indent the file. ocdskit compile reported an encoding error and suggested trying --encoding iso-8859-1.

I've uploaded the file before indenting and after indenting.
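For a file that has already been written in the wrong encoding, one workaround is to re-encode it before passing it on. This is a sketch only: the reencode helper is hypothetical, and it assumes the file really is ISO-8859-1, as the error message from ocdskit compile suggested.

```python
from pathlib import Path


def reencode(path, source_encoding="iso-8859-1", target_encoding="utf-8"):
    """Rewrite a text file in place using a different encoding.

    Works on bytes so that line endings are left untouched.
    """
    path = Path(path)
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode(target_encoding))


if __name__ == "__main__":
    import tempfile

    # Demonstrate on a throwaway file instead of real data.
    with tempfile.TemporaryDirectory() as directory:
        sample = Path(directory) / "sample.json"
        sample.write_bytes('{"name": "P\u00fablica"}'.encode("iso-8859-1"))
        reencode(sample)
        print(sample.read_bytes())
```

This only repairs the symptom; the file is still written in the legacy encoding each time the command runs.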

jpmckinney commented 3 years ago

Hmm, okay, I'll need to check how Windows determines the output encoding for the indent command. The behavior is a bit surprising.

jpmckinney commented 3 years ago

@duncandewhurst Can you use ocdskit --ascii indent and re-upload the output? This will tell me if the issue is with the output encoding or with the internal representation.

duncandewhurst commented 3 years ago

Thanks for continuing to troubleshoot! Let me know if a screen-share would be helpful:

https://drive.google.com/file/d/1C9sTWf-4cv6auXJu5M-10YIl3s4zrb83/view?usp=sharing

jpmckinney commented 3 years ago

I notice the latest document is identical to an earlier one, which is because the indent command ignores the --ascii option 🙃 I'll fix that first.

jpmckinney commented 3 years ago

@duncandewhurst I've now fixed that on HEAD, if you can install from GitHub and run again.

duncandewhurst commented 3 years ago

I've uploaded the new version to the same URL: https://drive.google.com/file/d/1C9sTWf-4cv6auXJu5M-10YIl3s4zrb83/view?usp=sharing

jpmckinney commented 3 years ago

Fascinating. The ASCII output correctly escapes the non-ASCII characters (e.g. \u00f3 for ó). I guess that's thanks to Python, which does the encoding.
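To illustrate why escaped output is a useful diagnostic, here is a sketch using the standard library's json.dumps. Whether the --ascii option is implemented this way is an assumption, and the sample value is made up for illustration.

```python
import json

# Hypothetical sample value containing the character mentioned above.
data = {"name": "Educaci\u00f3n"}

# With ensure_ascii=True (the default), non-ASCII characters are written
# as \uXXXX escapes, so the result is identical under any ASCII-compatible
# file encoding.
escaped = json.dumps(data, ensure_ascii=True)

# With ensure_ascii=False, the character is written literally and the
# bytes on disk depend on the encoding used to write the file.
literal = json.dumps(data, ensure_ascii=False)

print(escaped)
print(literal.encode("utf-8"))
print(literal.encode("iso-8859-1"))
```

Since the escaped output is correct, the in-memory data is fine and the problem is limited to how the literal characters are encoded on write.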

Anyway, looks like there's another magic environment variable to make Windows use UTF8 when reading/writing files like the rest of the world: PYTHONUTF8=1

https://dev.to/methane/python-use-utf-8-mode-on-windows-212i
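A small sketch for checking the effect of PYTHONUTF8=1 without changing the current shell. It starts a child interpreter with the variable set and asks it which default encoding it would use; the run_probe helper is hypothetical.

```python
import os
import subprocess
import sys

# Code run in the child interpreter: report UTF-8 Mode and the default
# encoding that open() would use.
PROBE = (
    "import locale, sys; "
    "print(sys.flags.utf8_mode, locale.getpreferredencoding(False))"
)


def run_probe(utf8_mode):
    """Run the probe in a child Python, with or without PYTHONUTF8=1."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)
    if utf8_mode:
        env["PYTHONUTF8"] = "1"
    result = subprocess.run(
        [sys.executable, "-c", PROBE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print("default:     ", run_probe(False))
print("PYTHONUTF8=1:", run_probe(True))
```

With the variable set, the child should report UTF-8 Mode as enabled and UTF-8 as the preferred encoding. In a Windows command prompt the variable can be set for the session with set PYTHONUTF8=1 before running ocdskit.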

jpmckinney commented 3 years ago

Also, should have asked earlier, what version of Python is this? python --version

duncandewhurst commented 3 years ago

> Anyway, looks like there's another magic environment variable to make Windows use UTF8 when reading/writing files like the rest of the world: PYTHONUTF8=1

Ah, that did the trick. I've added it to the learning lab instructions for Windows users.

Python version is 3.8.4

jpmckinney commented 3 years ago

I've added that instruction to the docs. Closing.