Pipfile should explicitly specify an encoding

uranusjr commented 6 years ago

Spun off from #1842.

Currently Pipenv does not handle non-ASCII characters very well. Pipfile is parsed with the platform’s default encoding; this becomes a problem for cross-platform projects if there are non-ASCII paths in the project, especially for the scripts section, but also for local dependencies’ path value.

A straightforward solution would be to strictly require Pipfiles to be encoded with UTF-8. This is in line with the best practice of the Python community in general, including core development. Pipenv would then use io.open instead of open to read Pipfiles instead (in project.py), so contents, including paths, are always read as text in the correct encoding.

techalchemy commented 6 years ago

I think this makes a lot of sense, does anything break if we implement it right now?

uranusjr commented 6 years ago

Yes, if you already have a platform-specific encoded Pipfile with non-ASCII stuff in it. But this is easily amendable by re-saving the Pipfile with UTF-8.

techalchemy commented 6 years ago

How can we gracefully warn people about that when we encounter it?

uranusjr commented 6 years ago

The parsing would almost always fail with UnicodeDecodeError if the encoding is not UTF-8, so it would probably be enough to catch that and emit a prettier error.

I doubt that would be necessary though. This would happen mainly on 1. Windows 2. POSIX systems with non-UTF-8 encoding. The latter is extremely unconventional these days (and I believe is not supported anyway). The former is probably next to non-existent, based on how many fundamental issues brettcannon (avoiding an @ here) was able to file when he adopted Pipenv through VS Code.

techalchemy commented 6 years ago

Can we fall back to their default system encoding if we handle that exception?

jtratner commented 6 years ago

python 2 may be a bit hairy for this 😭

uranusjr commented 6 years ago

@jtratner Yeah it needs careful testing, but it can be pulled off. Many projects already did.

@techalchemy It should be. I’m just not sure about the necessity…

techalchemy commented 6 years ago

@uranusjr @jtratner relevant link from Victor Stinner from yesterday regarding 3.7 which will by default open things in utf-8 mode: https://vstinner.github.io/python37-new-utf8-mode.html

vlcinsky commented 6 years ago

Yes, encoding shall be agreed and not dependent on current default encoding. Otherwise we will repeat errors as in distutils or in pbr package.

For Python 2.x compatibility, this ConfigParser snippet from PyMoTW could help.

techalchemy commented 6 years ago

@ncoghlan if you have a sec and a general comment on this one also I’d be super interested

ncoghlan commented 6 years ago

I'd missed that we weren't already enforcing UTF-8 everywhere - we definitely should be.

As @uranusjr notes, that's straightforward for new files and for ASCII-only files, which means the main cases to handle are going to be existing locale encoded files on *nix systems and code page encoding files on Windows.

We're helped here by the fact that PyPI itself restricts package names to ASCII-only identifiers, so non-ASCII characters can only appear in comments, and in non-PyPI package names and URLs.

I figure that just fixing the encoding (for files pipenv was about to edit anyway) is going to be easier than remembering to preserve the original encoding when writing them back, so I think transparently resolving the problem will be straightforward enough to be worth doing.

(@techalchemy Regarding the UTF-8 mode change in 3.7: that's only the default in the POSIX or C locale, although I think it may be the default on Windows in 3.6+)

techalchemy commented 6 years ago

Thanks for the input @ncoghlan I know you have a lot of projects to juggle so I appreciate it. I agree we are handling pretty much universally ascii characters and before yesterday I’m not sure we actually preserved all comments to begin with. So I agree in light of that it would seem worth pursuing (the risk seems low)

uranusjr commented 6 years ago

This reminds me, I also discovered we’re writing CR/LF line endings on Windows. This will need to be normalised as well.

pypa / pipenv

Pipfile should explicitly specify an encoding #1847