src-d / style-analyzer

Lookout Style Analyzer: fixing code formatting and typos during code reviews
GNU Affero General Public License v3.0
32 stars 21 forks source link

Specify UTF-8 as encoding for open-related methods #744

Open m09 opened 5 years ago

m09 commented 5 years ago

ATM we often don't specify the encoding in open calls. We should instead specify utf-8 everywhere because the default is platform dependent:

encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.

That means all our open, pathlib.Path.read_text and related calls should specify encoding="utf-8".

vmarkovtsev commented 5 years ago

There is an alternative: validate sys.getdefaultencoding() at import time and if it is not utf-8, print the workaround (LC_ALL=UTF-8 python3 script.py or export PYTHONIOENCODING=utf-8) and exit immediately.

I like it more because:

  1. The majority of the third-party dependencies do not specify encoding for open.
  2. It is very easy to miss encoding and quite hard to maintain, unless there is a flake8 plugin.
  3. My proposal is much easier to implement.
m09 commented 5 years ago

The majority of the third-party dependencies do not specify encoding for open.

bblfsh, the main dependency we use to parse the files, enforces utf-8 for its input.

It is very easy to miss encoding and quite hard to maintain, unless there is a flake8 plugin. My proposal is much easier to implement.

Indeed I checked quickly and it seems there is no existing flake option. It makes for a great OSD though. I agree that the proposed fix is faster to implement but I also think it's worse: we can fix the problem directly and instead make using our tool harder to use in some cases.