pre-commit / pre-commit-hooks

Some out-of-the-box hooks for pre-commit
MIT License
5.4k stars 710 forks source link

Add UTF-16 support for check-yaml hook #1087

Closed logan-connolly closed 3 months ago

logan-connolly commented 3 months ago

Hi @asottile,

First off thanks for all the hard work you put into this project and all your others.

The issue

I wanted to raise an issue that my team faced when running the check-yaml hook on utf-16 encoded files. According to the YAML spec, utf-16 files are supported:

On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. (§ 5.2. Character Encodings)

But in the check-yaml precommit hook, the encoding is hardcoded to UTF-8:

https://github.com/pre-commit/pre-commit-hooks/blob/429455474be018c8f085ae7d312432ab4154d5a2/pre_commit_hooks/check_yaml.py#L63

which will raise the following error when a utf-16 file is passed:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Potential solutions

  1. Ugly but safe, explicitly opening utf-8 and utf-16 encoded files and handling UnicodeError:
         try:
             with open(filename, encoding='UTF-8') as f:
                 load_fn(f)
+        except UnicodeError:
+            try:
+                with open(filename, encoding='UTF-16') as f:
+                    load_fn(f)
+            except ruamel.yaml.YAMLError as exc:
+                print(exc)
+                retval = 1
         except ruamel.yaml.YAMLError as exc:
             print(exc)
             retval = 1
  1. Relax the encoding argument (it could have unintentional consequences by not enforcing the encoding though):
     retval = 0
     for filename in args.filenames:
         try:
-            with open(filename, encoding='UTF-8') as f:
+            with open(filename, mode="rb") as f:
                 load_fn(f)
         except ruamel.yaml.YAMLError as exc:
             print(exc)

There may be a better approach out there, but these are the two that I initially came up with. If you are happy with any of them, I would be happy to open a PR with the change and add additional utf-16 test cases to check-yaml.

Related issues

I tried to search the existing issues to make sure this was not a duplicate, but could not. However, these are some related issues that I found that may be relevant to the discussion:

asottile commented 3 months ago

no thanks, utf 16 is usually a mistake and I'm not willing to support it