First off thanks for all the hard work you put into this project and all your others.
The issue
I wanted to raise an issue that my team faced when running the check-yaml hook on utf-16 encoded files. According to the YAML spec, utf-16 files are supported:
On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. (§ 5.2. Character Encodings)
But in the check-yaml precommit hook, the encoding is hardcoded to UTF-8:
which will raise the following error when a utf-16 file is passed:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Potential solutions
Ugly but safe, explicitly opening utf-8 and utf-16 encoded files and handling UnicodeError:
try:
with open(filename, encoding='UTF-8') as f:
load_fn(f)
+ except UnicodeError:
+ try:
+ with open(filename, encoding='UTF-16') as f:
+ load_fn(f)
+ except ruamel.yaml.YAMLError as exc:
+ print(exc)
+ retval = 1
except ruamel.yaml.YAMLError as exc:
print(exc)
retval = 1
Relax the encoding argument (it could have unintentional consequences by not enforcing the encoding though):
retval = 0
for filename in args.filenames:
try:
- with open(filename, encoding='UTF-8') as f:
+ with open(filename, mode="rb") as f:
load_fn(f)
except ruamel.yaml.YAMLError as exc:
print(exc)
There may be a better approach out there, but these are the two that I initially came up with. If you are happy with any of them, I would be happy to open a PR with the change and add additional utf-16 test cases to check-yaml.
Related issues
I tried to search the existing issues to make sure this was not a duplicate, but could not. However, these are some related issues that I found that may be relevant to the discussion:
https://github.com/pre-commit/pre-commit-hooks/issues/468 and others regarding the upstream parser: I think this is not an upstream issue, but in the pre-commit hook. When I adapt the check-yaml hook with any of the solutions above, the utf-16 files are successfully parsed and validated.
Hi @asottile,
First off thanks for all the hard work you put into this project and all your others.
The issue
I wanted to raise an issue that my team faced when running the
check-yaml
hook onutf-16
encoded files. According to the YAML spec,utf-16
files are supported:But in the
check-yaml
precommit hook, the encoding is hardcoded toUTF-8
:https://github.com/pre-commit/pre-commit-hooks/blob/429455474be018c8f085ae7d312432ab4154d5a2/pre_commit_hooks/check_yaml.py#L63
which will raise the following error when a
utf-16
file is passed:Potential solutions
utf-8
andutf-16
encoded files and handlingUnicodeError
:There may be a better approach out there, but these are the two that I initially came up with. If you are happy with any of them, I would be happy to open a PR with the change and add additional
utf-16
test cases to check-yaml.Related issues
I tried to search the existing issues to make sure this was not a duplicate, but could not. However, these are some related issues that I found that may be relevant to the discussion:
utf-16
is not supported by the JSON spec, it is supported in YAMLcheck-yaml
hook with any of the solutions above, theutf-16
files are successfully parsed and validated.