jgstew / pre-commit-jgstew

custom pre-commit hooks
MIT License

Add a check that files are UTF-8 and not ASCII format. #2

Closed jgstew closed 1 year ago

jgstew commented 1 year ago

See here: https://stackoverflow.com/a/3269323/861745

The idea is to ensure that files that should be UTF-8 actually are UTF-8, even if those files generally contain only ASCII characters. Checking that a file contains only ASCII characters is a different check.
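As a rough sketch of that distinction (not the hook's actual implementation), the two checks could look like this in Python. The function names `is_valid_utf8` and `is_ascii_only` are hypothetical. Note that any ASCII-only data is also valid UTF-8, so the first check on its own cannot tell the two apart.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


def is_ascii_only(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range (Python 3.7+)."""
    return data.isascii()
```

For example, `"café".encode("utf-8")` should pass the first check and fail the second, while the same text encoded as Latin-1 should fail both.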

jgstew commented 1 year ago

It might be best to check that the UTF-8 BOM is present:

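import codecs

# file_path is assumed to be set by the caller to the file being checked.
encoding = "Unknown"

# The UTF-8 byte order mark is the three bytes EF BB BF.
required_bom = codecs.BOM_UTF8

with open(file_path, "rb") as file:
    # Read only as many bytes as the BOM is long.
    header = file.read(len(required_bom))
    if header.startswith(required_bom):
        encoding = "utf-8-bom"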

Most of the files I care about don't have a BOM, so this doesn't actually work: https://github.com/jgstew/tools/blob/master/Python/file_check_bom.py
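One possible workaround, sketched under the assumption that a missing BOM should not count as a failure: note whether the BOM is present, then fall back to a strict UTF-8 decode of the whole file. The function name `classify_encoding` and its return labels are made up for illustration and are not taken from the linked script.

```python
import codecs


def classify_encoding(file_path: str) -> str:
    """Label a file as "utf-8-bom", "ascii", "utf-8", or "not-utf-8"."""
    with open(file_path, "rb") as file:
        data = file.read()

    has_bom = data.startswith(codecs.BOM_UTF8)

    # A strict decode rejects byte sequences that are not valid UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf-8"

    if has_bom:
        return "utf-8-bom"
    if data.isascii():
        # Pure ASCII is valid UTF-8, but a caller may want to flag it separately.
        return "ascii"
    return "utf-8"
```

This reads the whole file into memory, which is probably fine for the small text files a pre-commit hook usually sees. Whether "ascii" should pass or fail is a policy choice left to the caller.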