jumanjihouse / pre-commit-hooks

git pre-commit hooks that work with http://pre-commit.com/
MIT License

`require-ascii` doesn’t do what it says on the tin #104

Open Jayman2000 opened 2 years ago

Jayman2000 commented 2 years ago

According to the README:

require-ascii

What it does

Requires that text files have ascii-encoding, including the extended ascii set. This is useful to detect files that have unicode characters.

require-ascii will fail on files that are encoded in extended ASCII if:

  1. the file uses characters in the 128–255 range, and
  2. those characters aren’t followed by other characters that coincidentally make the sequence valid UTF-8 (see this table).
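To make the two conditions concrete, here is a minimal sketch. It assumes CP437 as the "extended ASCII" in question; `is_valid_utf8` is a hypothetical helper written for this illustration, not part of the hook. A lone byte 0xC3 is valid CP437 but not valid UTF-8, while the pair 0xC3 0xA9 is valid in both encodings because it happens to form the UTF-8 encoding of U+00E9.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A single byte in the 128-255 range: fine as CP437, but an incomplete
# multi-byte sequence as far as UTF-8 is concerned.
lone = bytes([0xC3])

# The same byte followed by 0xA9: still two valid CP437 characters, but the
# pair is also, by coincidence, a well-formed two-byte UTF-8 sequence.
pair = bytes([0xC3, 0xA9])

for name, data in (("lone", lone), ("pair", pair)):
    data.decode("cp437")  # never raises: every byte value maps to a CP437 character
    print(name, "valid UTF-8:", is_valid_utf8(data))
```

So whether a given CP437 file passes depends on an accident of which bytes happen to sit next to each other, not on whether the file is valid extended ASCII.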

This script will generate a bunch of files that contain valid extended ASCII but fail when tested by require-ascii:

# The README links to <https://theasciicode.com.ar/>. There are many different
# ways you could extend ASCII, but that site in particular says "In 1981,
# IBM developed an extension of 8-bit ASCII code, called 'code page 437'..."
extended_ascii = "cp437"

for code_point in range(128, 256):
    # Create a file that should pass require-ascii, but won't.
    with open(f"{code_point}.cp437.txt", mode='wb') as file:
        file.write(code_point.to_bytes(1, 'little'))
    # Make sure that that file really does contain valid extended ASCII.
    with open(f"{code_point}.cp437.txt", mode='rt', encoding=extended_ascii) as file:
        # This would raise a UnicodeDecodeError if the file contained
        # invalid extended ASCII.
        file.read()

A more accurate description of require-ascii would be:

require-ascii

What it does

Requires that text files use UTF-8 and only use code points ≤ 255.
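
If that description is right, the observable behavior would be roughly what the sketch below does. This is not the hook's actual source: `passes_check` is a hypothetical name, and the sketch assumes the hook decodes each file as UTF-8 and rejects any code point above 255.

```python
def passes_check(data: bytes) -> bool:
    """Hypothetical model of the behavior described above, not the real hook."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 at all, e.g. a lone byte in the 128-255 range.
        return False
    # Valid UTF-8: accept only if every code point is at most 255.
    return all(ord(ch) <= 255 for ch in text)


samples = {
    "plain ASCII": b"hello\n",
    "U+00E9 encoded as UTF-8": "\u00e9".encode("utf-8"),
    "U+20AC encoded as UTF-8": "\u20ac".encode("utf-8"),
    "single byte 0xE9 (extended ASCII)": bytes([0xE9]),
}

for label, data in samples.items():
    print(f"{label}: {'pass' if passes_check(data) else 'fail'}")
```

Under these assumptions, only the first two samples pass, and the one sample that really is single-byte extended ASCII fails.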