pre-commit / identify

File identification library for Python
MIT License

Some .gzip files identified as plaintext AND binary #450

Closed SebastianSchildt closed 2 months ago

SebastianSchildt commented 7 months ago

We observed odd behavior with some gzip files.

Good example:

https://github.com/eclipse/kuksa.val/raw/0.4.2/kuksa_databroker/createbom/licensestore/Apache-2.0.txt.gz

Bad example:

https://raw.githubusercontent.com/eclipse/kuksa.val/0.4.2/kuksa_databroker/createbom/licensestore/ring.LICENSE.txt.gz

identify version used:

pip install identify
Collecting identify
  Downloading identify-2.5.35-py2.py3-none-any.whl.metadata (4.4 kB)
Downloading identify-2.5.35-py2.py3-none-any.whl (98 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.9/98.9 kB 132.8 kB/s eta 0:00:00
Installing collected packages: identify
Successfully installed identify-2.5.35

The "good" file works as expected:

$ identify-cli Apache-2.0.txt.gz                                                                               
["binary", "file", "gzip", "non-executable"]

The "bad" one yields unexpected results:

$ identify-cli ring.LICENSE.txt.gz
["binary", "file", "gzip", "non-executable", "plain-text", "text"]

For reference, file says:

$ file *.gz 
Apache-2.0.txt.gz:   gzip compressed data, was "Apache2.txt", last modified: Tue Nov  8 17:08:57 2022, from Unix, original size modulo 2^32 11356
ring.LICENSE.txt.gz: gzip compressed data, was "ring.LICENSE.txt", last modified: Tue Feb 14 08:21:40 2023, from Unix, original size modulo 2^32 10125

Is this a bug?

asottile commented 2 months ago

The "LICENSE" part of the file name is triggering the conventional file name patterns.
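
To illustrate how a name-based match can combine with an extension-based match, here is a minimal sketch. It is not identify's actual implementation: the lookup tables `NAME_TAGS` and `EXTENSION_TAGS` and the function `sketch_tags_from_filename` are hypothetical stand-ins, and the real library adds further tags (such as "file" and "non-executable") from the file on disk. The sketch assumes conventional names are checked against each dot-separated part of the file name.

```python
# Simplified illustration only. These tables are hypothetical stand-ins,
# not the data shipped with identify.
NAME_TAGS = {"LICENSE": {"text", "plain-text"}}
EXTENSION_TAGS = {
    "gz": {"binary", "gzip"},
    "txt": {"text", "plain-text"},
}


def sketch_tags_from_filename(filename: str) -> set:
    """Collect tags from conventional names and from the final extension."""
    tags = set()

    # Assumption: the whole name, then each dot-separated part, is compared
    # against the table of conventional names; the first hit wins.
    for part in [filename, *filename.split(".")]:
        if part in NAME_TAGS:
            tags |= NAME_TAGS[part]
            break

    # Only the final extension is looked up, so "txt" in "x.txt.gz" is ignored.
    if "." in filename:
        ext = filename.rpartition(".")[2].lower()
        tags |= EXTENSION_TAGS.get(ext, set())

    return tags


print(sorted(sketch_tags_from_filename("Apache-2.0.txt.gz")))
# ['binary', 'gzip']
print(sorted(sketch_tags_from_filename("ring.LICENSE.txt.gz")))
# ['binary', 'gzip', 'plain-text', 'text']
```

Under these assumptions, the two sources of tags are simply unioned, which is how a single file name can end up tagged as both text and binary.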