saltstack / salt

salt/utils/files.py is_text — false "is not text" results with UTF-8 #66706

Open NMi-ru opened 4 months ago

NMi-ru commented 4 months ago

Description

The is_text function returns false negatives (text input is flagged as non-text/binary) when it is given UTF-8 text that contains multibyte characters.

Setup

Steps to Reproduce the behavior

One consequence is that the file.* states cannot show a diff of our config file changes:

{{sls}}__files:
  file.recurse:
    - name: /config/bird/
    - source: salt://modules/router-int/files/

protocol-static4.txt

salt-call state.apply …

     Changes:
              ----------
              /config/bird/protocol-static4:
                  ----------
                  diff:
                      Replace text file with binary file

Expected behavior

The is_text function should return True ("this is text") for every UTF-8 text file, including files that contain multibyte characters.

Versions Report

salt --versions-report

(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

```yaml
Salt Version:
  Salt: 3007.1

Python Version:
  Python: 3.10.14 (main, Apr 3 2024, 21:30:09) [GCC 11.2.0]

Dependency Versions:
  cffi: 1.16.0
  cherrypy: 18.8.0
  dateutil: 2.8.2
  docker-py: Not Installed
  gitdb: Not Installed
  gitpython: Not Installed
  Jinja2: 3.1.4
  libgit2: Not Installed
  looseversion: 1.3.0
  M2Crypto: Not Installed
  Mako: Not Installed
  msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
  packaging: 23.1
  pycparser: 2.21
  pycrypto: Not Installed
  pycryptodome: 3.19.1
  pygit2: Not Installed
  python-gnupg: 0.5.2
  PyYAML: 6.0.1
  PyZMQ: 25.1.2
  relenv: 0.16.0
  smmap: Not Installed
  timelib: 0.3.0
  Tornado: 6.3.3
  ZMQ: 4.3.4

Salt Package Information:
  Package Type: onedir

System Versions:
  dist: centos 9
  locale: utf-8
  machine: x86_64
  release: 6.5.13-1-pve
  system: Linux
  version: CentOS Stream 9
```

Additional context

My take on what's happening:

Non-ASCII UTF-8 characters (Cyrillic, for example) are multibyte. Example: capital Cyrillic "A" (А) is 0xD0 0x90.
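
To make the byte layout concrete, here is a small check using only the Python standard library. It is an illustration, not part of Salt; the variable names are made up for this example.

```python
capital_a = "\u0410"  # Cyrillic capital letter A
encoded = capital_a.encode("utf-8")
print(encoded.hex(" "))  # two bytes: d0 90

# Decoding only the lead byte fails, as any truncated multibyte sequence does.
try:
    encoded[:1].decode("utf-8")
    truncated_is_valid = True
except UnicodeDecodeError:
    truncated_is_valid = False
print(truncated_is_valid)
```

The lead byte 0xD0 on its own is not valid UTF-8, which is the situation the 512-byte cut can create.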

The is_text function reads the first 512 bytes of its input and then passes that block to decode:

https://github.com/saltstack/salt/blob/bfc78d7646fd12443337d5840dfb2927dd889f37/salt/utils/files.py#L642

642: def is_text(fp_, blocksize=512):

655: block = fp_.read(blocksize)
or
661: block = fp2_.read(blocksize)

672: block.decode("utf-8")

674: except UnicodeDecodeError:

678: return float(len(nontext)) / len(block) <= 0.30

If we're out of luck, the 512-byte cut splits a multibyte UTF-8 character in half and leaves only its first byte (0xD0, for example) at the end of the block. The block is then invalid UTF-8 (see lines 672/674), which in turn may lead (with some probability, see line 678) to a false "this is not text"/"binary" result.
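
The failure can be reproduced outside Salt with a simplified model of the lines quoted above. This is only a sketch: `is_text_model`, `is_text_tolerant`, and `TEXT_BYTES` are hypothetical names, the set of "text" bytes and the empty-block check are assumptions, and the tolerant variant is one possible approach (an incremental decoder that accepts an incomplete trailing sequence), not a fix taken from the Salt code base.

```python
import codecs
import io

# Assumption: printable ASCII plus a few control characters count as "text" bytes.
TEXT_BYTES = bytes(range(32, 127)) + b"\n\r\t\f\b"


def is_text_model(fp_, blocksize=512):
    """Simplified model of the quoted logic (hypothetical, not Salt's code)."""
    block = fp_.read(blocksize)
    if not block:
        return True  # assumption: an empty block is treated as text
    try:
        block.decode("utf-8")
        return True
    except UnicodeDecodeError:
        pass
    # Fallback heuristic: share of bytes that are not in TEXT_BYTES.
    nontext = block.translate(None, TEXT_BYTES)
    return len(nontext) / len(block) <= 0.30


def is_text_tolerant(fp_, blocksize=512):
    """Same model, but an incomplete trailing sequence is not an error."""
    block = fp_.read(blocksize)
    if not block:
        return True
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False buffers a truncated character at the end of the block
        # instead of raising; invalid bytes elsewhere still raise.
        decoder.decode(block, final=False)
        return True
    except UnicodeDecodeError:
        pass
    nontext = block.translate(None, TEXT_BYTES)
    return len(nontext) / len(block) <= 0.30


# One ASCII byte shifts the two-byte characters so that byte 512 is a lone 0xD0.
data = b"a" + "\u0410".encode("utf-8") * 300

print(is_text_model(io.BytesIO(data)))     # False: valid text flagged as binary
print(is_text_tolerant(io.BytesIO(data)))  # True
```

The input here is valid UTF-8 as a whole; only the 512-byte prefix is not, because it stops in the middle of a character.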

The first 512 bytes of the attached file (protocol-static4.txt) end with 0xD0:

dd if=protocol-static4.txt bs=1 count=512 | hexdump
…
00001f0 d0be d0bb d0b3 d0be d0b3 20be d028 d0b4
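
The hexdump row can be checked in Python. This sketch assumes hexdump's default output format (16-bit words in host byte order) on a little-endian machine, which matches the x86_64 system in the versions report; the variable names are hypothetical.

```python
# Words from the row at offset 0x1f0 (bytes 496..511 of the file).
words = "d0be d0bb d0b3 d0be d0b3 20be d028 d0b4".split()

# Each word is printed with its two bytes swapped on a little-endian host.
tail = b"".join(bytes.fromhex(word)[::-1] for word in words)

print(tail.hex(" "))
print(hex(tail[-1]))  # 0xd0: a lead byte with no continuation byte after it

# Dropping the leading continuation byte and the dangling lead byte
# leaves a valid UTF-8 fragment.
print(tail[1:-1].decode("utf-8"))
```

Byte 512 is therefore the first half of a two-byte character, and the second half falls outside the block that is_text reads.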