Open NMi-ru opened 4 months ago
Description
The is_text function gives false negatives (the input is flagged as non-text/binary) when given UTF-8 text that contains multibyte characters.
Setup
Steps to Reproduce the behavior
One bad consequence is that the file.* functions cannot output a diff of our config file changes:
protocol-static4.txt
Expected behavior
The is_text function should return True ("this is text") for all multibyte UTF-8 text files.
Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
```yaml
Salt Version:
  Salt: 3007.1

Python Version:
  Python: 3.10.14 (main, Apr 3 2024, 21:30:09) [GCC 11.2.0]

Dependency Versions:
  cffi: 1.16.0
  cherrypy: 18.8.0
  dateutil: 2.8.2
  docker-py: Not Installed
  gitdb: Not Installed
  gitpython: Not Installed
  Jinja2: 3.1.4
  libgit2: Not Installed
  looseversion: 1.3.0
  M2Crypto: Not Installed
  Mako: Not Installed
  msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
  packaging: 23.1
  pycparser: 2.21
  pycrypto: Not Installed
  pycryptodome: 3.19.1
  pygit2: Not Installed
  python-gnupg: 0.5.2
  PyYAML: 6.0.1
  PyZMQ: 25.1.2
  relenv: 0.16.0
  smmap: Not Installed
  timelib: 0.3.0
  Tornado: 6.3.3
  ZMQ: 4.3.4

Salt Package Information:
  Package Type: onedir

System Versions:
  dist: centos 9
  locale: utf-8
  machine: x86_64
  release: 6.5.13-1-pve
  system: Linux
  version: CentOS Stream 9
```
Additional context
My take on what's happening:
Non-ASCII UTF-8 characters (Cyrillic, for example) are multibyte. Example: capital Cyrillic "A" (А) is 0xD0 0x90.
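The byte values above can be checked with a few lines of Python. This is only an illustrative sketch; `utf8_hex` is a hypothetical helper name and is not part of Salt.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of a string as space-separated hex bytes."""
    return " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))


# ASCII characters take one byte, Cyrillic letters take two.
print(utf8_hex("A"))       # Latin capital A (U+0041)
print(utf8_hex("\u0410"))  # Cyrillic capital A (U+0410) -> 0xD0 0x90
```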
The "is_text" function gets its input, snips the first 512 bytes, then feeds them to the "decode" function:
https://github.com/saltstack/salt/blob/bfc78d7646fd12443337d5840dfb2927dd889f37/salt/utils/files.py#L642
If we're out of luck, the 512-byte snip cuts a multibyte UTF-8 character in half, leaving only its first byte (0xD0, for example). That makes the block invalid UTF-8 (see lines 672/674), which in turn may lead, with some probability (see line 678), to a false "this is not text"/"binary" result.
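The truncation effect described above can be reproduced without Salt. The sketch below is not Salt's code: `strict_decode_ok` and `tolerant_decode_ok` are hypothetical helpers, and the 512-byte block size is taken from the description in this issue. The tolerant variant shows one possible mitigation, assuming it is acceptable to ignore an incomplete character at the very end of the block: Python's incremental UTF-8 decoder, called with `final=False`, buffers a trailing partial sequence instead of raising.

```python
import codecs

BLOCKSIZE = 512  # block size as described in this issue


def strict_decode_ok(block: bytes) -> bool:
    """True if the whole block is valid, complete UTF-8."""
    try:
        block.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def tolerant_decode_ok(block: bytes) -> bool:
    """True if the block is valid UTF-8, allowing a character cut off at the end."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False: an incomplete trailing sequence is buffered, not an error.
        decoder.decode(block, final=False)
        return True
    except UnicodeDecodeError:
        return False


# 511 ASCII bytes followed by Cyrillic text: byte 512 is 0xD0,
# the first byte of Cyrillic capital A (0xD0 0x90).
data = b"a" * 511 + "\u0410\u0411\u0412".encode("utf-8")
block = data[:BLOCKSIZE]

print(hex(block[-1]))             # 0xd0
print(strict_decode_ok(data))     # True: the full file is valid UTF-8
print(strict_decode_ok(block))    # False: the snip ends mid-character
print(tolerant_decode_ok(block))  # True: only the last character is incomplete
```

Genuinely invalid input, such as a block starting with 0xFF, is still rejected by the tolerant check, so this only relaxes the rule for a character split at the block boundary.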
Attached file (protocol-static4.txt) ends with 0xD0: