saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Install Salt from the Salt package repositories here:
https://docs.saltproject.io/salt/install-guide/en/latest/
Apache License 2.0
14.19k stars 5.48k forks source link

Windows: UCS-2 aka UTF-16 textfiles are treated as binary files by file.replace #52793

Open dhs-rec opened 5 years ago

dhs-rec commented 5 years ago

Description of Issue

When trying to run file.replace on a UCS-2 encoded file on Windows, it errors out with the following message:

----------
          ID: patch_minimum_ps_version
    Function: file.replace
        Name: C:/Program Files/WindowsPowerShell/Modules/PSWindowsUpdate/PSWindowsUpdate.psd1
      Result: False
     Comment: An exception occurred in this state: Traceback (most recent call last):
                File "c:\salt\bin\lib\site-packages\salt\state.py", line 1919, in call
                  **cdata['kwargs'])
                File "c:\salt\bin\lib\site-packages\salt\loader.py", line 1918, in wrapper
                  return f(*args, **kwargs)
                File "c:\salt\bin\lib\site-packages\salt\states\file.py", line 4331, in replace
                  backslash_literal=backslash_literal)
                File "c:\salt\bin\lib\site-packages\salt\modules\file.py", line 2229, in replace
                  .format(path)
              salt.exceptions.SaltInvocationError: Cannot perform string replacements on a binary file: C:\Program Files\WindowsPowerShell\Modules\PSWindowsUpdate\PSWindowsUpdate.psd1
     Started: 14:12:03.685800
    Duration: 15.6 ms
     Changes:

Instead of doing this, file.replace should support specifying the correct encoding for the file, for example:

patch_minimum_ps_version:
  file.replace:
    - name: C:/Program Files/WindowsPowerShell/Modules/PSWindowsUpdate/PSWindowsUpdate.psd1
    - pattern: PowerShellVersion\s+=\s+.*
    - repl: PowerShellVersion = '3.0'
    - encoding: utf16

The following Python(3) example does this (and even automatically writes a BOM):

#!/usr/bin/python3

import re

file = 'PSWindowsUpdate.psd1'

with open(file, 'r', encoding='utf16') as pswu:
  text = pswu.read()

with open(file + '.new', 'w', encoding='utf16') as f:
  for line in text.splitlines():
    if re.match('^PowerShellVersion\s+=\s+.*', line):
      f.write('PowerShellVersion = \'3.0\'\r\n')
    else:
      f.write(line + '\r\n')

Versions Report

> salt-call --versions-report
Salt Version:
           Salt: 2018.3.4

Dependency Versions:
           cffi: 1.10.0
       cherrypy: 10.2.1
       dateutil: 2.6.1
      docker-py: Not Installed
          gitdb: 2.0.5
      gitpython: 2.1.3
          ioflo: Not Installed
         Jinja2: 2.9.6
        libgit2: Not Installed
        libnacl: 1.6.1
       M2Crypto: Not Installed
           Mako: 1.0.6
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: Not Installed
      pycparser: 2.17
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.5.3 (v3.5.3:1880cb95a742, Jan 16 2017, 16:02:32) [MSC v.1900 64 bit (AMD64)]
   python-gnupg: 0.4.1
         PyYAML: 3.12
          PyZMQ: 16.0.3
           RAET: Not Installed
          smmap: 2.0.5
        timelib: 0.2.4
        Tornado: 4.5.1
            ZMQ: 4.1.6

System Versions:
           dist:
         locale: cp1252
        machine: AMD64
        release: 2008ServerR2
         system: Windows
        version: 2008ServerR2 6.1.7601 SP1 Multiprocessor Free
dhs-rec commented 5 years ago

Might even be possible to determine the encoding automatically. At least the file command on Linux is able to do so:

% file PSWindowsUpdate.psd1 
PSWindowsUpdate.psd1: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators
dhs-rec commented 5 years ago

The following, slightly modified version of above sample code does this, using libmagic Python bindings (aka. python3-magic):

#!/usr/bin/python3

import magic
import re

file = 'PSWindowsUpdate.psd1'

# Open file in binary mode and detect encoding
text = open(file, 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(text)

# Read and write file with correct encoding
text = open(file, 'r', encoding=encoding).read()
with open(file + '.new', 'w', encoding=encoding) as f:
  for line in text.splitlines():
    if re.match('^PowerShellVersion\s+=\s+.*', line):
      f.write('PowerShellVersion = \'3.0\'\r\n')
    else:
      f.write(line + '\r\n')
waynew commented 5 years ago

I, for one, would like to overhaul a lot of how we handle file/stream encodings, with the ability to specify all the things (and perhaps default to utf-8, if possible).

dhs-rec commented 5 years ago

Sounds good, although I'd consider this a bug.

Anyway, here's a version of the sample code using chardet, which might be more platform independent:

#!/usr/bin/python3

import chardet
import re

file = 'PSWindowsUpdate.psd1'

# Detect character encoding
charenc = chardet.detect(open(file, 'rb').read())['encoding']
print(charenc)

# Read file and detect linefeed type
with open(file, 'r', encoding=charenc) as f:
  text = f.read()
  lf = f.newlines

# Write new file, replacing some text
with open(file + '.new', 'w', encoding=charenc, newline=lf) as f:
  for line in text.splitlines():
    if re.match('^PowerShellVersion\s+=\s+.*', line):
      f.write('PowerShellVersion = \'3.0\'\n')
    else:
      f.write(line + '\n')
dhs-rec commented 5 years ago

Any news on this bug?

waynew commented 5 years ago

Nothing currently - we're currently pushing hard to stabilize our test suites/pipeline, but this is definitely on my radar.

My personal opinion is that Salt should never assume encoding (though right now we assume utf-8, if we assume anything). I think I'd welcome a PR that adds an encoding argument to file.replace, if it doesn't require a more invasive change. Otherwise I think we should open a SEP detailing the work required/risks to updating the way Salt handles encoding things.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

dhs-rec commented 4 years ago

Ping.

stale[bot] commented 4 years ago

Thank you for updating this issue. It is no longer marked as stale.

amalaguti commented 1 year ago

file.managed seems to have a similar issue, trying to manage a file encoded utf-16 LE (xml export from Windows Task Scheduler), added some Jinja to templatize the file, but is treated as binary. not replacing jinja by defaults values

get_xml:
  file.managed:
    - name: 'C:\task.xml'
    - source: salt://win_tasks/some_template.txt
    - template: jinja
    # - encoding: tried with utf16, utf16-le, utf-16-le 
    - defaults:
        TimeTrigger_StartBoundary: {{ TimeTrigger_StartBoundary }}
        TimeTrigger_StopBoundary: {{ TimeTrigger_StopBoundary }}       
some_template.txt (UTF-16LE encoded)
      <StartBoundary>{{ TimeTrigger_StartBoundary }}</StartBoundary>
      <EndBoundary>{{ TimeTrigger_StopBoundary }}</EndBoundary>
      <ExecutionTimeLimit>PT30M</ExecutionTimeLimit>

This causes the file to be treated as binary and Jinja does not get replaced. The file looks exactly like original

minion-win-1:
----------
          ID: get_xml
    Function: file.managed
        Name: C:\task.xml
      Result: True
     Comment: File C:\task.xml updated
     Started: 17:39:38.661603
    Duration: 140.625 ms
     Changes:
              ----------
              diff:
                  Replace text file with binary file

As soon encoding of the file is changed to utf-8, the file is replaced by text file and jinja values are replaced.