saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0
14.13k stars 5.47k forks source link

gpg decryption fails with large number of SSH connections #51356

Open xuhcc opened 5 years ago

xuhcc commented 5 years ago

Description of Issue/Question

When I'm applying states using salt-ssh with a big enough roster (53 machines in my case), and default max_procs (25), I usually get several Failed to return clean data errors in the beginning of the execution. Recently I started to put encrypted data in pillar, and now I get gpg decryption errors:

$ salt-ssh '*' state.apply
server1:
    ----------
    _error:
        Failed to return clean data
    retcode:
        255
    stderr:
        ssh_exchange_identification: Connection closed by remote host
        ssh_exchange_identification: Connection closed by remote host
    stdout:
server1:
    ----------
    _error:
        Failed to return clean data
    retcode:
        255
    stderr:
        ssh_exchange_identification: Connection closed by remote host
        ssh_exchange_identification: Connection closed by remote host
    stdout:

[WARNING ] Could not decrypt cipher -----BEGIN PGP MESSAGE-----

.....
-----END PGP MESSAGE-----
, received: [GNUPG:] ENC_TO .....
[GNUPG:] KEY_CONSIDERED .....
[GNUPG:] KEY_CONSIDERED .....
gpg: encrypted with 2048-bit RSA key, ID ....., created 2019-01-17
      "...."
gpg: public key decryption failed: Cannot allocate memory
[GNUPG:] ERROR pkdecrypt_failed 16810070
[GNUPG:] BEGIN_DECRYPTION
[GNUPG:] DECRYPTION_FAILED
gpg: decryption failed: No secret key
[GNUPG:] END_DECRYPTION

While connection usually fails only for few targets, decryption error occurs for all of them, and state execution continues after that, so target machines receive encrypted data instead of decrypted. For example, instead of the password some config will contain encrypted message.

It is not clear whether decryption errors are related to connection errors or not. But both occur only with large number of workers, and could be prevented by setting max_procs to lower value such as 8. There are Cannot allocate memory messages in the log, but memory is not exhausted.

Setup

pillar:

#!yaml|gpg

smtp:
  password: |
    -----BEGIN PGP MESSAGE-----
    .....
    -----END PGP MESSAGE-----

state:

test_state:
   cmd.run:
    - name: 'echo "{{ pillar['smtp']['password'] }}"'

Versions Report

$ salt-ssh --versions-report
Salt Version:
           Salt: 2018.3.3

Dependency Versions:
           cffi: 1.11.5
       cherrypy: Not Installed
       dateutil: 2.7.5
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.10
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: 0.30.1
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.0
   mysql-python: Not Installed
      pycparser: 2.19
       pycrypto: 3.7.3
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.15 (default, Jan 10 2019, 23:20:52)
   python-gnupg: Not Installed
         PyYAML: 3.13
          PyZMQ: 17.1.2
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 5.1.1
            ZMQ: 4.3.1

System Versions:
           dist:   
         locale: UTF-8
        machine: x86_64
        release: 4.20.3-arch1-1-ARCH
         system: Linux
        version: Not Installed
Ch3LL commented 5 years ago

is there a way for you to narrow down if this occurs when not using gpg?

Does this occur if you use a simple cmd.run state? Or only when you run multiple states in a state run?

Any other information to help replicate?

xuhcc commented 5 years ago

ssh_exchange_identification: Connection closed by remote host error occurs even with test.ping if I set max_procs to 15-16 or more.

decryption failed error starts to occur at approximately the same number of processes (17-18). Single cmd.run state is enough to cause error: salt-ssh '*' state.single cmd.run name='echo 1'

Another observation is that decryption error occurs only when there is enough encrypted items in the pillar (6-7 items or more).

Ch3LL commented 5 years ago

What is the cpu/memory usage look like when you hit this 17-18 number of processes?

Ch3LL commented 5 years ago

also ping @thatch45 i know i was talking to you there other day about salt-ssh and the number of processes it can handle. Does this issue look like a limitation or is this a bug we need to dive into?

thatch45 commented 5 years ago

I don't think this should be a blocker. There is clearly a Cannot Allocate Memory error, and we don't know the specifics of this system around how big the decryption items are, how much much memory is available and what the overall usage looks like.

xuhcc commented 5 years ago

@Ch3LL CPU usage is relatively high, memory usage is insignificant (at max_procs=25):

Ch3LL commented 5 years ago

can you share a sanitized version of your configurations and the debug log?

xuhcc commented 5 years ago

@Ch3LL I published it here https://github.com/xuhcc/salt-test With this configuration I can reliably produce decryption failed error on a fresh Arch Linux VM.

Ch3LL commented 5 years ago

thanks for the additional information. it looks like i'm able to replicate the error only sometimes when running state.sls


hQEMA6bfuyRcWr1FAQf/Te4cO54I5s/uTJVa5HhZUL2N9aciskoSqGhneABTHB0Y                                                                                                   
cHbzkoh2pGQkRDkurNcLLja4BANLRH0JRgPP7RomocXXzRn+MtYP4brFidO4L55O                                                                                                   
0pwxJ5O6lt3tVf/z03WlTSPJqYwUjns5yIQwZoNnB2MkWpKZyLgJUUDYEF4VdAZ0                                                                                                   
yBgn734kMf3/oNv971VCXVH9w9uSuijp4q+Eg4BjR7+XHxZU7aTmmxhM543Weztz                                                                                                   
tB6b9X4ylhZDD3Ip3k10cHgJW7Hz9ObiDPkdDuuWEdPMsdk6nStDBduEyVkjHstp                                                                                                   
vu1KPJ8p3G/VJZZJOzG22g7sZtwbRjVbAvcvYVLQjdJFAXmYjTYO4nPpI5V9zaTv                                                                                                   
LDWfRGxZ3j4iPrC9wPNr/bKy4fFfprfb7tKSqgI45yP+Y7DYn02bAJelDTVlVXBp                                                                                                   
FG+ZzEGu                                                                                                                                                           
=KCMF                                                                                                                                                              
-----END PGP MESSAGE-----                                                                                                                                          
, received: gpg: WARNING: unsafe ownership on homedir '/home/ch3ll/git/salt-test/gpgkeys'                                                                          
[GNUPG:] ENC_TO A6DFBB245C5ABD45 1 0                                                                                                                               
[GNUPG:] KEY_CONSIDERED E29E9967C11F5E2AA5C69491522FBB10F2981AB6 0                                                                                                 
[GNUPG:] KEY_CONSIDERED E29E9967C11F5E2AA5C69491522FBB10F2981AB6 0                                                                                                 
gpg: encrypted with 2048-bit RSA key, ID A6DFBB245C5ABD45, created 2019-02-28                                                                                      
      "salt-test"                                                                                                                                                  
gpg: public key decryption failed: Cannot allocate memory                                                                                                          
[GNUPG:] ERROR pkdecrypt_failed 16810070                                                                                                                           
[GNUPG:] BEGIN_DECRYPTION                                                                                                                                          
[GNUPG:] DECRYPTION_FAILED                                                                                                                                         
gpg: decryption failed: No secret key                                                                                                                              
[GNUPG:] END_DECRYPTION

I did find this gpg issue https://dev.gnupg.org/T3530 which adds the option to automatically increase secure memory. I believe what is going on is gpg is not able to handle the concurrent connections. if you start your gpg-agent with that option does it resolve the issue?

xuhcc commented 5 years ago

@Ch3LL I added auto-expand-secmem 0x30000 to my gpg-agent.conf and it solved the problem, Cannot allocate memory error no longer occurs. Thank you!

But I think there is one thing that could be improved. gpg-agent may throw various errors, and they are not handled properly by salt. In case of such error, sls is being rendered using raw encrypted text from pillar. For example, when I run this state:

echo_test:
  cmd.run:
    - name: echo {{ pillar['value1'] }}

With failing gpg-agent I get:

xxx.xxx.xxx.xxx:
    - Rendering SLS 'base:echo' failed: could not find expected ':'; line 6

      ---
      echo_test:
        cmd.run:
          - name: echo -----BEGIN PGP MESSAGE-----

      hQEMA24depKIvhDVAQgAj08337qNBVKIonuaaU3Bd0eQM2AwD9rrV0VRgNK47ZAy
      W3BXU1KipbynSsW7TpjMp8VCCDyNshbcQ58A2O5wH1QzrTzRXI2ztuMP8APqONrP    <======================
      s+K4OTrCipfkpqgTLGEq7UP8mBFj2Yw30MBtBnFjMWB7SWFb9NpECCHXjMrXzD7k
      ZlniN4UjFBp0pztjPfO1Sn2Ej0fbuHGoRDwY/Mx3fg9xRMdG09sc/A+nxPAwbhTf
      EqcfIjgZ5KfqWOgpdj2gim0UY57PLBbsUTyKKyqY05bpAILot3hC/hwvqqDI3Cen
      Xc1qeMPid42RbSXEkiSLjhne0xTS+ikw3v/B1SufpdJEAZJuJ+3mrCyCSlEqkzcz
      J54XmE0m4gpXzAi1zj/NpXje48T48hHHTGT39pLW4HIB6rKJfQAsN34hWLfRr2Nm
      [...]
      ---

It is particularly dangerous if pillar data is used as content for some file, because this file will be overwritten with encrypted text.

Ch3LL commented 5 years ago

im glad that helped!!! I agree we could handle these errors better, i'll label as bug. thanks for trying that out, i really appreciate it :)

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

xuhcc commented 4 years ago

Not fixed.

stale[bot] commented 4 years ago

Thank you for updating this issue. It is no longer marked as stale.

Heiko-san commented 4 years ago

The gpg-agent.conf workaround doesn't seem to work for me. However it seems, the error only occurs with a certain setting of worker_threads in comibination with a certain amount of minions targeted. In my case I target about 85 minions and have a worker_thread configuration of 20. Actually lowering it to 19 everything works without the error.

I issue something like salt '*' state.apply monitoring test=1 where monitoring is a custom formula with encrypted pillars.

The memory on my machine (using Ubuntu 18.04 server) is not full at the time the errors occur:

free -m
              total        used        free      shared  buff/cache   available
Mem:          36183       15141        8985          72       12056       20528
Swap:          1951           0        1951

The errors actually are just warnings, saying something like:

[WARNING ][18462] Could not decrypt cipher '-----BEGIN PGP MESSAGE-----...END PGP MESSAGE-----', received: 'gpg: invalid armor header: ...[GNUPG:] ENC_TO ... 1 0\n[GNUPG:] KEY_CONSIDERED ... 0\n[GNUPG:] KEY_CONSIDERED ... 0\ngpg: encrypted with 3072-bit RSA key, ID ..., created 2019-12-10\n      "..."\ngpg: public key decryption failed: Cannot allocate memory\n[GNUPG:] ERROR pkdecrypt_failed 16810070\n[GNUPG:] BEGIN_DECRYPTION\n[GNUPG:] DECRYPTION_FAILED\ngpg: decryption failed: No secret key\n[GNUPG:] END_DECRYPTION\n'
wwalker commented 3 years ago

We are seeing the same problem on 18.04.

:cone-of-shame: we are on 2018.3.5 - literally were waiting until we finished a data center move that is in progress.

so, this problem doesn't exist on 16.04. anyone know how to get 18.04 to NOT use gpg-agent? 16.04 doesn't.

I looked at the salt source and diffed and searched the python-gnupg modules from the different ubuntu versions

it defaults to use_agent=false so I can't see why it is using gpg-agent at all.

Anyone know how to get it to not use gpg-agent?

jfly commented 1 year ago

For the record, https://github.com/saltstack/salt/commit/ef56aa0547ee4e22ca1d672798a0fbcaf2766b1b introduces a new gpg_decrypt_must_succeed option that prevents un-decrypted pillar data from getting used.