xuhcc opened this issue 5 years ago
Is there a way for you to narrow down whether this occurs when not using gpg? Does it occur with a simple cmd.run state, or only when you run multiple states in a state run? Any other information to help replicate?
The ssh_exchange_identification: Connection closed by remote host error occurs even with test.ping if I set max_procs to 15-16 or more.

The decryption failed error starts to occur at approximately the same number of processes (17-18). A single cmd.run state is enough to cause the error: salt-ssh '*' state.single cmd.run name='echo 1'

Another observation is that the decryption error occurs only when there are enough encrypted items in the pillar (6-7 items or more).
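For context, encrypted pillar items in Salt are typically declared with the gpg renderer; a pillar SLS looks roughly like this (placeholder ciphertext, not taken from this issue):

```
#!jinja|yaml|gpg

value1: |
  -----BEGIN PGP MESSAGE-----
  ...ciphertext...
  -----END PGP MESSAGE-----
```

Each such item triggers a gpg decryption during pillar rendering, which is why the failure correlates with the number of encrypted items.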
What does the CPU/memory usage look like when you hit this 17-18 process count?
Also pinging @thatch45: I know I was talking to you the other day about salt-ssh and the number of processes it can handle. Does this issue look like a limitation, or is this a bug we need to dive into?
I don't think this should be a blocker. There is clearly a Cannot allocate memory error, and we don't know the specifics of this system: how big the encrypted items are, how much memory is available, and what the overall usage looks like.
@Ch3LL CPU usage is relatively high, memory usage is insignificant (at max_procs=25):
Can you share a sanitized version of your configurations and the debug log?
@Ch3LL I published it here: https://github.com/xuhcc/salt-test. With this configuration I can reliably reproduce the decryption failed error on a fresh Arch Linux VM.
Thanks for the additional information. It looks like I'm able to replicate the error, but only sometimes, when running state.sls:
Could not decrypt cipher '-----BEGIN PGP MESSAGE-----
hQEMA6bfuyRcWr1FAQf/Te4cO54I5s/uTJVa5HhZUL2N9aciskoSqGhneABTHB0Y
cHbzkoh2pGQkRDkurNcLLja4BANLRH0JRgPP7RomocXXzRn+MtYP4brFidO4L55O
0pwxJ5O6lt3tVf/z03WlTSPJqYwUjns5yIQwZoNnB2MkWpKZyLgJUUDYEF4VdAZ0
yBgn734kMf3/oNv971VCXVH9w9uSuijp4q+Eg4BjR7+XHxZU7aTmmxhM543Weztz
tB6b9X4ylhZDD3Ip3k10cHgJW7Hz9ObiDPkdDuuWEdPMsdk6nStDBduEyVkjHstp
vu1KPJ8p3G/VJZZJOzG22g7sZtwbRjVbAvcvYVLQjdJFAXmYjTYO4nPpI5V9zaTv
LDWfRGxZ3j4iPrC9wPNr/bKy4fFfprfb7tKSqgI45yP+Y7DYn02bAJelDTVlVXBp
FG+ZzEGu
=KCMF
-----END PGP MESSAGE-----
, received: gpg: WARNING: unsafe ownership on homedir '/home/ch3ll/git/salt-test/gpgkeys'
[GNUPG:] ENC_TO A6DFBB245C5ABD45 1 0
[GNUPG:] KEY_CONSIDERED E29E9967C11F5E2AA5C69491522FBB10F2981AB6 0
[GNUPG:] KEY_CONSIDERED E29E9967C11F5E2AA5C69491522FBB10F2981AB6 0
gpg: encrypted with 2048-bit RSA key, ID A6DFBB245C5ABD45, created 2019-02-28
"salt-test"
gpg: public key decryption failed: Cannot allocate memory
[GNUPG:] ERROR pkdecrypt_failed 16810070
[GNUPG:] BEGIN_DECRYPTION
[GNUPG:] DECRYPTION_FAILED
gpg: decryption failed: No secret key
[GNUPG:] END_DECRYPTION
I did find this gpg issue, https://dev.gnupg.org/T3530, which adds an option to automatically increase secure memory. I believe what is going on is that gpg is not able to handle the concurrent connections. If you start your gpg-agent with that option, does it resolve the issue?
@Ch3LL I added auto-expand-secmem 0x30000 to my gpg-agent.conf and it solved the problem; the Cannot allocate memory error no longer occurs. Thank you!
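For reference, the fix amounts to one line in gpg-agent.conf (the 0x30000 value is the one used in this thread; the path assumes the default GnuPG homedir):

```
# ~/.gnupg/gpg-agent.conf
# Allow the agent to grow its secure memory pool on demand instead of
# failing with "Cannot allocate memory" under concurrent decryptions.
auto-expand-secmem 0x30000
```

The agent must be restarted afterwards (e.g. with `gpgconf --kill gpg-agent`) for the setting to take effect.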
But I think there is one thing that could be improved. gpg-agent may throw various errors, and they are not handled properly by Salt. In case of such an error, the SLS is rendered using the raw encrypted text from the pillar. For example, when I run this state:
echo_test:
  cmd.run:
    - name: echo {{ pillar['value1'] }}
With failing gpg-agent I get:
xxx.xxx.xxx.xxx:
- Rendering SLS 'base:echo' failed: could not find expected ':'; line 6
---
echo_test:
  cmd.run:
    - name: echo -----BEGIN PGP MESSAGE-----
hQEMA24depKIvhDVAQgAj08337qNBVKIonuaaU3Bd0eQM2AwD9rrV0VRgNK47ZAy
W3BXU1KipbynSsW7TpjMp8VCCDyNshbcQ58A2O5wH1QzrTzRXI2ztuMP8APqONrP <======================
s+K4OTrCipfkpqgTLGEq7UP8mBFj2Yw30MBtBnFjMWB7SWFb9NpECCHXjMrXzD7k
ZlniN4UjFBp0pztjPfO1Sn2Ej0fbuHGoRDwY/Mx3fg9xRMdG09sc/A+nxPAwbhTf
EqcfIjgZ5KfqWOgpdj2gim0UY57PLBbsUTyKKyqY05bpAILot3hC/hwvqqDI3Cen
Xc1qeMPid42RbSXEkiSLjhne0xTS+ikw3v/B1SufpdJEAZJuJ+3mrCyCSlEqkzcz
J54XmE0m4gpXzAi1zj/NpXje48T48hHHTGT39pLW4HIB6rKJfQAsN34hWLfRr2Nm
[...]
---
It is particularly dangerous if pillar data is used as the content of some file, because that file will be overwritten with encrypted text.
I'm glad that helped!!! I agree we could handle these errors better; I'll label this as a bug. Thanks for trying that out, I really appreciate it :)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.
Not fixed.
Thank you for updating this issue. It is no longer marked as stale.
The gpg-agent.conf workaround doesn't seem to work for me. However, it seems the error only occurs with a certain worker_threads setting in combination with a certain number of targeted minions. In my case I target about 85 minions and have worker_threads set to 20; lowering it to 19 makes everything work without the error.
I issue something like salt '*' state.apply monitoring test=1, where monitoring is a custom formula with encrypted pillars.
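The setting referenced above lives in the Salt master configuration (the value is the one reported to work in this comment):

```
# /etc/salt/master
worker_threads: 19
```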
The memory on my machine (using Ubuntu 18.04 server) is not full at the time the errors occur:
free -m
              total        used        free      shared  buff/cache   available
Mem:          36183       15141        8985          72       12056       20528
Swap:          1951           0        1951
The errors actually are just warnings, saying something like:
[WARNING ][18462] Could not decrypt cipher '-----BEGIN PGP MESSAGE-----...END PGP MESSAGE-----', received: 'gpg: invalid armor header: ...[GNUPG:] ENC_TO ... 1 0\n[GNUPG:] KEY_CONSIDERED ... 0\n[GNUPG:] KEY_CONSIDERED ... 0\ngpg: encrypted with 3072-bit RSA key, ID ..., created 2019-12-10\n "..."\ngpg: public key decryption failed: Cannot allocate memory\n[GNUPG:] ERROR pkdecrypt_failed 16810070\n[GNUPG:] BEGIN_DECRYPTION\n[GNUPG:] DECRYPTION_FAILED\ngpg: decryption failed: No secret key\n[GNUPG:] END_DECRYPTION\n'
We are seeing the same problem on 18.04.
:cone-of-shame: We are on 2018.3.5; we were literally waiting until we finished a data center move that is in progress.
So, this problem doesn't exist on 16.04. Does anyone know how to get 18.04 to NOT use gpg-agent? 16.04 doesn't. I looked at the Salt source, and diffed and searched the python-gnupg modules from the different Ubuntu versions; it defaults to use_agent=False, so I can't see why it is using gpg-agent at all. Anyone know how to get it to not use gpg-agent?
For the record, https://github.com/saltstack/salt/commit/ef56aa0547ee4e22ca1d672798a0fbcaf2766b1b introduces a new gpg_decrypt_must_succeed option that prevents un-decrypted pillar data from being used.
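The idea behind such an option can be sketched as a guard that refuses to hand ciphertext on to the renderer. The helper below is illustrative only, not Salt's actual implementation:

```python
# Sketch of the safeguard idea behind gpg_decrypt_must_succeed: if
# decryption fails (gpg returns nothing, or the ciphertext passes through
# unchanged), raise instead of silently rendering the encrypted text.
PGP_HEADER = "-----BEGIN PGP MESSAGE-----"

def safe_decrypt(cipher_text, decrypt_fn, must_succeed=True):
    """Return decrypted text, or raise if ciphertext survived decryption."""
    result = decrypt_fn(cipher_text)
    if must_succeed and (result is None or PGP_HEADER in result):
        raise ValueError("could not decrypt pillar data; refusing to render ciphertext")
    return result

# A decrypt function that fails silently, returning its input unchanged --
# exactly the failure mode described in this issue.
broken = lambda text: text

try:
    safe_decrypt(PGP_HEADER + "\nhQEM...", broken)
except ValueError as err:
    print("blocked:", err)
```

With the guard in place, a failing gpg-agent produces a hard rendering error instead of writing ciphertext into managed files.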
Description of Issue/Question
When I'm applying states using salt-ssh with a big enough roster (53 machines in my case) and the default max_procs (25), I usually get several Failed to return clean data errors at the beginning of the execution. Recently I started to put encrypted data in the pillar, and now I get gpg decryption errors. While the connection usually fails only for a few targets, the decryption error occurs for all of them, and state execution continues after that, so target machines receive encrypted data instead of decrypted. For example, instead of a password, some config will contain the encrypted message.
It is not clear whether the decryption errors are related to the connection errors or not. But both occur only with a large number of workers, and can be prevented by setting max_procs to a lower value such as 8. There are Cannot allocate memory messages in the log, but memory is not exhausted.

Setup
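The max_procs workaround mentioned above can be set in the master configuration for salt-ssh runs (option name per the Salt master config reference; a sketch):

```
# /etc/salt/master -- caps salt-ssh's concurrent SSH processes
ssh_max_procs: 8
```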
pillar:
state:
Versions Report