lrstanley / vault-unseal

auto-unseal utility for Hashicorp Vault
MIT License

Vault-unseal process high CPU usage #17

Closed medinaLeo closed 2 years ago

medinaLeo commented 2 years ago

🌧 Describe the problem

Hi,

I have a problem with the vault-unseal process: it is constantly using 100% of the CPU, even though the process otherwise appears to work correctly. This service is running in a Proxmox Linux container (LXC) with CentOS 7.

The version of vault-unseal I have installed is the latest (0.2.2), but this also happened with previous versions.

[root@mypc ~]$ rpm -qa | grep vault-unseal
vault-unseal-0.2.2-1.x86_64
[root@mypc ~]$

Here is a top capture where you can see the high CPU usage:

(screenshot: top output showing vault-unseal CPU usage, 2022-08-03)

If you need anything else to diagnose what's going on, I'll be available.

Thanks!!!

⛅ Expected behavior

Reduce the high CPU usage of the process

🔄 Minimal reproduction

No response

💠 Version: vault-unseal

v0.2.2

🖥 Version: Operating system

linux/centos

⚙ Additional context

No response

๐Ÿค Requirements

lrstanley commented 2 years ago

Could you share your config/flags/etc (with the sensitive stuff stripped)? Curious if there is some default being hit with a timer that's causing it to loop too frequently.

medinaLeo commented 2 years ago

Yes, of course, this is my config file:

(screenshot: vault-unseal config file, 2022-08-04)

If you need more information, let me know. Regards
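Since the config is only visible as a screenshot, here is a redacted sketch of the shape being discussed. Only `check_interval` and `unseal_tokens` are confirmed by this thread; the remaining keys, hostnames, and values are illustrative placeholders, so check the project README for the exact schema.

```yaml
# /etc/vault-unseal.yaml -- redacted sketch, not the reporter's actual file.
# check_interval and unseal_tokens are the options discussed in this thread;
# the other keys and all values are assumed placeholders.
environment: ""
check_interval: 60s
vault_nodes:
  - https://mypc-01:8200
  - https://mypc-02:8200
  - https://mypc-03:8200
unseal_tokens:
  - "REDACTED-TOKEN-1"
  - "REDACTED-TOKEN-2"
  - "REDACTED-TOKEN-3"
```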

medinaLeo commented 2 years ago

Hello, how are you? Did you find anything on this?

Regards.

lrstanley commented 2 years ago

Looking through the code and your config, I don't see anything that stands out. Could you try changing check_interval to 60s, or similar? Also, if you could provide any logs, that may help determine if something is running too frequently.

As far as your unseal_tokens, is your server configured for 5 unseal tokens, or only 3?

medinaLeo commented 2 years ago

Hi

In my config file I haven't set check_interval, but as you can see in the logs, the default check_interval already seems to be 60s. Do you think it is still necessary to set the frequency to 60s explicitly?

In my logs I only see these kinds of messages:

Aug 16 13:49:12 mypc-01 vault-unseal[162406]: timestamp=2022-08-16T13:49:12.288126072Z level=info message="running checks" addr=https://mypc-01:XXXX environment= version=0.2.2
Aug 16 13:49:12 mypc-01 vault-unseal[162406]: timestamp=2022-08-16T13:49:12.289023545Z level=info message="seal status" addr=https://mypc-01:XXXX environment= version=0.2.2
Aug 16 13:49:18 mypc-01 vault-unseal[162406]: timestamp=2022-08-16T13:49:18.078589498Z level=info message="running checks" addr=https://mypc-03:XXXX environment= version=0.2.2
Aug 16 13:49:18 mypc-01 vault-unseal[162406]: timestamp=2022-08-16T13:49:18.087721101Z level=info message="seal status" addr=https://mypc-03:XXXX environment= version=0.2.2
Aug 16 13:49:18 mypc-01 vault-unseal[162406]: timestamp=2022-08-16T13:49:18.101855823Z level=info message="running checks" addr=https://mypc-02:XXXX environment= version=0.2.2
Aug 16 13:49:18 mypc-01 vault-unseal[162406]: timestamp=2022-08-16T13:49:18.1101471Z level=info message="seal status" addr=https://mypc-02:XXXX environment= version=0.2.2
Aug 16 13:50:12 mypc-01 vault-unseal[162406]: timestamp=2022-08-16T13:50:12.289304419Z level=info message="running checks" addr=https://mypc-01:XXXX environment= version=0.2.2
Aug 16 13:50:12 mypc-01 vault-unseal[162406]: timestamp=2022-08-16T13:50:12.290365437Z level=info message="seal status" addr=https://mypc-01:XXXX environment= version=0.2.2
Aug 16 13:50:18 mypc-01 vault-unseal[162406]: timestamp=2022-08-16T13:50:18.088061798Z level=info message="running checks" addr=https://mypc-03:XXXX environment= version=0.2.2
Aug 16 13:50:18 mypc-01 vault-unseal[162406]: timestamp=2022-08-16T13:50:18.089027364Z level=info message="seal status" addr=https://mypc-03:XXXX environment= version=0.2.2
Aug 16 13:50:18 mypc-01 vault-unseal[162406]: timestamp=2022-08-16T13:50:18.11686612Z level=info message="running checks" addr=https://mypc-02:XXXX environment= version=0.2.2
Aug 16 13:50:18 mypc-01 vault-unseal[162406]: timestamp=2022-08-16T13:50:18.118480703Z level=info message="seal status" addr=https://mypc-02:XXXX environment= version=0.2.2

Answering your question: I have 3 servers configured with 5 unseal tokens.

Regards!!

lrstanley commented 2 years ago

Hmm, that's odd. We don't set the config to 60s by default -- looks like it's actually 30s by default.

I will see if I can replicate later today.

Another item: since you're using LXC, make sure the CPU shares you give to that container are accurate and proportional to the other containers. E.g., if one container is given a very high share, the container in question ends up with a much lower effective CPU share; setting the shares well below 1000 (the default, IIRC) can also cause issues. If you look on the parent (outside of LXC), do you see the vault-unseal process in the process list? What CPU percentage does it show outside of the container?

medinaLeo commented 2 years ago

Ok, if you can replicate the config and you find something else, let me know.

About the CPU: everything is proportional. I use the default Proxmox values for all configured nodes.

(screenshot: Proxmox CPU configuration, 2022-08-17)

This is the Proxmox server. As you can see, the vault-unseal process is in the process list. Even though vault-unseal is written in Go and stracing it might or might not help, I've included a strace syscall summary just in case:

root@pmx-01:~# ps -aux | grep vault-unseal
root       63197  0.0  0.0   8180   668 pts/1    S+   09:35   0:00 grep vault-unseal
100000   1450463  109  0.0 714784 18972 ?        Ssl  Aug03 21922:34 /usr/bin/vault-unseal --config /etc/vault-unseal.yaml
root@pmx-01:~#
root@pmx-01:~#
root@pmx-01:~# strace -cfp 1450463
strace: Process 1450463 attached with 12 threads
strace: pid 189640: entering, ptrace_syscall_info.op == 2
^Cstrace: Process 1450463 detached
strace: Process 1450464 detached
strace: Process 1450465 detached
strace: Process 1450466 detached
strace: Process 1450467 detached
strace: Process 1450468 detached
strace: Process 1450469 detached
strace: Process 1450470 detached
strace: Process 1450984 detached
strace: Process 1473866 detached
strace: Process 2569859 detached
strace: Process 189640 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 43.35    1.297752          10    128131     54345 futex
 32.89    0.984627          27     36029         2 nanosleep
 22.92    0.685997          51     13251           epoll_pwait
  0.24    0.007259           2      2977           write
  0.21    0.006302           7       814           getpid
  0.18    0.005320           1      2977           read
  0.10    0.002915           1      1613           sched_yield
  0.06    0.001681           2       814           tgkill
  0.03    0.000822           1       814        11 rt_sigreturn
  0.02    0.000632           1       520           madvise
  0.01    0.000208         208         1           restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00    2.993515          15    187941     54358 total

root@pmx-01:~#

The CPU percentage shown outside of the container (on the Proxmox server) is similar to what it shows inside the LXC (more than 100%):

root@pmx-01:~# top
top - 09:38:15 up 58 days, 21:40,  2 users,  load average: 3.37, 2.74, 2.82
Tasks: 407 total,   4 running, 403 sleeping,   0 stopped,   0 zombie
%Cpu(s): 17.5 us,  3.7 sy,  0.0 ni, 78.6 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  32080.9 total,   5371.2 free,   8729.8 used,  17979.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  22848.6 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1450463 100000    20   0  714784  19236      4 S 110.3   0.1  21925:34 vault-unseal

medinaLeo commented 2 years ago

Hi !!

Did you see anything else?

Regards!!

lrstanley commented 2 years ago

Figured out it was related to a default value for the notification functionality. v0.2.3 was just released with the fix.

medinaLeo commented 2 years ago

Works very well, thx!!!

(screenshot: top output showing normal CPU usage after the fix, 2022-08-25)