rvalitov / zabbix-php-fpm

PHP-FPM status monitoring template for Zabbix with auto discovery (LLD), support for multiple pools and ISPConfig
GNU General Public License v3.0
68 stars 20 forks

Improve discover script execution time #31

Open bekopharm opened 4 years ago

bekopharm commented 4 years ago

The discovery script is unreasonably slow. While it works fine for a few pools, with ~20 pools (status path already configured) it scratches at the default 3 s timeout of zabbix-server and zabbix-agent, and often results in ZBX_NOTSUPPORTED: Timeout while executing a shell script.

Here are some typical discovery round trips for me with 20 pools (status path configured); the server has ~200 pool processes running in total:

And here is the same executed locally (so it's hardly network latency or ZBX overhead):

Otherwise it's a great script 👍. I'm afraid to use this on a server that has even more pools running, though. Mileage may vary - this server has real load, after all.

rvalitov commented 4 years ago

Hi! We should find where the problem happens. I can only think of the following command, please run it and check its execution speed:

ps aux | grep "php-fpm"

Also, please give me some information about the CPU load and IO load, just to get an idea of how loaded your server is. Utilities like htop and iotop can help. Alternatively, you can get the values from Zabbix monitoring.

bekopharm commented 4 years ago

time ps aux | grep "php-fpm"
real    0m0.026s

time pgrep "php-fpm"
real    0m0.015s

time /etc/zabbix/zabbix_php_fpm_discovery.sh
real    0m2.520s

load average: 4,70, 4,51, 4,75; iowait 10.3738 % (caught it during a backup, which goes to a network share in another data center)

rvalitov commented 4 years ago

Thank you! Originally I thought that there could be a problem with ps (like this and similar issues), but that's not the case.

I guess the only reason this problem happens is that the discovery script queries each detected pool to confirm that it's functional and to obtain its process manager type (ondemand, dynamic, static). With lots of pools this operation can take some time.

The approach above ensures that all the pools you see in Zabbix really report status information. Pools that do not report it are ignored and not shown in Zabbix at all. I thought this was useful, because then you don't see errors in Zabbix for pools you don't want to monitor (i.e. where the status page is disabled). But it turns out such queries take noticeable time.
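For context, a pool's status page can be queried directly over its FastCGI socket with the cgi-fcgi tool. This is only an illustrative sketch, not the template's actual code; the socket path and status path below are assumptions you would adjust to your setup:

```shell
#!/usr/bin/env bash
# Hypothetical one-off status query; SOCKET and STATUS_PATH are assumptions.
SOCKET="/run/php/php8.1-fpm.sock"   # adjust to your pool's "listen" directive
STATUS_PATH="/status"               # adjust to your pm.status_path value

if command -v cgi-fcgi >/dev/null 2>&1 && [[ -S "$SOCKET" ]]; then
  # cgi-fcgi (from the libfcgi tools) speaks FastCGI to the pool directly
  SCRIPT_NAME="$STATUS_PATH" SCRIPT_FILENAME="$STATUS_PATH" \
  REQUEST_METHOD=GET QUERY_STRING="json" \
    cgi-fcgi -bind -connect "$SOCKET"
else
  echo "cgi-fcgi or socket not available; skipping"
fi
```

With N pools this means N round trips, which is where the time adds up.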

I currently have 2 ideas here:

While the last idea seems logical, I don't want to rely on cron, because it would make the configuration of this template more complicated. But I don't see any alternative solution that is simple and elegant.

bekopharm commented 4 years ago

I don't have the source open right now, but are you piping the list through sort and uniq? Some pools may appear twice.
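Deduplicating a list of pool names in Bash can be done with a single sort -u pass; an illustrative snippet (made-up pool names, not the template's code):

```shell
# Illustrative: collapse duplicate pool names with sort -u
POOLS=("www" "alpha" "www" "beta" "alpha")
printf '%s\n' "${POOLS[@]}" | sort -u
```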

rvalitov commented 4 years ago

Yes, there are no duplicates. You can also check that by running the script manually with debug mode enabled.

rvalitov commented 4 years ago

Just for information: I've been thinking about this issue and how to improve it. A new idea is to use an intermediate cache, and the script will perform a self-check of its execution time. If the execution time gets close to 3 s, it will output the results it has already processed and save its execution state. The next time the discovery script is launched, it will restore that state and continue from the point where it stopped.

So, now I'm ready to move on from the thinking phase to actual programming.
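The self-check idea can be sketched roughly like this. An illustrative sketch only, not the template's implementation; the 2.5 s budget and the pool names are assumptions:

```shell
#!/usr/bin/env bash
# Sketch of a time-budget self-check (illustrative; not the template's code).
MAX_TIME_MS=2500                      # assumption: stay under the 3 s agent timeout
START_MS=$(( $(date +%s%N) / 1000000 ))
PROCESSED=0
for POOL in alpha beta gamma; do      # assumption: the pending pool list
  NOW_MS=$(( $(date +%s%N) / 1000000 ))
  if (( NOW_MS - START_MS >= MAX_TIME_MS )); then
    # Out of budget: emit what we have and persist the rest for the next run
    echo "budget exhausted, saving state" >&2
    break
  fi
  # ... query pool status here ...
  PROCESSED=$((PROCESSED + 1))
done
echo "processed $PROCESSED pools"
```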

rvalitov commented 4 years ago

@bekopharm Some information for you: I could reproduce this issue. In my tests I used 400 pools via Travis CI, which is now used for testing this project. Work to fix this issue is ongoing.

rvalitov commented 4 years ago

/remind me this issue in 10 days

reminders[bot] commented 4 years ago

@rvalitov set a reminder for Jul 11th 2020

rvalitov commented 4 years ago

A quick workaround with the current template would be to increase the Timeout option of the agent (and probably on the server as well): (screenshot omitted)
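For reference, that is the Timeout parameter in the Zabbix agent (and server) configuration; Zabbix accepts values from 1 to 30 seconds. A sketch, assuming the default config paths:

```ini
# /etc/zabbix/zabbix_agentd.conf (analogously: Timeout in zabbix_server.conf)
# Range is 1-30 seconds; 10 here is an arbitrary example value.
Timeout=10
```

Restart the agent (and server, if changed there) for the new value to take effect.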

rvalitov commented 4 years ago

@bekopharm I made a new pull request #42 that should solve this issue. Can you please test it? You should just download the new files from that pull request. Kindly waiting for your feedback, thank you!

reminders[bot] commented 4 years ago

:wave: @rvalitov, this issue

bekopharm commented 4 years ago

Going to check this week @rvalitov. Drowning in work atm but I've seen it :+1:

bekopharm commented 4 years ago

@rvalitov Thanks for your patience. I got around to testing it today. This branch yields (after installing bc):

mapfile: -d: invalid option

The issue occurs on Ubuntu 16.04.6 LTS; Bash is GNU bash, version 4.3.48(1)-release (x86_64-pc-linux-gnu):

mapfile: usage: mapfile [-n count] [-O origin] [-s count] [-t] [-u fd] [-C callback] [-c quantum] [array]

Also, I'm not happy with the cache files at $(${S_DIRNAME} "$0"), because (given the script location in userparameter_php_fpm.conf) this resolves to a folder under /etc. While there should be no executables in /etc anyway, it's definitely the wrong location for a cache. Think of read-only root filesystems and similar setups.

rvalitov commented 4 years ago

@bekopharm Thank you so much for the testing!

Indeed, bc is required; the Wiki will be updated after the new code is merged.

The error with mapfile happens because that Bash is old and does not support this option: mapfile -d requires Bash 4.4 or newer. I will rewrite the code to support older Bash versions, and will also add test cases for older Ubuntu systems. Currently I've been testing only on Ubuntu 18 Bionic and later.
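For what it's worth, a portable replacement for `mapfile -d` is a `read -d` loop, which also works on Bash 4.3. An illustrative snippet with made-up input, not the template's code:

```shell
#!/usr/bin/env bash
# Illustrative: split a delimited string without `mapfile -d`
input="pool_a,pool_b,pool_c"
ARR=()
# read -d ',' consumes up to each comma; the || catches the last,
# unterminated field when read hits end-of-input
while IFS= read -r -d ',' item || [[ -n "$item" ]]; do
  ARR+=("$item")
done < <(printf '%s' "$input")
printf '%s\n' "${ARR[@]}"
```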

Regarding the cache file: if you have correctly installed the script files, i.e. copied them to the correct location /etc/zabbix/, then the cache files will be placed in that directory, not in the root /etc/ directory. A cache file has already been used for a long time (since earlier versions of the template), because it's the only way to support pools with the ondemand process manager. The cache files have meaningful names, so a user should easily understand that they belong to this template. I decided to place the cache files in the same directory as the script files, because:

If you have any arguments that my approach is wrong, please let me know, and share your advice on where you think the cache files should be located.

bekopharm commented 4 years ago

Well, I can follow your argument that it's best effort for a small project like yours, but you will also never get a proper distro package for it this way, because it does not follow the Linux Filesystem Hierarchy Standard. Anyway, it's your call, and I'm aware that I can change this. No hard feelings :)

As for Ubuntu 16.04.6 LTS: end of life is in 2021 (and extended security maintenance even runs until 2024). It's still widely used, so thanks for aiming to stay compatible with it :)

rvalitov commented 4 years ago

@bekopharm I updated the code and added tests for Ubuntu 14 and 16. Please, test the updated version. Waiting for your feedback.

Regarding the file save location, I will also fix this in next update #49

rvalitov commented 4 years ago

@bekopharm please let me know when you can test the updated code. Thank you!

bekopharm commented 4 years ago

I hope to check this tomorrow or next week. On my list :grin:

rvalitov commented 4 years ago

Thank you! I also updated the location of cache files to /var/cache

bekopharm commented 4 years ago

The script exits fine now, but the results are incomplete. I ran it several times to give the cache files time to fill up, but I only get ~5 results. When I enable debug I can see that it finds other pools too. They're just not in the final {"data":[]}

rvalitov commented 4 years ago

That's bizarre. Can you please provide the output with debug mode enabled?

rvalitov commented 4 years ago

@bekopharm can you please provide more details? As far as I understand, the script detects the pools but does not put them in the resulting output? The reason should be written in the debug output. The detection script finds all pools that look valid, then tries to communicate with them to verify that they really have the status output configured and return valid data. Perhaps the pools do not return valid data or do not respond. Such information is displayed when you run the script in debug mode. That's the reason why I asked you to provide the debug output.

Please kindly respond, because this issue has been hanging here for a long time and stops me from moving forward on further improvements of the script. Thank you in advance!

bekopharm commented 1 year ago

Sorry, totally missed the reply. I upgraded a bunch of hosts today and took a closer look.

In my current scenario I have several processes for multiple pools: 212 processes in total, which boil down to 29 unique sockets.

The script from the parallel branch writes sockets for all 212 detected FPM processes to /var/cache/zabbix-php-fpm/php_fpm_pending.cache, but never manages to create ANY entries in /var/cache/zabbix-php-fpm/php_fpm_results.cache.

I debugged this, and it turns out the script kills itself on the very first pool due to max execution time. That is the first place the max execution time is checked, if I understood it correctly.

If I run it with a hilariously huge execution time, it comes back fine:

root@web:/etc/zabbix# time ./zabbix_php_fpm_discovery.sh max_time 70000 max_tasks 1 /status
[...] // json skipped
real    0m36,255s
user    0m32,234s
sys 0m6,804s

Patching it with this truncates the PENDING_LIST and removes all the duplicates:

813,814d812
< readarray -t PENDING_LIST < <(for a in "${PENDING_LIST[@]}"; do echo "$a"; done | $S_SORT | $S_UNIQ)
< 

This has a dramatic effect on the runtime:

root@web:/etc/zabbix# time ./zabbix_php_fpm_discovery.sh max_time 70000 max_tasks 1 /status
[...] // json skipped
real    0m8,020s
user    0m6,730s
sys 0m1,441s

Raising max_tasks affects the time slightly, but not for the better. It also leaves leftover lines in php_fpm_pending.cache. The results were added to php_fpm_results.cache just fine, though.

I also don't understand why PENDING_LIST is immediately filled up by AnalyzePool() again, resulting in all the checking and regenerating of php_fpm_results.cache once more. Shouldn't this be skipped if the cache file exists and is recent?

I changed that part like this:

if [[ ${#CACHE[@]} -le 1 ]]; then
  mapfile -t PS_LIST < <($S_PS ax | $S_GREP -F "php-fpm: pool " | $S_GREP -F -v "grep")
  # shellcheck disable=SC2016
  POOL_NAMES_LIST=$($S_PRINTF '%s\n' "${PS_LIST[@]}" | $S_AWK '{print $NF}' | $S_SORT -u)

  #Update pending list with pools that are active and running
  while IFS= read -r POOL_NAME; do
    AnalyzePool "$POOL_NAME"
  done <<<"$POOL_NAMES_LIST"

  readarray -t PENDING_LIST < <(for a in "${PENDING_LIST[@]}"; do echo "$a"; done | $S_SORT | $S_UNIQ)
else
    PrintDebug "Cache data found, skipping detection of pools from process list"
fi

Now it's coming back in a reasonable time once the cache is filled:

real    0m1,337s
user    0m1,287s
sys 0m0,063s

I'm aware that I now have to clear the cache file manually to update the pool list:

echo "" > /var/cache/zabbix-php-fpm/php_fpm_results.cache

or

rm /var/cache/zabbix-php-fpm/php_fpm_results.cache
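Manual clearing could also be replaced by an age check on the cache file. This is just my sketch of the idea, with an assumed 10-minute TTL and a temp path for demonstration:

```shell
#!/usr/bin/env bash
# Illustrative age-based cache expiry; the path and TTL are assumptions.
CACHE_FILE="/tmp/php_fpm_results.cache"
touch "$CACHE_FILE"                       # freshly created for this demo
# find -mmin +10 prints the file only if it was modified >10 minutes ago
if [[ -n "$(find "$CACHE_FILE" -mmin +10 2>/dev/null)" ]]; then
  echo "cache stale, rebuilding"
else
  echo "cache fresh, reusing"
fi
```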

rvalitov commented 1 year ago

Thank you so much for sharing your experience and investigation. I will look into this. May I know what version of Zabbix you use?

bekopharm commented 1 year ago

v6.0 (LTS) and v6.4 (LTS)

rvalitov commented 1 year ago

I will check this after completing: