bekopharm opened this issue 4 years ago
Hi! We should find where the problem happens. The only command I can think of is the following one; please run it and check its execution speed:
ps aux | grep "php-fpm"
Also, please give me some information about the CPU load and IO load, just to get an idea of how loaded your server is. Utilities like `htop` and `iotop` can help. Alternatively, you can get the values from Zabbix monitoring.
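If those tools are not installed, the following commands (assuming a standard Linux box with procps, which is almost always present) give roughly the same picture:

```bash
# 1-, 5- and 15-minute load averages plus running/total tasks
cat /proc/loadavg

# CPU usage breakdown; the "wa" column is the iowait percentage
# (5 samples, one second apart; the first sample is an average since boot)
vmstat 1 5
```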
`time ps aux | grep "php-fpm"`: real 0m0.026s
`time pgrep "php-fpm"`: real 0m0.015s
`time /etc/zabbix/zabbix_php_fpm_discovery.sh`: real 0m2.520s
load average: 4.70, 4.51, 4.75; iowait 10.3738 % (caught it during a backup, and it's a network share in another data center)
Thank you!
Originally I thought that there could be a problem with `ps` (like this and similar issues), but that's not the case.
I guess the only reason this problem happens is that the discovery script queries each detected pool to confirm that it's functional and to obtain its process manager type (ondemand, dynamic, static). If you have lots of pools then this operation may take some time.
The approach above ensures that all the pools that you see in Zabbix really report status information. The pools that do not report the information are ignored and are not shown in Zabbix at all. I thought that this is useful, because in this case you don't see errors in Zabbix for the pools that you don't want to monitor (i.e. where the status page is disabled). But it seems that such queries require some time.
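For context, each per-pool check is essentially a FastCGI request to the pool's status page, roughly like this (a sketch only: the `cgi-fcgi` tool from libfcgi, the socket path and the `/status` path here are assumptions for illustration, not necessarily what the script uses internally):

```bash
# Hypothetical pool socket and status path, for illustration only
POOL_SOCKET="/run/php/php-fpm-example.sock"
STATUS_PATH="/status"

# Ask the pool for its status page over FastCGI
SCRIPT_NAME="$STATUS_PATH" \
SCRIPT_FILENAME="$STATUS_PATH" \
REQUEST_METHOD=GET \
QUERY_STRING="json" \
cgi-fcgi -bind -connect "$POOL_SOCKET"
```

With dozens of pools, even a few tens of milliseconds per request add up to the seconds you are seeing.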
I currently have 2 ideas here:
While the last idea seems logical, I don't want to work with cron, because it would make the configuration of this template more complicated. But I don't see any alternative solution that is simple and elegant.
I don't have the source open right now, but are you piping it through `sort` and `uniq`? Some pools may be duplicated.
Yes, there are no duplicates. You can also check that by running the script manually with debug mode enabled.
Just for information: I've been thinking about this issue and how to improve it. A new idea is to use an intermediate cache and have the script perform a self-check of its execution time. If the execution time gets close to 3 seconds, the script will return the results it has already processed and save its execution state. The next time the discovery script is launched, it will restore that state and continue from the point where it stopped.
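A rough sketch of that self-check idea in bash (illustrative only: `MAX_TIME`, the cache file path and `ProcessPool` are made-up names for this example, not the template's real ones):

```bash
#!/usr/bin/env bash
MAX_TIME=3                                   # time budget in seconds
PENDING_CACHE="/tmp/php_fpm_pending.cache"   # made-up state file for the example
START=$SECONDS                               # bash built-in timer

# Stand-in for the real per-pool status check
ProcessPool() { echo "checking pool: $1" >&2; sleep 0.2; }

# Restore unfinished work from the previous run, if any
PENDING=()
[[ -f "$PENDING_CACHE" ]] && mapfile -t PENDING < "$PENDING_CACHE"

REMAINING=()
for POOL in "${PENDING[@]}"; do
  if (( SECONDS - START >= MAX_TIME )); then
    REMAINING+=("$POOL")                     # out of time: keep it for the next run
    continue
  fi
  ProcessPool "$POOL"
done

# Save the execution state so the next run continues where this one stopped
printf '%s\n' "${REMAINING[@]}" > "$PENDING_CACHE"
```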
So, now I'm ready to move on from the thinking phase to actual programming.
@bekopharm Some information for you. I could reproduce this issue, in my tests I used 400 pools via Travis CI, which is now used for testing of this project. The work to fix this issue is ongoing.
/remind me this issue in 10 days
@rvalitov set a reminder for Jul 11th 2020
A quick workaround with the current template would be to increase the `Timeout` option of the agent (and probably on the server as well):
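For example (exact config paths depend on the distribution; `Timeout` accepts values from 1 to 30 seconds):

```
# /etc/zabbix/zabbix_agentd.conf (and zabbix_server.conf if the server side also times out)
Timeout=30
```

followed by a restart of the agent, e.g. `systemctl restart zabbix-agent`.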
@bekopharm I made a new pull request #42 that should solve this issue. Can you please test it? You should just download the new files from that pull request. Kindly waiting for your feedback, thank you!
:wave: @rvalitov, this issue
Going to check this week @rvalitov. Drowning in work atm but I've seen it :+1:
@rvalitov Thanks for your patience. I got around to testing it today. This branch yields (after installing `bc`):

`mapfile: -d: invalid option`

The issue is Ubuntu 16.04.6 LTS; Bash there is GNU bash, version 4.3.48(1)-release (x86_64-pc-linux-gnu):

`mapfile: usage: mapfile [-n count] [-O origin] [-s count] [-t] [-u fd] [-C callback] [-c quantum] [array]`
Also, I'm not happy with the cache files living at `$(${S_DIRNAME} "$0")`, because (given the script location you provide in `userparameter_php_fpm.conf`) this resolves to a folder under `/etc`. While there should be no executables located in `/etc` anyway, it's definitely the wrong location for a cache. I'm thinking about read-only root and similar setups here.
@bekopharm Thank you so much for the testing!
Indeed, `bc` is required; the Wiki will be updated after the new code is merged.
The error with `mapfile` happens because the `bash` there is old and does not support this option. It seems that `bash` version 4.4 or newer is required for it to work (the `-d` option was added in 4.4). I will rewrite the code to support older `bash`, and will also add test cases for older Ubuntu systems. Currently I've been testing only on Ubuntu 18 Bionic and later.
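For reference, a delimiter-splitting loop that works on bash 4.3 and older (just a sketch of the general approach, not necessarily the exact replacement that will end up in the code):

```bash
# Collect NUL-delimited input into an array without mapfile -d
# (the find command is only an example input source)
ITEMS=()
while IFS= read -r -d '' ITEM; do
  ITEMS+=("$ITEM")
done < <(find /etc/zabbix -name '*.conf' -print0)

printf 'found %d items\n' "${#ITEMS[@]}"
```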
Regarding the cache file: if you have correctly installed the script files, i.e. copied them to the correct location `/etc/zabbix/`, then the cache files will be placed in that location, not in the root `/etc/` directory. The cache file has already been used for a long time (since earlier versions of the template), because it's the only way to support pools with the `ondemand` process manager. The cache files have meaningful names, so a user should easily understand that they are linked to this template. I decided to place the cache files in the same directory as the script files, because:

1. If a user looks into the `/etc/zabbix/` directory, then he or she will also notice the cache files and can read and check them.
2. The `/etc/zabbix/` directory is accessible to `root` only by default, so it's a safe place to store the files, and there's little chance that the cache files will be modified by the user, the system or another script. For example, if I put the cache files into `/tmp/`, then they may be deleted there by the system during reboot or by other scripts or services doing system cleanup.

If you have any arguments that my approach is wrong, please let me know, and share your advice on where you think the cache files should be located.
Well, I can follow your argument that it's a best effort for a small project like yours, but you will also never get a proper package for it this way, because it does not follow the Linux Filesystem Hierarchy Standard. Anyway, it's your thing, and I'm aware that I can change this. No hard feelings :)
As for Ubuntu 16.04.6 LTS: End of Life is in 2021 (and extended security maintenance even runs until 2024). It's still in heavy use, so thanks for aiming to stay compatible with it :)
@bekopharm I updated the code and added tests for Ubuntu 14 and 16. Please, test the updated version. Waiting for your feedback.
Regarding the file save location: I will also fix this in the next update, #49.
@bekopharm please let me know when you can test the updated code. Thank you!
I hope to check this tomorrow or next week. On my list :grin:
Thank you! I also updated the location of the cache files to `/var/cache`.
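If you install the update by copying the files manually, make sure the directory exists and is writable by the account the checks run under (assuming the usual `zabbix` user here; adjust if your setup runs the script as root via sudo):

```bash
# Create the cache directory with ownership for the zabbix user
install -d -o zabbix -g zabbix -m 0750 /var/cache/zabbix-php-fpm
```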
The script exits fine now, but the results are incomplete. I ran it several times to give the cache files time to fill up, but I only get about ~5 results. When I enable debug I can see that it finds other pools too; they just don't end up in the final `{"data":[]}`.
That's bizarre. Can you please provide the output with debug mode enabled?
@bekopharm can you please provide any more details? As far as I understand, the script detects the pools, but does not put them in the resulting output? The reason should be written in the debug output. The detection script finds all pools that look good, and then it tries to communicate with them to verify that the pools do really have the status output configured and return valid data. Perhaps, the pools do not return the valid data or do not respond. Such information should be displayed when you run the script in debug mode. That's the reason why I asked you to provide the debug output.
Please kindly respond, because this issue has been hanging here for a long time, and it stops me from moving forward with further improvements to the script. Thank you in advance!
Sorry, totally missed the reply. I upgraded a bunch of hosts today and took a closer look.
In my current scenario I have several processes for multiple pools: that's 212 processes that boil down to 29 unique sockets.
The script from the parallel branch writes sockets for all 212 detected FPM processes to `/var/cache/zabbix-php-fpm/php_fpm_pending.cache`, but never manages to create ANY entries in `/var/cache/zabbix-php-fpm/php_fpm_results.cache`.

I debugged this, and it turns out that the script kills itself on the very first pool due to the max execution time. This is the first point where the max execution time is checked, if I understood that correctly.

If I run it with a hilariously huge execution time, it comes back fine.
root@web:/etc/zabbix# time ./zabbix_php_fpm_discovery.sh max_time 70000 max_tasks 1 /status
[...] // json skipped
real 0m36,255s
user 0m32,234s
sys 0m6,804s
Patching it with this truncates the PENDING_LIST and removes all the duplicates:
813,814d812
< readarray -t PENDING_LIST < <(for a in "${PENDING_LIST[@]}"; do echo "$a"; done | $S_SORT | $S_UNIQ)
<
This has a dramatic effect on the runtime:
root@web:/etc/zabbix# time ./zabbix_php_fpm_discovery.sh max_time 70000 max_tasks 1 /status
[...] // json skipped
real 0m8,020s
user 0m6,730s
sys 0m1,441s
Raising `max_tasks` affects the time slightly, but not for the better. It also leaves leftover lines in `php_fpm_pending.cache`. The results were added to `php_fpm_results.cache` just fine, though.
I also don't understand why PENDING_LIST is immediately filled up by `AnalyzePool()` again, resulting in all the checking and regenerating of `php_fpm_results.cache` again. Shouldn't this be skipped if the cache file is recent or already exists?
I changed that part like this:
if [[ ${#CACHE[@]} -le 1 ]]; then
  mapfile -t PS_LIST < <($S_PS ax | $S_GREP -F "php-fpm: pool " | $S_GREP -F -v "grep")
  # shellcheck disable=SC2016
  POOL_NAMES_LIST=$($S_PRINTF '%s\n' "${PS_LIST[@]}" | $S_AWK '{print $NF}' | $S_SORT -u)
  #Update pending list with pools that are active and running
  while IFS= read -r POOL_NAME; do
    AnalyzePool "$POOL_NAME"
  done <<<"$POOL_NAMES_LIST"
  readarray -t PENDING_LIST < <(for a in "${PENDING_LIST[@]}"; do echo "$a"; done | $S_SORT | $S_UNIQ)
else
  PrintDebug "Cache data found, skipping detection of pools from process list"
fi
Now it's coming back in a reasonable time once the cache is filled:
real 0m1,337s
user 0m1,287s
sys 0m0,063s
I'm aware that I now have to clear the cache file manually to update the pool list:
echo "" > /var/cache/zabbix-php-fpm/php_fpm_results.cache
or
rm /var/cache/zabbix-php-fpm/php_fpm_results.cache
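An alternative to clearing it by hand would be to expire the results cache by age (just an idea, not something the script currently does), e.g. from cron or a systemd timer:

```bash
# Remove the results cache if it is older than 60 minutes,
# forcing a full re-detection of the pool list on the next run
find /var/cache/zabbix-php-fpm -name 'php_fpm_results.cache' -mmin +60 -delete
```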
Thank you so much for sharing your experience and investigation. I will look into this. May I know what version of Zabbix you use?
v6.0 (LTS) and v6.4 (LTS)
I will check this after completing:
The discovery script is unreasonably slow. While it works okay for a few pools, it scratches at the default 3 s timeout of `zabbix-server` and `zabbix-agent` with ~20 pools already having a status path, and often results in `ZBX_NOTSUPPORTED: Timeout while executing a shell script`.

Here are some typical discovery round trips with 20 pools having a status path (the server has ~200 pool processes going in total) for me:
And here is the same executed locally (so it's hardly network latency or ZBX overhead):
Otherwise it's a great script 👍. I'm afraid to use this on a server that has even more pools going, though. I mean, mileage may vary - this server has load, after all.