Don't sleep in sender_wrapper.py

asomers commented 4 years ago

I'm struggling to understand the purpose of the configurable "timeout". It doesn't function as a timeout at all. Instead, it's a fixed-length sleep that gets added before sending results in get mode, but not getverb. The comment says "wait for LLD to be processed by server". The only two motivations I can think of are:

1) The server processes multiple responses from the same agent out-of-order. That seems unlikely. It would be a server bug if true.

2) The server performs LLD using a getverb operation, which sender_wrapper handles by forking and running in the background. That might allow it to process a subsequent get operation before the original getverb was complete. If this is true, then the correct solution is to block while sending a response rather than running it in the background.

Unless you tell me otherwise, I'm going to assume that 2 is the case. I'll fix it and remove the timeout setting.

nobody43 commented 4 years ago

Current situation

When the Zabbix server performs the check ('get'), the LLD is composed and returned. There is no need to run the same commands a second time, because all the data has already been gathered. But sending to the trapper immediately is inadvisable, because the server processes LLD within ~60 seconds, so the items will not exist yet (dependent on server load?). Therefore, the timeout was introduced. It's 0 on Windows because I was unable to pipe and fork at the same time. The fork is needed so that LLD can respond initially. I'm strongly against cache files. sender_wrapper.py was invented when Dependent Items were not available (which is still the case for Zabbix 3.0; the script must be compatible with the oldest LTS). And it might be the solution for multiple-server support when the latest Zabbix is not available.

Thinking about the solution for 500 drives

The structure of the script must be redone completely. I'm theorizing about cascading LLD, if it's even possible. First, an LLD must gather all available disks. Then each disk, with a separate LLD (trapper?), will discover its own SMART names and send the gathered values. The timeout is still needed. I'm still unsure whether a duplicate-disk check is possible with this approach.

nobody43 commented 4 years ago

getverb is for debug only.

asomers commented 4 years ago

So LLD uses get too? In that case, LLD gets the sleep. What's the point of sleeping if LLD and regular polling both get the same sleep? And what do you mean by "trapper"? And what does any of this have to do with cache files? I'm confused.

nobody43 commented 4 years ago

get is the regular LLD polling, which returns JSON (smartctl.discovery[get,{HOST.HOST}]). It sleeps while the prototype items are created. Then the items with values are sent in bulk with Zabbix trapper. It's a confusing scheme, but I only acted within the boundaries of Zabbix's capabilities. :) One way all of this could be achieved is with cache files for disk output. I'm not going that way.
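
For readers unfamiliar with the JSON that an LLD check returns, here is a minimal sketch of its shape. The `{"data": [...]}` wrapper is the standard Zabbix LLD format; the macro name `{#DISKNAME}` and the function `build_lld` are hypothetical, chosen only for illustration, and are not taken from sender_wrapper.py.

```python
import json


def build_lld(disks):
    """Return a Zabbix low-level discovery document for the given disk names."""
    # "{#DISKNAME}" is a hypothetical macro name; the real template may use another.
    rows = [{"{#DISKNAME}": name} for name in disks]
    # Zabbix expects the list of discovered entities under the "data" key.
    return json.dumps({"data": rows})


if __name__ == "__main__":
    print(build_lld(["sda", "sdb"]))
```

The server creates items from the item prototypes for each discovered row, which is why values pushed before that step has finished have nowhere to land.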

asomers commented 4 years ago

Ok, so after a few days of study I understand better how Zabbix works and why you do what you do:

Normally, LLD discovery would return a list of disks, and the Zabbix master would poll those at some interval. That's what https://github.com/v-zhuravlev/zbx-smartctl does.
You chose instead to use "trapper" items in order to minimize the amount of network activity. It's a stupid name, but it basically means "push notifications". Minimizing the amount of network activity is good.
Sending the detailed smart values with zabbix_sender could happen at any time, even from a cron job. If new disks have arrived since the last LLD, that's fine: the server will ignore the data for the new disks. I verified that experimentally. If the data from zabbix_sender arrives before the very first LLD, then it will all be discarded. That's not too bad, because it will still be picked up the next time the server polls LLD.
Since you decided to send detailed smart data during discovery (not strictly necessary), you added the fork. That allows the user parameter script to return its LLD data before zabbix_sender returns detailed smart data.
In order to close a race between the parent and child processes, you added the "timeout" parameter (really a delay, not a timeout). That's bad; hard-coded delays are always bad.
You also collect detailed data with smartctl during discovery. That's fine for systems with just a few disks, but on large systems it exceeds zabbix's hard-coded 30s timeout for user parameter scripts.
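
The point that values can be pushed at any time, independently of discovery, can be sketched as follows. This assumes the `zabbix_sender` CLI is installed and uses its `-c` (config file) and `-i -` (read input from stdin) options; the function names, the config path, and the item keys in the usage note are hypothetical.

```python
import subprocess


def sender_lines(host, values):
    """Format values as zabbix_sender input lines: <host> <key> <value>."""
    lines = []
    for key, value in values.items():
        # Quote every value, escaping backslashes and double quotes.
        # Keys are assumed to contain no whitespace.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'{host} {key} "{escaped}"')
    return "\n".join(lines) + "\n"


def send(host, values, config="/etc/zabbix/zabbix_agentd.conf"):
    """Push all values in one zabbix_sender invocation (config path is an assumption)."""
    return subprocess.run(
        ["zabbix_sender", "-c", config, "-i", "-"],
        input=sender_lines(host, values),
        text=True,
        capture_output=True,
    )
```

A cron job could call something like `send("myhost", {"smartctl.value[sda,5]": 0})`. As described above, values for items the server has not created yet are simply dropped.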

So my plan is:

Keep using trapper items. Minimizing network activity is good.
Remove the hard-coded delay. Instead, fork the user parameter script and daemonize it. Print LLD results from the master process. In the child, waitpid on the master (replaces the hard-coded delay).
Move detailed parameter collection from the master process to the child process, so the 30s timeout won't be an issue.
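
A minimal sketch of this plan, under stated assumptions. One caveat: `waitpid` only works on a process's own children, so the child cannot literally `waitpid` on its parent. The sketch substitutes a pipe: the child blocks on a read that returns end-of-file once the parent has exited. The names `run`, `discover`, and `collect_and_send` are hypothetical, the code is POSIX-only (`os.fork` is unavailable on Windows), and a single fork plus `setsid` is a simplification of full daemonization.

```python
import json
import os
import sys


def run(discover, collect_and_send):
    """Print LLD from the parent; do the slow work in a child after the parent exits."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid > 0:
        # Parent: answer the agent right away, then exit.
        os.close(read_fd)
        sys.stdout.write(json.dumps(discover()) + "\n")
        sys.stdout.flush()
        os._exit(0)  # exiting closes write_fd, which the child sees as EOF

    # Child: detach from the agent's session and release inherited stdio,
    # so the agent is not kept waiting on this process.
    os.close(write_fd)
    os.setsid()
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        os.dup2(devnull, fd)

    # Block until the parent has exited; read() returns b"" at EOF.
    os.read(read_fd, 1)
    os.close(read_fd)

    # Slow smartctl collection and zabbix_sender push happen here,
    # outside the agent's user parameter timeout.
    collect_and_send()
    os._exit(0)
```

This removes the fixed delay between the LLD reply and the push, but it does not by itself guarantee that the server has finished creating items from the prototypes by the time the values arrive.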

nobody43 commented 4 years ago

I'm good with the plan, as long as the daemon will not have polling capabilities (on its own).

nobody43 / zabbix-smartmontools

Don't sleep in sender_wrapper.py #28