sni / lmd

Livestatus Multitool Daemon - Create livestatus federation from multiple sources
https://labs.consol.de/omd/packages/lmd/
GNU General Public License v3.0

LMD 1.9.4 - disorder in `custom_variable_names` and `custom_variable_values` #110

Closed dgilm closed 2 years ago

dgilm commented 3 years ago

Here is an example:

# telnet localhost 50000 
Trying ::1...
Connected to wocu-monitoring-aio.
Escape character is '^]'.
GET hosts
Columns: custom_variable_names custom_variable_values
Filter: host_name = mysnmphostname
[[["SNMPCOMMUNITY","IFACES_BYNAME","LAT","MEMFREE_CRITICAL_THRESHOLD","LONG","SNMPVERSION","TRAFFIC_CRITICAL_THRESHOLD","CPU_WARNING_THRESHOLD","CPU_CRITICAL_THRESHOLD","MEMFREE_WARNING_THRESHOLD","TRAFFIC_WARNING_THRESHOLD","TRAFFIC_SUM_IFACES_REGEXP","DEVICEVENDOR"],["public","ethernet0/1.2815$(ethernet0/1.2815)$$(1000000000)$$(500000000)$$(b)$","10","2c","90","75","90","25","75","^ethernet0/1.2815$$","Teldat"]]

names and values don't match each other:

names = ["SNMPCOMMUNITY","IFACES_BYNAME","LAT","MEMFREE_CRITICAL_THRESHOLD","LONG","SNMPVERSION","TRAFFIC_CRITICAL_THRESHOLD","CPU_WARNING_THRESHOLD","CPU_CRITICAL_THRESHOLD","MEMFREE_WARNING_THRESHOLD","TRAFFIC_WARNING_THRESHOLD","TRAFFIC_SUM_IFACES_REGEXP","DEVICEVENDOR"]

values = ["public","ethernet0/1.2815$(ethernet0/1.2815)$$(1000000000)$$(500000000)$$(b)$","10","2c","90","75","90","25","75","^ethernet0/1.2815$$","Teldat"]

print(names[5])
SNMPVERSION

print(values[5])
75  # this is a threshold value, it should be the snmp version

print(values[3])
2c  # here it is!

>>> len(names)
13

>>> len(values)
11

Restarting lmd (full scan) fixes the problem, but within a few minutes the lists are messed up again, so the bug should be in the delta update.

Version 1.9.1 is working fine.
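The mismatch above can be checked mechanically instead of by eye. This is a minimal sketch, assuming the JSON answer holds one host row with the two list columns in the order requested; the helper names `parse_host_row` and `pair_custom_variables` are made up for illustration and are not part of LMD.

```python
import json


def parse_host_row(raw):
    """Parse a JSON livestatus answer holding one host row with two list columns."""
    rows = json.loads(raw)
    names, values = rows[0]
    return names, values


def pair_custom_variables(names, values):
    """Pair names with values, refusing to pair lists of different length."""
    if len(names) != len(values):
        raise ValueError(
            f"custom variable mismatch: {len(names)} names, {len(values)} values"
        )
    return dict(zip(names, values))
```

Raising on a length mismatch is a deliberate choice here: `zip` would otherwise silently truncate to the shorter list and hide exactly the problem reported in this issue.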

sni commented 3 years ago

what kind of backend is this?

dgilm commented 3 years ago

shinken

dgilm commented 3 years ago

Mmmm, it seems that the values for LAT and LONG are missing from the values list.

jacobbaungard commented 3 years ago

Just wanted to chime in and say I tried this on a Naemon backend and wasn't able to reproduce it. Added a single host with 5 custom variables, and it looks good so far, checking back a few times over an hour.

I also tried changing the custom variables using an external command, and it was still OK (although it does take quite a while to sync; it feels like it only updates on a full update, which seems fair enough to me).

echo "GET hosts\nColumns: custom_variable_names custom_variable_values" | unixcat /lmd_path
,[["TEST1","TEST2","TEST3","TEST4","TEST5"],["test1","test2","test3","test4","test5"]]
]
sni commented 3 years ago

could you verify if shinken returns the custom variables always in the same order? So if you fetch custom_variable_names and custom_variable_values directly from shinken, does it return the same values/order all the time?

dgilm commented 3 years ago

I was able to reproduce the issue, let me show you a few queries and their results.

I have the following query:

❯ cat query.txt
GET hosts
Columns: custom_variable_names
Filter: host_name = wocu-devel

Step 1: original state

So, I run the query against shinken through its socket:

❯ cat query.txt | socat  - UNIX-CONNECT:/var/opt/wocu/run/shinken/livestatus.sock
SNMPCOMMUNITY,STORAGE_PATH,WOCU_DAEMONS,CPU_WARN,API_USER,SNMP_MSG_MAX_SIZE,HTTP_LIST2,SNMPAUTHPROTOCOL,PSMEM_VISOR_DAEMONS,API_PASSWORD,CHECK_HTTPS_MINIMUM_DAYS,DEVICECONTACT,UPTIME_WARN,CHKLOG_CONF,SNMPPRIVKEY,CHECK_HTTPS_AUTH,TCP_VISOR_SERVICES,SNMPSECURITYNAME,NTP_WARN,SEARCH_CRITERIA,TCP_CONN_WARN,LONG,SNMPVERSION,CHECK_HTTPS_PORT,LOAD_CRIT,STORAGE_CRIT,SSH_PASSWORD,HTTPS_LIST2,DEVICELOCATION,STORAGE_WARN,SNMPPRIVPROTOCOL,SNMPAUTHKEY,LAT,CHECK_HTTPS_DOMAIN_NAME,DEVICEDESC,TRAFFIC_CRITICAL_THRESHOLD,UPTIME_CRIT,CERT_LIST2,LOAD_WARN,FILES,TRAFFIC_WARNING_THRESHOLD,IFACES,MEMORY_CRIT,CHECK_HTTPS_URI,SNMPSECURITYLEVEL,MYSQLPASS,TCP_CONN_CRIT,CPU_CRIT,CHECK_HTTP_DOMAIN_NAME,IDENTIFIER_FIELD,CHECK_HTTP_URI,SNMPCONTEXT,CHECK_HTTP_PORT,WOCU_REALMS,CHECK_HTTP_AUTH,SSH_USER,CRITICAL_TIME_RESPONSE,MEMORY_WARN,NTP_CRIT,WARNING_TIME_RESPONSE,MYSQLUSER,DEVICEVENDOR

Results: 62 names

Now I run the query against LMD:

❯ cat query.txt| netcat localhost 50000
[[["SNMPCOMMUNITY","STORAGE_PATH","WOCU_DAEMONS","CPU_WARN","API_USER","SNMP_MSG_MAX_SIZE","HTTP_LIST2","SNMPAUTHPROTOCOL","PSMEM_VISOR_DAEMONS","API_PASSWORD","CHECK_HTTPS_MINIMUM_DAYS","DEVICECONTACT","UPTIME_WARN","CHKLOG_CONF","SNMPPRIVKEY","CHECK_HTTPS_AUTH","TCP_VISOR_SERVICES","SNMPSECURITYNAME","NTP_WARN","SEARCH_CRITERIA","TCP_CONN_WARN","LONG","SNMPVERSION","CHECK_HTTPS_PORT","LOAD_CRIT","STORAGE_CRIT","SSH_PASSWORD","HTTPS_LIST2","DEVICELOCATION","STORAGE_WARN","SNMPPRIVPROTOCOL","SNMPAUTHKEY","LAT","CHECK_HTTPS_DOMAIN_NAME","DEVICEDESC","TRAFFIC_CRITICAL_THRESHOLD","UPTIME_CRIT","CERT_LIST2","LOAD_WARN","FILES","TRAFFIC_WARNING_THRESHOLD","IFACES","MEMORY_CRIT","CHECK_HTTPS_URI","SNMPSECURITYLEVEL","MYSQLPASS","TCP_CONN_CRIT","CPU_CRIT","CHECK_HTTP_DOMAIN_NAME","IDENTIFIER_FIELD","CHECK_HTTP_URI","SNMPCONTEXT","CHECK_HTTP_PORT","WOCU_REALMS","CHECK_HTTP_AUTH","SSH_USER","CRITICAL_TIME_RESPONSE","MEMORY_WARN","NTP_CRIT","WARNING_TIME_RESPONSE","MYSQLUSER","DEVICEVENDOR"]]
]

Results: 62 names and with the same order, everything is ok

Step 2: adding a _BLABLA custom variable to the configuration and restarting shinken

query against shinken:

❯ cat query.txt | socat  - UNIX-CONNECT:/var/opt/wocu/run/shinken/livestatus.sock | grep --color BLABLA
SNMPCOMMUNITY,STORAGE_PATH,WOCU_DAEMONS,CHKLOG_CONF,CPU_WARN,API_USER,SNMP_MSG_MAX_SIZE,HTTP_LIST2,SNMPAUTHPROTOCOL,PSMEM_VISOR_DAEMONS,API_PASSWORD,CHECK_HTTPS_MINIMUM_DAYS,DEVICECONTACT,UPTIME_WARN,BLABLA,SNMPPRIVKEY,CHECK_HTTPS_AUTH,TCP_VISOR_SERVICES,SNMPSECURITYNAME,NTP_WARN,SEARCH_CRITERIA,TCP_CONN_WARN,LONG,SNMPVERSION,CHECK_HTTPS_PORT,LOAD_CRIT,STORAGE_CRIT,SSH_PASSWORD,HTTPS_LIST2,DEVICELOCATION,STORAGE_WARN,SNMPPRIVPROTOCOL,SNMPAUTHKEY,LAT,CHECK_HTTPS_DOMAIN_NAME,DEVICEDESC,TRAFFIC_CRITICAL_THRESHOLD,UPTIME_CRIT,CERT_LIST2,LOAD_WARN,FILES,TRAFFIC_WARNING_THRESHOLD,IFACES,MEMORY_CRIT,CHECK_HTTPS_URI,SNMPSECURITYLEVEL,MYSQLPASS,TCP_CONN_CRIT,CPU_CRIT,CHECK_HTTP_DOMAIN_NAME,IDENTIFIER_FIELD,CHECK_HTTP_URI,SNMPCONTEXT,CHECK_HTTP_PORT,WOCU_REALMS,CHECK_HTTP_AUTH,SSH_USER,CRITICAL_TIME_RESPONSE,MEMORY_WARN,NTP_CRIT,WARNING_TIME_RESPONSE,MYSQLUSER,DEVICEVENDOR

Results: 63 names with _BLABLA

query against LMD:

❯ cat query.txt| netcat localhost 50000
[[["SNMPCOMMUNITY","STORAGE_PATH","WOCU_DAEMONS","CHKLOG_CONF","CPU_WARN","API_USER","SNMP_MSG_MAX_SIZE","HTTP_LIST2","SNMPAUTHPROTOCOL","PSMEM_VISOR_DAEMONS","API_PASSWORD","CHECK_HTTPS_MINIMUM_DAYS","DEVICECONTACT","UPTIME_WARN","BLABLA","SNMPPRIVKEY","CHECK_HTTPS_AUTH","TCP_VISOR_SERVICES","SNMPSECURITYNAME","NTP_WARN","SEARCH_CRITERIA","TCP_CONN_WARN","LONG","SNMPVERSION","CHECK_HTTPS_PORT","LOAD_CRIT","STORAGE_CRIT","SSH_PASSWORD","HTTPS_LIST2","DEVICELOCATION","STORAGE_WARN","SNMPPRIVPROTOCOL","SNMPAUTHKEY","LAT","CHECK_HTTPS_DOMAIN_NAME","DEVICEDESC","TRAFFIC_CRITICAL_THRESHOLD","UPTIME_CRIT","CERT_LIST2","LOAD_WARN","FILES","TRAFFIC_WARNING_THRESHOLD","IFACES","MEMORY_CRIT","CHECK_HTTPS_URI","SNMPSECURITYLEVEL","MYSQLPASS","TCP_CONN_CRIT","CPU_CRIT","CHECK_HTTP_DOMAIN_NAME","IDENTIFIER_FIELD","CHECK_HTTP_URI","SNMPCONTEXT","CHECK_HTTP_PORT","WOCU_REALMS","CHECK_HTTP_AUTH","SSH_USER","CRITICAL_TIME_RESPONSE","MEMORY_WARN","NTP_CRIT","WARNING_TIME_RESPONSE","MYSQLUSER","DEVICEVENDOR"]]
]

Results: 63 names with _BLABLA. Still everything is OK.

Step 3: removing that _BLABLA custom variable from the configuration and restarting shinken

query against shinken:

❯ cat query.txt | socat  - UNIX-CONNECT:/var/opt/wocu/run/shinken/livestatus.sock
SNMPCOMMUNITY,STORAGE_PATH,WOCU_DAEMONS,CPU_WARN,API_USER,SNMP_MSG_MAX_SIZE,HTTP_LIST2,SNMPAUTHPROTOCOL,PSMEM_VISOR_DAEMONS,API_PASSWORD,CHECK_HTTPS_MINIMUM_DAYS,DEVICECONTACT,UPTIME_WARN,CHKLOG_CONF,SNMPPRIVKEY,CHECK_HTTPS_AUTH,TCP_VISOR_SERVICES,SNMPSECURITYNAME,NTP_WARN,SEARCH_CRITERIA,TCP_CONN_WARN,LONG,SNMPVERSION,CHECK_HTTPS_PORT,LOAD_CRIT,STORAGE_CRIT,SSH_PASSWORD,HTTPS_LIST2,DEVICELOCATION,STORAGE_WARN,SNMPPRIVPROTOCOL,SNMPAUTHKEY,LAT,CHECK_HTTPS_DOMAIN_NAME,DEVICEDESC,TRAFFIC_CRITICAL_THRESHOLD,UPTIME_CRIT,CERT_LIST2,LOAD_WARN,FILES,TRAFFIC_WARNING_THRESHOLD,IFACES,MEMORY_CRIT,CHECK_HTTPS_URI,SNMPSECURITYLEVEL,MYSQLPASS,TCP_CONN_CRIT,CPU_CRIT,CHECK_HTTP_DOMAIN_NAME,IDENTIFIER_FIELD,CHECK_HTTP_URI,SNMPCONTEXT,CHECK_HTTP_PORT,WOCU_REALMS,CHECK_HTTP_AUTH,SSH_USER,CRITICAL_TIME_RESPONSE,MEMORY_WARN,NTP_CRIT,WARNING_TIME_RESPONSE,MYSQLUSER,DEVICEVENDOR

Results: 62 names without _BLABLA

query against LMD:

❯ cat query.txt| netcat localhost 50000
[[["SNMPCOMMUNITY","STORAGE_PATH","WOCU_DAEMONS","CHKLOG_CONF","CPU_WARN","API_USER","SNMP_MSG_MAX_SIZE","HTTP_LIST2","SNMPAUTHPROTOCOL","PSMEM_VISOR_DAEMONS","API_PASSWORD","CHECK_HTTPS_MINIMUM_DAYS","DEVICECONTACT","UPTIME_WARN","BLABLA","SNMPPRIVKEY","CHECK_HTTPS_AUTH","TCP_VISOR_SERVICES","SNMPSECURITYNAME","NTP_WARN","SEARCH_CRITERIA","TCP_CONN_WARN","LONG","SNMPVERSION","CHECK_HTTPS_PORT","LOAD_CRIT","STORAGE_CRIT","SSH_PASSWORD","HTTPS_LIST2","DEVICELOCATION","STORAGE_WARN","SNMPPRIVPROTOCOL","SNMPAUTHKEY","LAT","CHECK_HTTPS_DOMAIN_NAME","DEVICEDESC","TRAFFIC_CRITICAL_THRESHOLD","UPTIME_CRIT","CERT_LIST2","LOAD_WARN","FILES","TRAFFIC_WARNING_THRESHOLD","IFACES","MEMORY_CRIT","CHECK_HTTPS_URI","SNMPSECURITYLEVEL","MYSQLPASS","TCP_CONN_CRIT","CPU_CRIT","CHECK_HTTP_DOMAIN_NAME","IDENTIFIER_FIELD","CHECK_HTTP_URI","SNMPCONTEXT","CHECK_HTTP_PORT","WOCU_REALMS","CHECK_HTTP_AUTH","SSH_USER","CRITICAL_TIME_RESPONSE","MEMORY_WARN","NTP_CRIT","WARNING_TIME_RESPONSE","MYSQLUSER","DEVICEVENDOR"]]
]

Results: 63 names WITH _BLABLA, and the names are in a different order than shinken returns

As I said in previous messages, a full restart of LMD fixes the issue: it then returns the right number of custom variables in the expected order.
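The comparison done by hand in the three steps can be scripted. The following is a sketch under two assumptions: the backend answer is a single comma-separated line (as in the socat output above) and the LMD answer is JSON with one row and one list column (as in the netcat output above). All function names are hypothetical.

```python
import json


def names_from_csv(line):
    """Split the comma-separated answer returned by the backend socket."""
    return [name for name in line.strip().split(",") if name]


def names_from_lmd_json(raw):
    """Extract the single list column from the first row of the JSON answer."""
    rows = json.loads(raw)
    return rows[0][0]


def compare_names(backend, cached):
    """Report names present on only one side and whether the order matches."""
    return {
        "only_in_backend": [n for n in backend if n not in cached],
        "only_in_cache": [n for n in cached if n not in backend],
        "same_order": backend == cached,
    }
```

Fed with the two outputs from step 3, `only_in_cache` would be expected to contain BLABLA and `same_order` to be false.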

Let me know, @sni, if I can test anything else to help you.

sni commented 3 years ago

OK, as defined in https://github.com/sni/lmd/blob/master/lmd/objects.go#L446-L447 the variable names are static and only refreshed on a backend reload. The values are synchronized constantly. So if you restart shinken and the old variable is still there, it basically means LMD does not recognize the shinken reload at all. Could you run the query GET status\nOutputFormat: json\nColumns: program_start nagios_pid\n before and after a reload of the core? Those are the two things LMD checks to detect a backend reload.
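A toy model of that static/dynamic split shows how the two lists can drift apart. This is not LMD's code; the `CachedHost` class and the variable values are invented for illustration, and only the idea (names read once, values refreshed on every update) comes from the explanation above.

```python
class CachedHost:
    """Toy cache: names are read on a full sync, values on every update."""

    def __init__(self, backend):
        self.full_sync(backend)

    def full_sync(self, backend):
        # Both columns are read, as after a detected backend reload.
        self.names = list(backend.keys())
        self.values = list(backend.values())

    def delta_update(self, backend):
        # Only the dynamic column is refetched; the names stay untouched.
        self.values = list(backend.values())


# Hypothetical backend state; the values are made up.
backend = {"SNMPVERSION": "2c", "LAT": "40.4", "LONG": "-3.7"}
host = CachedHost(backend)

# The configuration changes, but the cache does not notice a reload.
backend = {"SNMPVERSION": "2c"}
host.delta_update(backend)
stale = (len(host.names), len(host.values))

# A full sync (for example after restarting the cache) realigns both lists.
host.full_sync(backend)
fresh = (len(host.names), len(host.values))
```

In this model `stale` ends up with more names than values, which is the same shape as the 13 names against 11 values reported at the top of the issue.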

dgilm commented 3 years ago

could you verify if shinken returns the custom variables always in the same order? So if you fetch custom_variable_names and custom_variable_values directly from shinken, does it return the same values/order all the time?

Sorry, I forgot to answer this. If I do not change the configuration, the results are always in the same order.

dgilm commented 3 years ago

OK, as defined in https://github.com/sni/lmd/blob/master/lmd/objects.go#L446-L447 the variable names are static and only refreshed on a backend reload. The values are synchronized constantly. So if you restart shinken and the old variable is still there, it basically means LMD does not recognize the shinken reload at all.

Understood.

Could you run the query GET status\nOutputFormat: json\nColumns: program_start nagios_pid\n before and after a reload of the core? Those are the two things LMD checks to detect a backend reload.

program_start changes its value but nagios_pid does not.

[[1616067095,11292]] [[1616071392,11292]]

This is because of the shinken architecture: only the arbiter daemon (responsible for managing the configuration) is restarted; the other daemons just receive the new configuration.

The nagios_pid corresponds to the shinken-scheduler daemon which is not restarted for updating the configuration.

More info: https://shinken.readthedocs.io/en/latest/09_architecture/the-shinken-architecture.html
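The before/after check can be automated with a short script. This is a sketch only: it assumes a livestatus UNIX socket that closes the connection after answering, and `query_status` and `changed_fields` are hypothetical helper names. It reports which of the two fields changed and makes no claim about how LMD combines them.

```python
import json
import socket

# Same query as requested above; the blank line ends the livestatus request.
STATUS_QUERY = "GET status\nOutputFormat: json\nColumns: program_start nagios_pid\n\n"


def query_status(socket_path):
    """Send the status query and return the row [program_start, nagios_pid]."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(STATUS_QUERY.encode())
        sock.shutdown(socket.SHUT_WR)  # signal that no more requests follow
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return json.loads(b"".join(chunks))[0]


def changed_fields(before, after):
    """Return the names of the status fields that differ between two rows."""
    fields = ("program_start", "nagios_pid")
    return [name for name, old, new in zip(fields, before, after) if old != new]
```

With the two rows quoted above, `changed_fields([1616067095, 11292], [1616071392, 11292])` returns `["program_start"]`, matching the observation that only the start time changes on a shinken reload.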

sni commented 3 years ago

That's OK. You should see a backend restarting message in LMD's log file then. Could it be that program_start is changed too early in shinken due to its architecture? What I mean is the following scenario: you reload shinken, program_start is updated to the current timestamp, but propagation of the objects takes a bit longer and LMD has already finished the initial sync (including the old custom variables). This could be tested easily by checking whether a second reload (after a small delay) fixes the issue.

dgilm commented 3 years ago

I think you are right about the behavior you've just described. It's a strange behavior and it's quite difficult to reproduce.

But, even though I understand your point, we don't control the users of our applications, right? It is not just a question of restarting again a few seconds (minutes) later.

I think that custom_variable_names and custom_variable_values are closely related, so they should both be dynamic or both be static in their definition to avoid this kind of mismatch. But surely there is something related to performance that I am missing.

sni commented 3 years ago

Well, LMD tries to minimize the data it has to synchronize, so it separates things that are static from things that change during runtime. You are seeing this issue with custom variables, but the same might then apply to all other "static" parts of the configuration as well. Maybe this could be solved in shinken if they update program_start after updating everything, not before. Or maybe they know another way to correctly determine whether shinken has finished reloading its config. The last resort would be to just delay refetching the objects by a few seconds, but that's clearly just a bad hack.

dgilm commented 3 years ago

Maybe this could be solved in shinken if they update the program_start after updating everything

I think so, it may be too early because the configuration is not yet fully managed and distributed to the broker daemon at this point: https://github.com/naparuba/shinken/blob/master/shinken/scheduler.py#L162

sni commented 3 years ago

Yeah, it would be better to set program_start at the end. But I have no idea whether that would have any side effects on shinken's side.