[Closed] dgilm closed this issue 2 years ago.
What kind of backend is this?
Shinken.
Hmm, it seems that LAT and LONG are missing from the values list.
Just wanted to chime in and say I tried this on a Naemon backend and wasn't able to reproduce it. I added a single host with 5 custom variables, and it looks good so far, checking back a few times over an hour.
I also tried changing the custom variables using an external command, and it was still OK (although it does take quite a while to sync; it feels like it only updates on a full update, which seems fair enough to me):
printf 'GET hosts\nColumns: custom_variable_names custom_variable_values\n' | unixcat /lmd_path
,[["TEST1","TEST2","TEST3","TEST4","TEST5"],["test1","test2","test3","test4","test5"]]
]
could you verify if shinken returns the custom variables always in the same order? So if you fetch custom_variable_names and custom_variable_values directly from shinken, does it return the same values/order all the time?
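One way to check this is to fetch custom_variable_names twice and diff the results. Below is a minimal sketch; the `socat` invocation (commented out) uses the socket path and host name from this thread and may differ on your system, and the two sample strings stand in for real query results:

```shell
# Sketch: compare two fetches of custom_variable_names.
# In practice, populate a and b from the live socket, e.g.:
#   a=$(printf 'GET hosts\nColumns: custom_variable_names\nFilter: host_name = wocu-devel\n' \
#        | socat - UNIX-CONNECT:/var/opt/wocu/run/shinken/livestatus.sock)
# Here, two sample results (the second one reordered) illustrate the check:
a='SNMPCOMMUNITY,STORAGE_PATH,LAT,LONG'
b='SNMPCOMMUNITY,LAT,STORAGE_PATH,LONG'

if [ "$a" = "$b" ]; then
  status="order stable"
else
  status="order changed"
  # Show which positions differ, one name per line.
  echo "$a" | tr ',' '\n' > /tmp/names_a.txt
  echo "$b" | tr ',' '\n' > /tmp/names_b.txt
  diff /tmp/names_a.txt /tmp/names_b.txt || true
fi
echo "$status"
```

Running this a few times over an hour (or across a configuration reload) would show whether shinken keeps the order stable.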
I was able to reproduce the issue, let me show you a few queries and their results.
I have the following query:
❯ cat query.txt
GET hosts
Columns: custom_variable_names
Filter: host_name = wocu-devel
So, I run the query against shinken through its socket:
❯ cat query.txt | socat - UNIX-CONNECT:/var/opt/wocu/run/shinken/livestatus.sock
SNMPCOMMUNITY,STORAGE_PATH,WOCU_DAEMONS,CPU_WARN,API_USER,SNMP_MSG_MAX_SIZE,HTTP_LIST2,SNMPAUTHPROTOCOL,PSMEM_VISOR_DAEMONS,API_PASSWORD,CHECK_HTTPS_MINIMUM_DAYS,DEVICECONTACT,UPTIME_WARN,CHKLOG_CONF,SNMPPRIVKEY,CHECK_HTTPS_AUTH,TCP_VISOR_SERVICES,SNMPSECURITYNAME,NTP_WARN,SEARCH_CRITERIA,TCP_CONN_WARN,LONG,SNMPVERSION,CHECK_HTTPS_PORT,LOAD_CRIT,STORAGE_CRIT,SSH_PASSWORD,HTTPS_LIST2,DEVICELOCATION,STORAGE_WARN,SNMPPRIVPROTOCOL,SNMPAUTHKEY,LAT,CHECK_HTTPS_DOMAIN_NAME,DEVICEDESC,TRAFFIC_CRITICAL_THRESHOLD,UPTIME_CRIT,CERT_LIST2,LOAD_WARN,FILES,TRAFFIC_WARNING_THRESHOLD,IFACES,MEMORY_CRIT,CHECK_HTTPS_URI,SNMPSECURITYLEVEL,MYSQLPASS,TCP_CONN_CRIT,CPU_CRIT,CHECK_HTTP_DOMAIN_NAME,IDENTIFIER_FIELD,CHECK_HTTP_URI,SNMPCONTEXT,CHECK_HTTP_PORT,WOCU_REALMS,CHECK_HTTP_AUTH,SSH_USER,CRITICAL_TIME_RESPONSE,MEMORY_WARN,NTP_CRIT,WARNING_TIME_RESPONSE,MYSQLUSER,DEVICEVENDOR
Results: 62 names
Now I run the query against LMD:
❯ cat query.txt | netcat localhost 50000
[[["SNMPCOMMUNITY","STORAGE_PATH","WOCU_DAEMONS","CPU_WARN","API_USER","SNMP_MSG_MAX_SIZE","HTTP_LIST2","SNMPAUTHPROTOCOL","PSMEM_VISOR_DAEMONS","API_PASSWORD","CHECK_HTTPS_MINIMUM_DAYS","DEVICECONTACT","UPTIME_WARN","CHKLOG_CONF","SNMPPRIVKEY","CHECK_HTTPS_AUTH","TCP_VISOR_SERVICES","SNMPSECURITYNAME","NTP_WARN","SEARCH_CRITERIA","TCP_CONN_WARN","LONG","SNMPVERSION","CHECK_HTTPS_PORT","LOAD_CRIT","STORAGE_CRIT","SSH_PASSWORD","HTTPS_LIST2","DEVICELOCATION","STORAGE_WARN","SNMPPRIVPROTOCOL","SNMPAUTHKEY","LAT","CHECK_HTTPS_DOMAIN_NAME","DEVICEDESC","TRAFFIC_CRITICAL_THRESHOLD","UPTIME_CRIT","CERT_LIST2","LOAD_WARN","FILES","TRAFFIC_WARNING_THRESHOLD","IFACES","MEMORY_CRIT","CHECK_HTTPS_URI","SNMPSECURITYLEVEL","MYSQLPASS","TCP_CONN_CRIT","CPU_CRIT","CHECK_HTTP_DOMAIN_NAME","IDENTIFIER_FIELD","CHECK_HTTP_URI","SNMPCONTEXT","CHECK_HTTP_PORT","WOCU_REALMS","CHECK_HTTP_AUTH","SSH_USER","CRITICAL_TIME_RESPONSE","MEMORY_WARN","NTP_CRIT","WARNING_TIME_RESPONSE","MYSQLUSER","DEVICEVENDOR"]]
]
Results: 62 names in the same order; everything is OK.
query against shinken:
❯ cat query.txt | socat - UNIX-CONNECT:/var/opt/wocu/run/shinken/livestatus.sock | grep --color BLABLA
SNMPCOMMUNITY,STORAGE_PATH,WOCU_DAEMONS,CHKLOG_CONF,CPU_WARN,API_USER,SNMP_MSG_MAX_SIZE,HTTP_LIST2,SNMPAUTHPROTOCOL,PSMEM_VISOR_DAEMONS,API_PASSWORD,CHECK_HTTPS_MINIMUM_DAYS,DEVICECONTACT,UPTIME_WARN,BLABLA,SNMPPRIVKEY,CHECK_HTTPS_AUTH,TCP_VISOR_SERVICES,SNMPSECURITYNAME,NTP_WARN,SEARCH_CRITERIA,TCP_CONN_WARN,LONG,SNMPVERSION,CHECK_HTTPS_PORT,LOAD_CRIT,STORAGE_CRIT,SSH_PASSWORD,HTTPS_LIST2,DEVICELOCATION,STORAGE_WARN,SNMPPRIVPROTOCOL,SNMPAUTHKEY,LAT,CHECK_HTTPS_DOMAIN_NAME,DEVICEDESC,TRAFFIC_CRITICAL_THRESHOLD,UPTIME_CRIT,CERT_LIST2,LOAD_WARN,FILES,TRAFFIC_WARNING_THRESHOLD,IFACES,MEMORY_CRIT,CHECK_HTTPS_URI,SNMPSECURITYLEVEL,MYSQLPASS,TCP_CONN_CRIT,CPU_CRIT,CHECK_HTTP_DOMAIN_NAME,IDENTIFIER_FIELD,CHECK_HTTP_URI,SNMPCONTEXT,CHECK_HTTP_PORT,WOCU_REALMS,CHECK_HTTP_AUTH,SSH_USER,CRITICAL_TIME_RESPONSE,MEMORY_WARN,NTP_CRIT,WARNING_TIME_RESPONSE,MYSQLUSER,DEVICEVENDOR
Results: 63 names with _BLABLA
query against LMD:
❯ cat query.txt | netcat localhost 50000
[[["SNMPCOMMUNITY","STORAGE_PATH","WOCU_DAEMONS","CHKLOG_CONF","CPU_WARN","API_USER","SNMP_MSG_MAX_SIZE","HTTP_LIST2","SNMPAUTHPROTOCOL","PSMEM_VISOR_DAEMONS","API_PASSWORD","CHECK_HTTPS_MINIMUM_DAYS","DEVICECONTACT","UPTIME_WARN","BLABLA","SNMPPRIVKEY","CHECK_HTTPS_AUTH","TCP_VISOR_SERVICES","SNMPSECURITYNAME","NTP_WARN","SEARCH_CRITERIA","TCP_CONN_WARN","LONG","SNMPVERSION","CHECK_HTTPS_PORT","LOAD_CRIT","STORAGE_CRIT","SSH_PASSWORD","HTTPS_LIST2","DEVICELOCATION","STORAGE_WARN","SNMPPRIVPROTOCOL","SNMPAUTHKEY","LAT","CHECK_HTTPS_DOMAIN_NAME","DEVICEDESC","TRAFFIC_CRITICAL_THRESHOLD","UPTIME_CRIT","CERT_LIST2","LOAD_WARN","FILES","TRAFFIC_WARNING_THRESHOLD","IFACES","MEMORY_CRIT","CHECK_HTTPS_URI","SNMPSECURITYLEVEL","MYSQLPASS","TCP_CONN_CRIT","CPU_CRIT","CHECK_HTTP_DOMAIN_NAME","IDENTIFIER_FIELD","CHECK_HTTP_URI","SNMPCONTEXT","CHECK_HTTP_PORT","WOCU_REALMS","CHECK_HTTP_AUTH","SSH_USER","CRITICAL_TIME_RESPONSE","MEMORY_WARN","NTP_CRIT","WARNING_TIME_RESPONSE","MYSQLUSER","DEVICEVENDOR"]]
]
Results: 63 names with _BLABLA. Still everything is OK.
query against shinken:
❯ cat query.txt | socat - UNIX-CONNECT:/var/opt/wocu/run/shinken/livestatus.sock
SNMPCOMMUNITY,STORAGE_PATH,WOCU_DAEMONS,CPU_WARN,API_USER,SNMP_MSG_MAX_SIZE,HTTP_LIST2,SNMPAUTHPROTOCOL,PSMEM_VISOR_DAEMONS,API_PASSWORD,CHECK_HTTPS_MINIMUM_DAYS,DEVICECONTACT,UPTIME_WARN,CHKLOG_CONF,SNMPPRIVKEY,CHECK_HTTPS_AUTH,TCP_VISOR_SERVICES,SNMPSECURITYNAME,NTP_WARN,SEARCH_CRITERIA,TCP_CONN_WARN,LONG,SNMPVERSION,CHECK_HTTPS_PORT,LOAD_CRIT,STORAGE_CRIT,SSH_PASSWORD,HTTPS_LIST2,DEVICELOCATION,STORAGE_WARN,SNMPPRIVPROTOCOL,SNMPAUTHKEY,LAT,CHECK_HTTPS_DOMAIN_NAME,DEVICEDESC,TRAFFIC_CRITICAL_THRESHOLD,UPTIME_CRIT,CERT_LIST2,LOAD_WARN,FILES,TRAFFIC_WARNING_THRESHOLD,IFACES,MEMORY_CRIT,CHECK_HTTPS_URI,SNMPSECURITYLEVEL,MYSQLPASS,TCP_CONN_CRIT,CPU_CRIT,CHECK_HTTP_DOMAIN_NAME,IDENTIFIER_FIELD,CHECK_HTTP_URI,SNMPCONTEXT,CHECK_HTTP_PORT,WOCU_REALMS,CHECK_HTTP_AUTH,SSH_USER,CRITICAL_TIME_RESPONSE,MEMORY_WARN,NTP_CRIT,WARNING_TIME_RESPONSE,MYSQLUSER,DEVICEVENDOR
Results: 62 names without _BLABLA
query against LMD:
❯ cat query.txt | netcat localhost 50000
[[["SNMPCOMMUNITY","STORAGE_PATH","WOCU_DAEMONS","CHKLOG_CONF","CPU_WARN","API_USER","SNMP_MSG_MAX_SIZE","HTTP_LIST2","SNMPAUTHPROTOCOL","PSMEM_VISOR_DAEMONS","API_PASSWORD","CHECK_HTTPS_MINIMUM_DAYS","DEVICECONTACT","UPTIME_WARN","BLABLA","SNMPPRIVKEY","CHECK_HTTPS_AUTH","TCP_VISOR_SERVICES","SNMPSECURITYNAME","NTP_WARN","SEARCH_CRITERIA","TCP_CONN_WARN","LONG","SNMPVERSION","CHECK_HTTPS_PORT","LOAD_CRIT","STORAGE_CRIT","SSH_PASSWORD","HTTPS_LIST2","DEVICELOCATION","STORAGE_WARN","SNMPPRIVPROTOCOL","SNMPAUTHKEY","LAT","CHECK_HTTPS_DOMAIN_NAME","DEVICEDESC","TRAFFIC_CRITICAL_THRESHOLD","UPTIME_CRIT","CERT_LIST2","LOAD_WARN","FILES","TRAFFIC_WARNING_THRESHOLD","IFACES","MEMORY_CRIT","CHECK_HTTPS_URI","SNMPSECURITYLEVEL","MYSQLPASS","TCP_CONN_CRIT","CPU_CRIT","CHECK_HTTP_DOMAIN_NAME","IDENTIFIER_FIELD","CHECK_HTTP_URI","SNMPCONTEXT","CHECK_HTTP_PORT","WOCU_REALMS","CHECK_HTTP_AUTH","SSH_USER","CRITICAL_TIME_RESPONSE","MEMORY_WARN","NTP_CRIT","WARNING_TIME_RESPONSE","MYSQLUSER","DEVICEVENDOR"]]
]
Results: 63 names WITH _BLABLA, and the columns are in a different order.
As I said in previous messages, a full restart of LMD fixes the issue, returning the right number of custom variables in the expected order.
Let me know, @sni, if there is anything else I can test to help you.
Ok, as defined in https://github.com/sni/lmd/blob/master/lmd/objects.go#L446-L447, the variable names are static and only refreshed on a backend reload. The values are synchronized constantly. So if you restart shinken and the old variable is still there, it basically means LMD does not recognize the shinken reload at all.
Could you run the query GET status\nOutputFormat: json\nColumns: program_start nagios_pid\n
before and after a reload of the core? Those are the two things LMD checks to detect a backend reload.
could you verify if shinken returns the custom variables always in the same order? So if you fetch custom_variable_names and custom_variable_values directly from shinken, does it return the same values/order all the time?
I forgot to answer this one. If I do not change the configuration, the results always come back in the same order.
Ok, as defined in https://github.com/sni/lmd/blob/master/lmd/objects.go#L446-L447, the variable names are static and only refreshed on a backend reload. The values are synchronized constantly. So if you restart shinken and the old variable is still there, it basically means LMD does not recognize the shinken reload at all.
Understood.
Could you run the query
GET status\nOutputFormat: json\nColumns: program_start nagios_pid\n
before and after a reload of the core? Because those are the two things LMD checks to detect a backend reload.
program_start changes its value but nagios_pid does not:
[[1616067095,11292]] [[1616071392,11292]]
This is because of the shinken architecture, only the arbiter daemon (responsible of managing the configuration) is restarted, the other daemons just receive the new configuration.
The nagios_pid corresponds to the shinken-scheduler daemon, which is not restarted when updating the configuration.
More info: https://shinken.readthedocs.io/en/latest/09_architecture/the-shinken-architecture.html
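The two sample results above can be checked mechanically. Here is a short shell sketch that parses the two-column JSON output posted above (the parsing assumes exactly that one-row, two-column shape):

```shell
# Before/after outputs of
#   GET status\nOutputFormat: json\nColumns: program_start nagios_pid
# as posted in this thread:
before='[[1616067095,11292]]'
after='[[1616071392,11292]]'

# Strip the JSON brackets, then split the two fields on the comma.
start_before=$(echo "$before" | tr -d '[]' | cut -d, -f1)
pid_before=$(echo "$before" | tr -d '[]' | cut -d, -f2)
start_after=$(echo "$after" | tr -d '[]' | cut -d, -f1)
pid_after=$(echo "$after" | tr -d '[]' | cut -d, -f2)

# Report which of the two reload indicators actually moved.
[ "$start_before" != "$start_after" ] && echo "program_start changed"
[ "$pid_before" = "$pid_after" ] && echo "nagios_pid unchanged"
```

Since LMD looks at both fields, a changed program_start alone should already trigger reload detection, which is consistent with the timing theory below.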
That's OK. You should see a "backend restarting" message in LMD's logfile then. Could it be that program_start is changed too early in shinken due to its architecture? What I mean is the following scenario: you reload shinken, program_start is updated to the current timestamp, but propagation of the objects takes a bit longer, and LMD has already finished the initial sync (including the old custom variables).
This could be tested easily by checking whether a second reload (with a small delay) fixes the issue.
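The double-reload test could look like the sketch below. `reload_shinken` is a hypothetical stand-in; replace its body with whatever command actually reloads the arbiter on your system (e.g. an init script, which is an assumption here):

```shell
# reload_shinken is a placeholder function -- replace its body with your
# real reload command (e.g. "service shinken-arbiter reload", an assumption).
reload_shinken() { echo "reload issued"; }

reload_shinken   # first reload: program_start is bumped right away
sleep 1          # in practice, wait long enough for object propagation
reload_shinken   # second reload: LMD refetches the now-complete config
```

If the custom variables are correct after the second reload, that would confirm the early-program_start theory.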
I think you are right about the behavior you've just described. It's strange and quite difficult to reproduce.
But even though I understand your point, we don't control the users of our applications, right? It's not just a question of restarting again a few seconds (or minutes) later.
I think that custom_variable_names and custom_variable_values are closely related, so both should be either dynamic or static in their definition to avoid this kind of mismatch. But surely there is something performance-related that I am missing.
Well, LMD tries to minimize the data it has to synchronize, so it separates things that are static from things that change during runtime. You are seeing this issue with custom variables, but it might then be the case with all the other "static" parts of the configuration as well.
Maybe this could be solved in shinken if they updated the program_start after updating everything, not before. Or maybe they know another way to correctly determine whether shinken has finished reloading its config. The last resort would be to just delay refetching the objects by a few seconds, but that's clearly just a bad hack.
Maybe this could be solved in shinken if they update the program_start after updating everything
I think so; it may be set too early, because the configuration is not yet fully processed and distributed to the broker daemon at this point: https://github.com/naparuba/shinken/blob/master/shinken/scheduler.py#L162
Yeah, it would be better to set program_start at the end. But I have no idea whether there would be any side effects on shinken's side.
Here is an example where names and values don't match each other:
Restarting LMD (full scan) fixes the problem, but within a few minutes the lists are messed up again, so the bug should be in the delta update.
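A quick way to watch for that drift is to compare the lengths of the two lists per host. The sketch below uses sample data; in practice, names and values would come from a `GET hosts\nColumns: custom_variable_names custom_variable_values` query against LMD:

```shell
# Sample lists; the values list is deliberately one entry short
# to demonstrate the kind of mismatch this bug produces.
names='LAT,LONG,SNMPCOMMUNITY'
values='40.4,-3.7'

# Count entries by splitting each comma-separated list into lines.
n=$(echo "$names" | tr ',' '\n' | wc -l)
v=$(echo "$values" | tr ',' '\n' | wc -l)

if [ "$n" -ne "$v" ]; then
  echo "mismatch: $n names vs $v values"
else
  echo "lists match"
fi
```

Run periodically (e.g. from cron) against each backend, this would show how soon after a restart the delta update corrupts the lists.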
Version 1.9.1 is working fine.