ikzelf / zbxdb

Zabbix database monitoring, the easy and extendable way
GNU General Public License v3.0

Previous run still running error #55

Closed ddmo closed 4 years ago

ddmo commented 4 years ago

Hi, I got this error a few minutes after starting the script:

2020-10-19 09:39:03,216___main__ Logging in /home/zbxdb/log/zbxdb_sender.log
2020-10-19 09:39:03,217___main__ Namespace(cfile='/etc/zabbix/zabbix_agentd.conf', verbosity=0, zbxdb_out='zbxora_out')
2020-10-19 09:39:03,217___main__ 2020-10-19-0939 previous run still running(or crashed(lock file: /home/zbxdb/zbxdb_sender/zbxdb_sender.lock))

I've tried stopping everything, killing the process, and removing the lock file, but after a restart I get the same result. Can you help me? Thanks

ikzelf commented 4 years ago

Can you increase the verbosity (-vv)? It is possible that a process was still running; the other option is that the process crashed. What the process should do is:

0) lock
1) move all files from zbxora_out/ to zbxdb_sender/in/
2) for every file in zbxdb_sender/in/, send it to Zabbix and add it to an archive in zbxdb_sender/archive/
3) unlock
4) clean older archives
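The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual zbxdb_sender.py code: the function name `run_once`, the `send` callback, and the directory layout passed in are assumptions made for the example, and step 4 (cleaning older archives) is left out. The point it shows is that wrapping steps 1 and 2 in `try`/`finally` removes the lock file even when sending raises an error.

```python
import os
import shutil
from pathlib import Path


def run_once(out_dir: Path, work_dir: Path, send) -> list:
    """One hypothetical sender cycle: lock, move, send + archive, unlock."""
    lock = work_dir / "zbxdb_sender.lock"
    in_dir = work_dir / "in"
    archive = work_dir / "archive"
    in_dir.mkdir(parents=True, exist_ok=True)
    archive.mkdir(parents=True, exist_ok=True)

    # 0) lock: O_EXCL makes creation fail if the lock file already exists
    try:
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"previous run still running (or crashed), lock file: {lock}")
    os.close(fd)

    handled = []
    try:
        # 1) move all collected files into the sender's in/ directory
        for f in sorted(out_dir.iterdir()):
            if f.is_file():
                shutil.move(str(f), str(in_dir / f.name))
        # 2) send every file, then archive it whether or not sending worked
        for f in sorted(in_dir.iterdir()):
            try:
                send(f)
            except Exception:
                pass  # assumption: a failed send should not stop archiving
            shutil.move(str(f), str(archive / f.name))
            handled.append(f.name)
    finally:
        # 3) unlock, even if something above raised
        lock.unlink()
    return handled
```

If the process is killed outright (for example with SIGKILL), the `finally` block does not run and the lock file stays behind, which would produce the "previous run still running" message on the next start.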

ddmo commented 4 years ago

Output with the -vv parameter (logs taken after restarting and cleaning everything):

2020-10-19 10:03:01,995___main___30_Logging in /home/zbxdb/log/zbxdb_sender.log
2020-10-19 10:03:01,996___main___30_Namespace(cfile='/etc/zabbix/zabbix_agentd.conf', verbosity=2, zbxdb_out='zbxora_out')
2020-10-19 10:03:01,997___main___30_Using /etc/zabbix/zabbix_agentd.conf
2020-10-19 10:03:01,997___main___30_2020-10-19-1003 processing zbxdb.odb.zbx
2020-10-19 10:03:02,045___main___40_zabbix_sender zbxdb.odb.zbx error: 2
2020-10-19 10:04:02,192___main___30_Logging in /home/zbxdb/log/zbxdb_sender.log
2020-10-19 10:04:02,194___main___30_Namespace(cfile='/etc/zabbix/zabbix_agentd.conf', verbosity=2, zbxdb_out='zbxora_out')
2020-10-19 10:04:02,194___main___30_2020-10-19-1004 previous run still running(or crashed(lock file: /home/zbxdb/zbxdb_sender/zbxdb_sender.lock))

The sender script fails, so no new data is collected after the crash. Could the process be crashing during the unlock operation? How can I verify that? Thanks a lot

ikzelf commented 4 years ago

The run from 10:03 seems to have stopped for some reason. After zabbix_sender returned error code 2 (sending failed), it should have continued with archiving. Did any errors show up in the other zbxdb_sender log files? Could it be that there is still a process running that started around 10:03? It would help to see that.

ikzelf commented 4 years ago

What is in zbxdb_sender/archive/?

ddmo commented 4 years ago

No other messages. The process is still running:

zbxdb 50902 1 0 09:53 ? 00:00:00 /home/zbxdb/.pyenv/versions/3.6.5/bin/python3 /home/zbxdb/zbxdb/bin/zbxdb.py -c etc/zbxdb.odb.cfg

but since the last error, no data has been collected in the archive directory.

ikzelf commented 4 years ago

This looks OK. The zbxdb.py process[es] should keep running; zbxdb_sender.py runs only once a minute to send the files generated by zbxdb.py.

ddmo commented 4 years ago

Hmm, OK... Are there any additional checks I can do? Thanks

ikzelf commented 4 years ago

At the moment I still have no idea what extra checks to suggest, but I am very open to suggestions that make this work better.

ddmo commented 4 years ago

I tried sending the data to the Zabbix server manually. Here is the response:

zbxdb@cszabbix:~$ zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -T -i /home/zbxdb/zbxdb_sender/archive/2020-10-19-1219/zbxdb.odb.zbx
Response from "192.168.97.9:10051": "processed: 31; failed: 219; total: 250; seconds spent: 0.002043"
Response from "192.168.97.9:10051": "processed: 136; failed: 112; total: 248; seconds spent: 0.002767"
sent: 498; skipped: 0; total: 498
zbxdb@cszabbix:~$ echo $?
2

I think this might be the problem...
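For reference, the zabbix_sender man page describes exit status 2 as "data was sent, but processing of at least one value failed", which matches the response lines above. A small sketch for totalling those response lines is shown here; the helper name `summarize` and the regular expression are assumptions for illustration and are not part of zbxdb.

```python
import re

# Matches the per-batch counters in a zabbix_sender "Response from ..." line.
RESPONSE_RE = re.compile(r"processed: (\d+); failed: (\d+); total: (\d+)")


def summarize(output: str) -> dict:
    """Add up processed/failed/total over all response lines in the output."""
    totals = {"processed": 0, "failed": 0, "total": 0}
    for match in RESPONSE_RE.finditer(output):
        processed, failed, total = map(int, match.groups())
        totals["processed"] += processed
        totals["failed"] += failed
        totals["total"] += total
    return totals


sample = '''Response from "192.168.97.9:10051": "processed: 31; failed: 219; total: 250; seconds spent: 0.002043"
Response from "192.168.97.9:10051": "processed: 136; failed: 112; total: 248; seconds spent: 0.002767"
sent: 498; skipped: 0; total: 498'''

print(summarize(sample))
```

For the output quoted above this adds up to 167 processed and 331 failed out of 498 values, so most of the items sent were not accepted by the server.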

ikzelf commented 4 years ago

That indicates a mismatch between what is sent and what the Zabbix server knows about. For example, a tablespace discovery is done once an hour, but the data is sent to Zabbix blindly. After a while the tablespaces will become known. There are some 28 discoveries... This should not be a reason to stop or crash without an orderly exit in which the lock file is removed.
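One way to tell a crashed run from one that is still running is to store the process ID in the lock file and check whether that process still exists. The sketch below is only an idea under stated assumptions, not how zbxdb_sender.py works: it assumes a POSIX system, assumes the lock file contains a PID (nothing in this thread says it does), and the function name `lock_holder_alive` is made up for the example.

```python
import os
from pathlib import Path


def lock_holder_alive(lock: Path) -> bool:
    """Return True if the PID stored in the lock file belongs to a live process."""
    try:
        pid = int(lock.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # no lock file, or no usable PID inside it
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one process
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False  # no such process: the lock is stale
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

A sender could then remove a lock whose holder is gone and continue, instead of refusing to start until the file is deleted by hand.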

ikzelf commented 4 years ago

How is your status now? Is your data coming in as it should? And as always, if you have suggestions on how to improve things, I would be more than happy to make them better; I just need a bit of input.

ddmo commented 4 years ago

Hi! The script seems to be OK now, but I don't know the reason :) Thanks a lot