tango-controls / TangoSourceDistribution

4 stars 5 forks source link

DataBaseds crashes #61

Closed Diego91RA closed 4 years ago

Diego91RA commented 4 years ago

Hello again.

I've noticed that my TangoDB proccess crashes after some time, about 5+ hours.

Debian 9.11, Tango 9.3.4-rc2, omniORB 4.2.3, two different VMs

Here are last lines of output for /usr/local/bin/DataBaseds 2 -ORBendPoint giop:tcp::10000 -v5 after launching Jive:

1577081481 [139768269588224] DEBUG dserver/DataBaseds/2 Sleep for : 1800000 1577082037 [139768244410112] DEBUG dserver/DataBaseds/2 DeviceImpl::name arrived 1577082037 [139768244410112] DEBUG dserver/DataBaseds/2 Leaving DeviceImpl::name 1577082037 [139768244410112] DEBUG dserver/DataBaseds/2 DeviceImpl::name arrived 1577082037 [139768244410112] DEBUG dserver/DataBaseds/2 Leaving DeviceImpl::name 1577082037 [139768244410112] DEBUG dserver/DataBaseds/2 Device_4Impl::command_inout_4 arrived, source = 2, command = DbGetProperty 1577082037 [139768244410112] DEBUG dserver/DataBaseds/2 Device_2Impl::command_inout_2 arrived, source = 2, command = DbGetProperty 1577082037 [139768244410112] DEBUG dserver/DataBaseds/2 In get_monitor() cache, thread = 13, ctr = 0 1577082037 [139768244410112] DEBUG dserver/DataBaseds/2 In rel_monitor() cache, ctr = 1, thread = 13 1577082037 [139768244410112] DEBUG dserver/DataBaseds/2 Signalling ! 1577082037 [139768244410112] DEBUG dserver/DataBaseds/2 DeviceImpl::command_inout(): command received : DbGetProperty 1577082037 [139768244410112] DEBUG dserver/DataBaseds/2 SubDevDiag::get_associated_device() entering ... 1577082037 [139768244410112] DEBUG dserver/DataBaseds/2 SubDevDiag::get_associated_device() found : No associated device name! 1577082037 [139768244410112] DEBUG dserver/DataBaseds/2 SubDevDiag::set_associated_device() entering ... 1577082037 [139768244410112] DEBUG dserver/DataBaseds/2 Entering DeviceClass::command_handler() method 1577082037 [139768244410112] DEBUG sys/database/2 DataBase::always_executed_hook() sys/database/2 1577082037 [139768244410112] INFO dserver/DataBaseds/2 DbGetPropertyClass::execute(): arrived 1577082037 [139768244410112] DEBUG sys/database/2 DataBase::DbGetProperty() - sys/database/2 1577082037 [139768244410112] INFO sys/database/2 DataBase::db_get_property(): get 1 properties for object CtrlSystem 1577082037 [139768244410112] DEBUG sys/database/2 DataBase::replace_wildcard() wildcard in Services 1577082037 [139768244410112] DEBUG sys/database/2 DataBase::replace_wildcard() wildcard out Services 1577082037 [139768244410112] DEBUG sys/database/2 DataBase::db_get_property(): sql_query SELECT count,value,name FROM property WHERE object = "CtrlSystem" AND name LIKE "Services" ORDER BY count 1577082037 [139768244410112] DEBUG sys/database/2 Using MySQL connection with semaphore 0 unknown: debugger aborting because missing DBUG_RETURN or DBUG_VOID_RETURN macro in function "?func"

Happy holidays, sorry for disturbing :)

bourtemb commented 4 years ago

Hi @Diego91RA , the "debugger aborting error" might be a bug in the mariaDB version you are using?:

https://jira.mariadb.org/browse/MDEV-8316 https://jira.mariadb.org/browse/MDEV-12276

What MariaDB version are you using?

Diego91RA commented 4 years ago

Hi. I'm using version 10.1.41 installed from official Debian repo, not the latest patch for this minor version, but was released last summer. The bug report you've mentioned says the problem was fixed in 10.1.23 in 2017.

On another VM MariaDB was built from sources one year ago, version 10.3.12, but Tango is 9.2.5a, and I don't have such problem. I'll try to build and test it later this week.

Thanks.

Diego91RA commented 4 years ago

I've updated MariaDB to 10.4.11 from official repos https://downloads.mariadb.org/mariadb/repositories/

And still have exactly the same error :(

bourtemb commented 4 years ago

Do you have a core file? Can you please share the backtrace?

Diego91RA commented 4 years ago

Here is gdb backtrace

>1578551056 [140737001805568] DEBUG sys/database/2 Using MySQL connection with semaphore 0
unknown: debugger aborting because missing DBUG_RETURN or DBUG_VOID_RETURN macro in function "?func"
Thread 13 "DataBaseds" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffe2ffd700 (LWP 9792)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) backtrace
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff4d8a42a in __GI_abort () at abort.c:89
#2  0x00007ffff7348f67 in ?? () from /usr/lib/x86_64-linux-gnu/libmariadb.so.2
#3  0x00007ffff7356755 in ?? () from /usr/lib/x86_64-linux-gnu/libmariadb.so.2
#4  0x00007ffff735eb11 in ?? () from /usr/lib/x86_64-linux-gnu/libmariadb.so.2
#5  0x00007ffff735edad in ?? () from /usr/lib/x86_64-linux-gnu/libmariadb.so.2
#6  0x00007ffff735a149 in mysql_real_query () from /usr/lib/x86_64-linux-gnu/libmariadb.so.2
#7  0x00005555555e0bb2 in DataBase_ns::DataBase::query (this=this@entry=0x555555881380, 
    sql_query="SELECT count,value,name FROM property WHERE object = \"CtrlSystem\" AND name LIKE \"Services\" ORDER BY count", 
    method=method@entry=0x5555555eeff7 "db_get_property()", con_nb=<optimized out>, con_nb@entry=-1) at DataBaseUtils.cpp:568
#8  0x0000555555595fdc in DataBase_ns::DataBase::db_get_property (this=0x555555881380, argin=0x7fffd0005a90) at DataBase.cpp:5149
#9  0x0000555555573bc1 in DataBase_ns::DbGetPropertyClass::execute (this=0x55555587b870, device=0x555555881380, in_any=...)
    at DataBaseClass.cpp:1206
#10 0x00007ffff7913824 in Tango::DeviceClass::command_handler (this=0x555555874e50, device=device@entry=0x555555881380, 
    command="DbGetProperty", in_any=...) at deviceclass.cpp:1197
#11 0x00007ffff78ccc75 in Tango::DeviceImpl::command_inout (this=0x555555881380, in_cmd=0x7fffd00052e0 "DbGetProperty", in_any=...)
    at device.cpp:1537
#12 0x00007ffff78eb9cd in Tango::Device_2Impl::command_inout_2 (this=0x555555881380, in_cmd=<optimized out>, in_data=..., 
    source=Tango::CACHE_DEV) at device_2.cpp:438
#13 0x00007ffff7903de6 in Tango::Device_4Impl::command_inout_4 (this=0x555555881380, in_cmd=0x7fffd00052e0 "DbGetProperty", 
    in_data=..., source=Tango::CACHE_DEV, cl_id=...) at device_4.cpp:473
#14 0x00007ffff7a8e242 in _0RL_lcfn_6fe2f94a21a10053_a3000000 (cd=0x7fffe2ffc9c0, svnt=<optimized out>) at tangoSK.cpp:5383
#15 0x00007ffff69c3d19 in omniCallHandle::upcall(omniServant*, omniCallDescriptor&) () from /usr/local/lib/libomniORB4.so.2
#16 0x00007ffff7a9d5f7 in Tango::_impl_Device_4::_dispatch (this=<optimized out>, _handle=...) at tangoSK.cpp:5958
#17 0x00007ffff7a9e8ea in Tango::_impl_Device_5::_dispatch (this=<optimized out>, _handle=...) at tangoSK.cpp:7478
#18 0x00007ffff69bc6fd in omni::omniOrbPOA::dispatch(omniCallHandle&, omniLocalIdentity*) () from /usr/local/lib/libomniORB4.so.2
#19 0x00007ffff6996c77 in omniLocalIdentity::dispatch(omniCallHandle&) () from /usr/local/lib/libomniORB4.so.2
#20 0x00007ffff69e3e40 in omni::GIOP_S::handleRequest() () from /usr/local/lib/libomniORB4.so.2
#21 0x00007ffff69e4e58 in omni::GIOP_S::dispatcher() () from /usr/local/lib/libomniORB4.so.2
#22 0x00007ffff69e0e55 in omni::giopWorker::execute() () from /usr/local/lib/libomniORB4.so.2
#23 0x00007ffff6989ce8 in omniAsyncWorker::real_run() () from /usr/local/lib/libomniORB4.so.2
#24 0x00007ffff698adef in omniAsyncPoolServer::workerRun(omniAsyncWorker*) () from /usr/local/lib/libomniORB4.so.2
#25 0x00007ffff698995b in omniAsyncWorker::mid_run() () from /usr/local/lib/libomniORB4.so.2
#26 0x00007ffff7a6796c in Tango::create_PyPerThData (info=...) at utils.cpp:3255
#27 0x00007ffff698ad8a in omniAsyncWorker::run(void*) () from /usr/local/lib/libomniORB4.so.2
#28 0x00007ffff5bb2749 in omni_thread_wrapper () from /usr/local/lib/libomnithread.so.4
#29 0x00007ffff59994a4 in start_thread (arg=0x7fffe2ffd700) at pthread_create.c:456
#30 0x00007ffff4e3ed0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
bourtemb commented 4 years ago

Thank you very much for the backtrace. The problem seems to come from libmariadb library or on the way the OS is configured. I found this recent bug reported on Gentoo: https://bugs.gentoo.org/701710 So other people have been seeing the same error recently. Do you have a VM you could share to reproduce easily the problem?

Diego91RA commented 4 years ago

Oops, VM I've shared before was with Debian 10.

Here is for Debian 9 https://yadi.sk/d/E4zF1XHcZ9VAlw MariaDB 10.1.41

bourtemb commented 4 years ago

You wrote the problem is occurring after about 5+ hours. @Diego91RA Do you do anything special during these 5 hours or is the system rather idle? Does the DataBaseds receive many commands?

Diego91RA commented 4 years ago

No, just executed /usr/local/bin/DataBaseds 2 -ORBendPoint giop:tcp::10000 -v5 and run jive from time to time to check if process crash after I run it.

bourtemb commented 4 years ago

I had a look at the logs in /var/log/mysql in the VM you provided and I saw some messages like the following:

2020-01-09 14:27:12 140612343242112 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2020-01-09 14:27:12 140612343242112 [Note] InnoDB: The InnoDB memory heap is disabled
2020-01-09 14:27:12 140612343242112 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2020-01-09 14:27:12 140612343242112 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2020-01-09 14:27:12 140612343242112 [Note] InnoDB: Compressed tables use zlib 1.2.8
2020-01-09 14:27:12 140612343242112 [Note] InnoDB: Using Linux native AIO
2020-01-09 14:27:12 140612343242112 [Note] InnoDB: Using SSE crc32 instructions
2020-01-09 14:27:12 140612343242112 [Note] InnoDB: Initializing buffer pool, size = 128.0M
2020-01-09 14:27:12 140612343242112 [Note] InnoDB: Completed initialization of buffer pool
2020-01-09 14:27:12 140612343242112 [Note] InnoDB: Highest supported file format is Barracuda.
2020-01-09 14:27:12 140612343242112 [Note] InnoDB: The log sequence number 1616767 in ibdata file do not match the log sequence number 1616777 in the ib_logfiles!
2020-01-09 14:27:13 140612343242112 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
2020-01-09 14:27:13 140612343242112 [Note] InnoDB: 128 rollback segment(s) are active.
2020-01-09 14:27:13 140612343242112 [Note] InnoDB: Waiting for purge to start
2020-01-09 14:27:13 140612343242112 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.44-86.0 started; log sequence number 1616777
2020-01-09 14:27:13 140611659224832 [Note] InnoDB: Dumping buffer pool(s) not yet started
2020-01-09 14:27:13 140612343242112 [Note] Plugin 'FEEDBACK' is disabled.
2020-01-09 14:27:13 140612343242112 [Note] Recovering after a crash using tc.log
2020-01-09 14:27:13 140612343242112 [Note] Starting crash recovery...
2020-01-09 14:27:13 140612343242112 [Note] Crash recovery finished.
2020-01-09 14:27:13 140612343242112 [Note] Server socket created on IP: '127.0.0.1'.
2020-01-09 14:27:13 140612343242112 [Note] /usr/sbin/mysqld: ready for connections.
Version: '10.1.41-MariaDB-0+deb9u1'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  Debian 9.9

This seems to indicate that the MariaDB server detected a crash yesterday before you sent the VM. I now get some error messages like this one in the latest /var/log/mysql logs:

2020-01-10 17:12:53 140479346779904 [ERROR] mysqld: Table './tango/property_class' is marked as crashed and should be repaired
2020-01-10 17:12:53 140479346779904 [Warning] Checking table:   './tango/property_class'
2020-01-10 17:12:54 140479346779904 [ERROR] mysqld: Table './tango/property_device' is marked as crashed and should be repaired
2020-01-10 17:12:54 140479346779904 [Warning] Checking table:   './tango/property_device'
2020-01-10 17:12:54 140479346779904 [ERROR] mysqld: Table './tango/property_device_hist' is marked as crashed and should be repaired
2020-01-10 17:12:54 140479346779904 [Warning] Checking table:   './tango/property_device_hist'
2020-01-10 17:12:54 140479346779904 [ERROR] mysqld: Table './tango/device' is marked as crashed and should be repaired
2020-01-10 17:12:54 140479346779904 [Warning] Checking table:   './tango/device'
bourtemb commented 4 years ago

How much RAM and CPUs did you allocate for your VM?

bourtemb commented 4 years ago

Hi, I managed to reproduce the DataBaseds crash with your VM. Thanks! And I even managed to reproduce it much faster and reliably. I edited mysql configuration and added the following lines in /etc/mysql/mariadb.conf.d under [mysqld] section :

wait_timeout        = 300
interactive_timeout = 300

I then restarted the mysql (mariadb) daemon and the DataBaseds. I got the crash after 5 minutes only (no query to the DataBaseds during at least 5 minutes). The default for these timeout variables is 28800 seconds on your VM (8 hours). I changed the timeout to 60 and the problem can appear after 1 minute of DataBaseds inactivity. If I keep playing with jive, the DataBaseds can stay alive more than 1 minute with these timeout variables set to 60 s.

So it looks like the crash occurs when the connection between the DataBaseDs and mysql is idle for too long and mysql has closed the idle connection?

Diego91RA commented 4 years ago

Hello. So the problem is not in the VM resources. Just in case - for the VirtualBox I have 2 CPU and 4 GB RAM, for Proxmox VM on our cluster - 4 CPU and 8 GB RAM.

I see lines in the console output for DatabaseDS during the idle:

1578914714 [140634025568000] DEBUG dserver/DataBaseds/2 ----------> Time = 576914714,021918 Store sub device property data if needed! 1578914714 [140634025568000] DEBUG dserver/DataBaseds/2 SubDevDiag::store_sub_devices() entering ... 1578914714 [140634025568000] DEBUG dserver/DataBaseds/2 Sub device property storage, next wake_up at 576916514,022783 1578914714 [140634025568000] DEBUG dserver/DataBaseds/2 Sleep for : 1800000 1578916514 [140634025568000] DEBUG dserver/DataBaseds/2 ----------> Time = 576916514,022117 Store sub device property data if needed! 1578916514 [140634025568000] DEBUG dserver/DataBaseds/2 SubDevDiag::store_sub_devices() entering ... 1578916514 [140634025568000] DEBUG dserver/DataBaseds/2 Sub device property storage, next wake_up at 576918314,022783 1578916514 [140634025568000] DEBUG dserver/DataBaseds/2 Sleep for : 1800000

Does it mean that DS stores data to the DB every 30 minutes? That should be enough for MySQL to keep the connection alive, if default timeout is 8 hours.

bourtemb commented 4 years ago

1578916514 [140634025568000] DEBUG dserver/DataBaseds/2 ----------> Time = 576916514,022117 Store sub device property data if needed!

This line indicates that it does store data only if needed. I think when nothing happens as in your use case, there is no new sub device to register, so no query to the DB.

Diego91RA commented 4 years ago

Ok. So what can I do to get around this problem - set timeout values to maximum 31536000, which is 365 days, and wait for someone eventually edit something in the TangoDB? Or create daemon that will edit some property, e.g. write timestamp, so it will be stored in the MySQL like keep-alive package?

Btw, is it bug or feature? Because usually I did nothing with default MySQL/MariaDB configuration and everything worked fine on different VMs, distributions and Tango versions.

bourtemb commented 4 years ago

Ok. So what can I do to get around this problem - set timeout values to maximum 31536000, which is 365 days, and wait for someone eventually edit something in the TangoDB? Or create daemon that will edit some property, e.g. write timestamp, so it will be stored in the MySQL like keep-alive package?

Having a Starter running and a TangoTest started once on your host and stopped should be enough. The Starter will try to reconnect regularly to TangoTest and will query the DB. This is probably why nobody ever complained about that issue.

Btw, is it bug or feature? Because usually I did nothing with default MySQL/MariaDB configuration and everything worked fine on different VMs, distributions and Tango versions.

I have the feeling it is a bug in libmariadbclient library. I need more time to investigate a bit deeper.

bourtemb commented 4 years ago

I confirm this bug is a libmariadb2 bug. I downloaded mariadb-connector-c-2.3.7 (latest libmariadb2 version I could find) and compiled it in debug mode.

I reproduced the crash and I got the following backtrace with gdb:

Core was generated by `./Databaseds 2 -ORBendPoint giop:tcp::10000 -v5'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7fc63dffb700 (LWP 30848))]
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fc64860242a in __GI_abort () at abort.c:89
#2  0x00007fc649231c6b in DbugExit (why=0x7fc63dff8d90 "missing DBUG_RETURN or DBUG_VOID_RETURN macro in function \"?func\"\n") at /home/tango/mariadbConnector/mariadb-connector-c-2.3.7-src/libmariadb/dbug.c:2126
#3  0x00007fc64923074c in _db_return_ (_line_=2186, _stack_frame_=0x7fc63dff8fc0) at /home/tango/mariadbConnector/mariadb-connector-c-2.3.7-src/libmariadb/dbug.c:1295
#4  0x00007fc6492387d0 in mysql_reconnect (mysql=0x5617f7d19790) at /home/tango/mariadbConnector/mariadb-connector-c-2.3.7-src/libmariadb/libmariadb.c:2186
#5  0x00007fc649234537 in mthd_my_send_cmd (mysql=0x5617f7d19790, command=MYSQL_COM_QUERY, arg=0x7fc6240055a0 "SELECT count,value,name FROM property WHERE object = \"CtrlSystem\" AND name LIKE \"Services\" ORDER BY count", length=105, 
    skipp_check=1 '\001', opt_arg=0x0) at /home/tango/mariadbConnector/mariadb-connector-c-2.3.7-src/libmariadb/libmariadb.c:622
#6  0x00007fc64923466c in simple_command (mysql=0x5617f7d19790, command=MYSQL_COM_QUERY, arg=0x7fc6240055a0 "SELECT count,value,name FROM property WHERE object = \"CtrlSystem\" AND name LIKE \"Services\" ORDER BY count", length=105, 
    skipp_check=1 '\001', opt_arg=0x0) at /home/tango/mariadbConnector/mariadb-connector-c-2.3.7-src/libmariadb/libmariadb.c:646
#7  0x00007fc6492398c9 in mysql_real_query (mysql=0x5617f7d19790, query=0x7fc6240055a0 "SELECT count,value,name FROM property WHERE object = \"CtrlSystem\" AND name LIKE \"Services\" ORDER BY count", length=105)
    at /home/tango/mariadbConnector/mariadb-connector-c-2.3.7-src/libmariadb/libmariadb.c:2508
#8  0x00005617f79ffc0e in DataBase_ns::DataBase::query (this=this@entry=0x5617f7d0cd20, sql_query="SELECT count,value,name FROM property WHERE object = \"CtrlSystem\" AND name LIKE \"Services\" ORDER BY count", 
    method=method@entry=0x5617f7a0e2e6 "db_get_property()", con_nb=<optimized out>, con_nb@entry=-1) at /home/tango/GIT/TangoDatabase/DataBaseUtils.cpp:568
#9  0x00005617f79c839e in DataBase_ns::DataBase::db_get_property (this=<optimized out>, argin=<optimized out>) at /home/tango/GIT/TangoDatabase/DataBase.cpp:5149
#10 0x00005617f79f2f1f in DataBase_ns::DbGetPropertyClass::execute (this=0x5617f7d07210, device=0x5617f7d0cd20, in_any=...) at /home/tango/GIT/TangoDatabase/DataBaseClass.cpp:1206
#11 0x00007fc64a963824 in Tango::DeviceClass::command_handler (this=0x5617f7d007f0, device=device@entry=0x5617f7d0cd20, command="DbGetProperty", in_any=...) at deviceclass.cpp:1197
#12 0x00007fc64a91cc75 in Tango::DeviceImpl::command_inout (this=0x5617f7d0cd20, in_cmd=0x7fc624002da0 "DbGetProperty", in_any=...) at device.cpp:1537
#13 0x00007fc64a93b9cd in Tango::Device_2Impl::command_inout_2 (this=0x5617f7d0cd20, in_cmd=<optimized out>, in_data=..., source=Tango::CACHE_DEV) at device_2.cpp:438
#14 0x00007fc64a953de6 in Tango::Device_4Impl::command_inout_4 (this=0x5617f7d0cd20, in_cmd=0x7fc624002da0 "DbGetProperty", in_data=..., source=Tango::CACHE_DEV, cl_id=...) at device_4.cpp:473
#15 0x00007fc64aade242 in _0RL_lcfn_6fe2f94a21a10053_a3000000 (cd=0x7fc63dffa9c0, svnt=<optimized out>) at tangoSK.cpp:5383
#16 0x00007fc649a2fd19 in omniCallHandle::upcall(omniServant*, omniCallDescriptor&) () from /usr/local/lib/libomniORB4.so.2
#17 0x00007fc64aaed5f7 in Tango::_impl_Device_4::_dispatch (this=<optimized out>, _handle=...) at tangoSK.cpp:5958
#18 0x00007fc64aaee8ea in Tango::_impl_Device_5::_dispatch (this=<optimized out>, _handle=...) at tangoSK.cpp:7478
#19 0x00007fc649a286fd in omni::omniOrbPOA::dispatch(omniCallHandle&, omniLocalIdentity*) () from /usr/local/lib/libomniORB4.so.2
#20 0x00007fc649a02c77 in omniLocalIdentity::dispatch(omniCallHandle&) () from /usr/local/lib/libomniORB4.so.2
#21 0x00007fc649a4fe40 in omni::GIOP_S::handleRequest() () from /usr/local/lib/libomniORB4.so.2
#22 0x00007fc649a50e58 in omni::GIOP_S::dispatcher() () from /usr/local/lib/libomniORB4.so.2
#23 0x00007fc649a4ce55 in omni::giopWorker::execute() () from /usr/local/lib/libomniORB4.so.2
#24 0x00007fc6499f5ce8 in omniAsyncWorker::real_run() () from /usr/local/lib/libomniORB4.so.2
#25 0x00007fc6499f6def in omniAsyncPoolServer::workerRun(omniAsyncWorker*) () from /usr/local/lib/libomniORB4.so.2
#26 0x00007fc6499f595b in omniAsyncWorker::mid_run() () from /usr/local/lib/libomniORB4.so.2
#27 0x00007fc64aab796c in Tango::create_PyPerThData (info=...) at utils.cpp:3255
#28 0x00007fc6499f6d8a in omniAsyncWorker::run(void*) () from /usr/local/lib/libomniORB4.so.2
#29 0x00007fc64972e749 in omni_thread_wrapper () from /usr/local/lib/libomnithread.so.4
#30 0x00007fc647f9c4a4 in start_thread (arg=0x7fc63dffb700) at pthread_create.c:456
#31 0x00007fc6486b6d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

I then searched on the web for libmariadb bugs related to mysql_reconnect method and I found this interesting bug reported on Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=880197#10

The reply from the upstream author is interesting:

we fixed some missing dbug return stuff - the solution is to disable the debug library, it has too many bad side effects (That's why we removed it from C/C 3.0).

Afaik CMake has no settings for, but a export CFLAGS=-DDBUG_OFF should do the job.

I recompiled libmariadb.so.2 with export CFLAGS=-DDBUG_OFF and I confirm the issue disappears.

It looks like there is no libmariadb3 package in Debian Stretch so you can recompile manually libmariadb.so.2 using export CFLAGS=-DDBUG_OFF or try to compile libmariadb.so.3 (I didn't try that). Or you can ensure that there is at least one query to the Database before the end of the wait_timeout and interactive_timeout. In our control system, the DataBaseds is running on Debian 9 too but we never encountered this issue because we have thousands of devices and there is always some activity on the database.

t-b commented 4 years ago

I did not had any issue on debian unstable with libmariadb3 1:10.3.21-2 (https://packages.debian.org/sid/libmariadb3) and I had the VM instance running for three days with little interaction.

bourtemb commented 4 years ago

According to the libmariadb upstream author, the problem is not present in libmariadb3

t-b commented 4 years ago

@bourtemb Should we add a configure check for compatible libmardiadb versions?

Diego91RA commented 4 years ago

Debian 9 has only libmariadb2 in the official repo by default. You have to add MariaDB repos or change yours to "sid" for libmariadb3. I've found it on another VM with updated version https://github.com/tango-controls/TangoSourceDistribution/issues/61#issuecomment-572106149 along with libmariadb2 from apt installation. DatabaseDS was compiled with libmariadb2, so update didn't change anything and I still got the same error, so I'm recompiling Tango now.

I think version check and an advice to install MariaDB from their sources will be useful.

bourtemb commented 4 years ago

I did a test with libmariadbclient18 and it seems like the problem is not present in this version. I realized this is actually the version we are currently using on our Debian 9 systems.

The problem will not be present in libmariadb3 because all these debug macros which were causing this kind of crashes were removed. So it could be good to add a check and display an advice or warning if libmariadb2 is detected indeed. @t-b, could you please take care of that?

Diego91RA commented 4 years ago

Well, DS is working fine now for 2+ hours, even with wait_timeout and interactive_timeout set to 60 seconds.

Thanks for your help.

t-b commented 4 years ago

@bourtemb Will do.