signalwire / freeswitch

FreeSWITCH is a Software Defined Telecom Stack enabling the digital transformation from proprietary telecom switches to a versatile software implementation that runs on any commodity hardware. From a Raspberry PI to a multi-core server, FreeSWITCH can unlock the telecommunications potential of any device.
https://freeswitch.com/#getting-started
Other
3.57k stars 1.42k forks source link

core db gets corrupted #420

Closed mbodbg closed 1 year ago

mbodbg commented 4 years ago

We configured an in-memory filesystem for the freeswitch core db. (according to https://freeswitch.org/confluence/display/FREESWITCH/Performance+Testing+and+Configurations#PerformanceTestingandConfigurations-FreeSWITCH'score.dbI/OBottleneck)

We have the situation that Freeswitch is running about 2 weeks properly and then suddenly the core db gets corrupted. The log fills up with the following error messages:

2020-02-25 20:25:30.151902 [ERR] switch_core_sqldb.c:732 [db="/dev/shm/core.db",type="core_db"] NATIVE SQL ERR [file is encrypted or is not a database]
BEGIN EXCLUSIVE
2020-02-25 20:25:30.151902 [CRIT] switch_core_sqldb.c:2163 ERROR [file is encrypted or is not a database], [db="/dev/shm/core.db",type="core_db"]
2020-02-25 20:25:30.151902 [ERR] switch_core_sqldb.c:732 [db="/dev/shm/core.db",type="core_db"] NATIVE SQL ERR [cannot commit - no transaction is active]
COMMIT

We use the following version:

FreeSWITCH Version 1.10.2-release-13-f7bdd3845a~64bit (-release-13-f7bdd3845a 64bit) on 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2 (2019-11-11) x86_64 GNU/Linux

Is there something we can check, or is it better to use a local postgres db for the core db to avoid such error?

huongbn commented 4 years ago

Using postgresl to improve this problem

presidentpatel commented 4 years ago

Hello, We are also facing similar issue.. SQLlite is getting corrupted and locked.. so we have to stop freeswitch and delete existing file and restart then only it is working and it is intermittent issue .. so what could be reason behind this issue ? can any one faced or facing such issue... ADVANCE THANX>... Vipul

outruner commented 4 years ago

hello same issue here. It appears only with wss connections.

huongbn commented 4 years ago

You can use:

On Thu, Sep 24, 2020 at 3:32 PM outruner notifications@github.com wrote:

hello same issue here. It appears only with wss connections.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/signalwire/freeswitch/issues/420#issuecomment-698200453, or unsubscribe https://github.com/notifications/unsubscribe-auth/AH5QL7TJ2WAOM7ALQTYP27DSHL73PANCNFSM4K3RNRQA .

themsley-voiceflex commented 1 year ago

I have this problem on freeswitch 1.10.10 using wss and so far I have looked at 2 copies of the corrupted database and both of them have what looks like 47 or 48 bytes of overlaid storage at the start of the file. In both cases the first 4 bytes of the file are 0x1703030012. I was hunting around and found a link to https://engineering.fb.com/2014/08/12/ios/debugging-file-corruption-on-ios/ from https://www.sqlite.org/howtocorrupt.html and the similarities are very strong. They state that they had an overlay and "Using a hex analyzer, we found a common prefix across the attachments: 17 03 03 00 28.". In our case it is 0x1703030012 so not identical but close enough to make me think it will be a very similar problem.

themsley-voiceflex commented 1 year ago

Seems it happened again while I was looking. Again, od -X /dev/shm/core.db starts with

0000000 00030317 cdc06712 26b9d175 b91379ec

andywolk commented 1 year ago

Try updating sqlite to the latest.

themsley-voiceflex commented 1 year ago

It is already the latest available for RHEL 9: [root@fstrtc01 tmp]# rpm -q sqlite sqlite-3.34.1-6.el9_1.x86_64

andywolk commented 1 year ago

3.34 is too old

themsley-voiceflex commented 1 year ago

Could you please read the Facebook engineering post I linked to and see the similarities there, most especially the fact that the overwrite is at the start of the file where the "SQLite format 3" eyecatcher has been overwritten completely and the data there starts 0x1703030012. They blame a race condition in their code that writes to a filehandle that has been closed and then an SSL_write to that reused filehandle overwrites their SQLite database file.

I now have 3 separate copies of a corrupt core.db and in each of them the first 5 bytes are always that.

andywolk commented 1 year ago

The latest sqlite is 3.43.2 at the moment https://www.sqlite.org/index.html

themsley-voiceflex commented 1 year ago

I do not think an update to SQLite is possible on a RHEL system without breaking lots of stuff. The package manager and half the distro tools are written to use the version of SQLite that ships with the distro. I am also completely unconvinced that it has anything to do with the problem.

The FB overwrite was with 0x1703030028 and ours is 0x1703030012 and the comment they make about how they fixed this is "The SSL layer was writing to a socket that was already closed and subsequently reassigned to our database file.". It is also the #1 reason mentioned in the SQLite "How To Corrupt An SQLite Database File" link.

This seems like too coincidental to be a coincidence especially since it is identical across 3 corrupted core.db files so far, 2 of them today and one from last week. Unfortunately since they reside in /dev/shm I am unable to pull older corrupted versions. However given the regularity with which freeswitch eats its own core.db when using wss I am sure I'll have some examples in the days to come.

themsley-voiceflex commented 1 year ago

Third corrupt db today also starts

0000000 00030317 0298eb12 10f4a06d a38d8a70

andywolk commented 1 year ago

You need to update sqlite to the latest. Make sure you are using a version containing a fix described in this article: https://www.philipotoole.com/how-i-found-a-bug-in-sqlite/ The version you are using is behind the fix.

themsley-voiceflex commented 1 year ago

I agree that bug does sound like it might cause similar problems but it won't be applicable in this case as that bug only affects sqlite 'in memory' databases which are a special sqlite thing described on https://www.sqlite.org/inmemorydb.html and this is not one of those. This is a plain ordinary file on a filesystem that happens to reside in memory and that code would not apply. In addition, I tried the stress test C program in the link and it refuses to run on RHEL 9 with the distro copy of sqlite, saying that memdb is not a valid vfs. That means it's either not enabled or the code isn't present so I really doubt if we are hitting this bug in the standard build.

However, I went ahead and built the sqlite 3.40 SRPM from Fedora 38 for RHEL 9 and installed that. I checked the code in src/memdb.c for the patch and it is identical to the current code on https://www.sqlite.org/src/file?name=src/memdb.c&ci=trunk and I've also run the stress test in-memory.c from that link that runs for 10 million iterations. It has had no errors after running for 18 hours so far and that link appears to imply that it happens fairly quickly.

After about 9 hours uptime freeswitch ate its own database again. Again the first 5 bytes of the overlay are 0x1703030012. So this is not a SQLite problem.

themsley-voiceflex commented 1 year ago

I patched ws.c with some debug, file sofia-sip-ws-debug.patch.txt uploaded containing it, and watched journalctl and saw this lot today with the smoking gun in the middle there at 19:10:47. It would appear that it would have tried those other files mentioned but I suspect they failed because a) they're owned root:root and b) mod_localstream has them open but I suspect it opens R/O.

Oct 17 13:41:07 fstrtc01.voiceflex.com stdbuf[359010]: WS ws_close fd 207 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav Oct 17 13:57:05 fstrtc01.voiceflex.com stdbuf[359010]: WS ws_close fd 144 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav Oct 17 13:57:05 fstrtc01.voiceflex.com stdbuf[359010]: WS ws_close fd 165 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav Oct 17 13:59:33 fstrtc01.voiceflex.com stdbuf[359010]: WS ws_close fd 115 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav Oct 17 13:59:33 fstrtc01.voiceflex.com stdbuf[359010]: WS ws_close fd 114 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav Oct 17 14:30:00 fstrtc01.voiceflex.com stdbuf[359010]: WS ws_close fd 49 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav Oct 17 15:27:26 fstrtc01.voiceflex.com stdbuf[359010]: WS ws_close fd 188 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav Oct 17 16:54:31 fstrtc01.voiceflex.com stdbuf[359010]: WS ws_close fd 178 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav Oct 17 17:43:03 fstrtc01.voiceflex.com stdbuf[359010]: WS ws_close fd 72 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav Oct 17 17:43:03 fstrtc01.voiceflex.com stdbuf[359010]: WS ws_close fd 68 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav Oct 17 18:56:51 fstrtc01.voiceflex.com stdbuf[359010]: WS ws_close fd 90 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav Oct 17 18:56:51 fstrtc01.voiceflex.com stdbuf[359010]: WS ws_close fd 101 target /opt/freeswitch/sounds/music/32000/bleep/DialtoneCallWaiting32.wav Oct 17 19:10:47 fstrtc01.voiceflex.com stdbuf[359010]: WS ws_close fd 114 target /dev/shm/core.db Oct 17 19:18:38 fstrtc01.voiceflex.com stdbuf[370611]: WS ws_close fd 145 target /opt/freeswitch/sounds/music/32000/bleep/DialtoneCallWaiting32.wav Oct 17 19:25:03 fstrtc01.voiceflex.com stdbuf[370611]: WS ws_close fd 181 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav Oct 17 19:25:03 fstrtc01.voiceflex.com stdbuf[370611]: WS ws_close fd 183 target /opt/freeswitch/sounds/music/32000/bleep/DialtoneCallWaiting32.wav Oct 17 19:25:03 fstrtc01.voiceflex.com stdbuf[370611]: WS ws_close fd 184 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav Oct 17 20:38:09 fstrtc01.voiceflex.com stdbuf[370611]: WS ws_close fd 113 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav sofia-sip-ws-debug.patch.txt

themsley-voiceflex commented 1 year ago

That timestamp corresponds with the timestamps in fs_cli for the same problem

2023-10-17 19:10:47.246372 99.30% [ERR] switch_core_sqldb.c:728 [db="/dev/shm/core.db",type="core_db"] NATIVE SQL ERR [file is not a database] BEGIN EXCLUSIVE 2023-10-17 19:10:47.246372 99.30% [CRIT] switch_core_sqldb.c:2109 ERROR [file is not a database], [db="/dev/shm/core.db",type="core_db"] 2023-10-17 19:10:47.246372 99.30% [ERR] switch_core_sqldb.c:728 [db="/dev/shm/core.db",type="core_db"] NATIVE SQL ERR [cannot commit - no transaction is active] COMMIT

themsley-voiceflex commented 1 year ago

sofia-sip-ws-ipaddr.patch.txt

I added lots more debug info to this last night. My apologies for the crudeness of the code but I have not used C in anger for about 15 years and I was more concentrating on getting the relevant info than I was about style or good practices. So with that full patch applied, it did it again this morning at 04:50 and the debug output via stdout for the code that that I added was:

Oct 18 04:17:03 fstrtc01.voiceflex.com stdbuf[377673]: WS ws_init fd 85 block 0 close 0 ip x.171.46.203:56444 Oct 18 04:17:03 fstrtc01.voiceflex.com stdbuf[377673]: WS ws_raw_write fd 85 target socket:[8645404] Oct 18 04:17:03 fstrtc01.voiceflex.com stdbuf[377673]: WS ws_raw_write fd 85 target socket:[8645404] Oct 18 04:17:03 fstrtc01.voiceflex.com stdbuf[377673]: WS ws_raw_write fd 85 target socket:[8645404] Oct 18 04:50:46 fstrtc01.voiceflex.com stdbuf[377673]: WS ws_destroy 1 fd 85 Oct 18 04:50:46 fstrtc01.voiceflex.com stdbuf[377673]: WS ws_close fd 85 target /dev/shm/core.db

That is the output from journalctl -u freeswitch | grep "fd 85"

Please note that my freeswitch systemd unit file is running freeswitch via /usr/bin/stdbuf -oL to force line buffered mode on stdout so the timestamps on the output should be correct. I have saved the entire 32MB journalctl -u freeswitch output for future use if we need it.

I am pretty sure that there is some sort of race condition going on here that causes this.

themsley-voiceflex commented 1 year ago

I'd also note that this system does have ipv6 enabled so it has ipv6 local addresses starting fe80:: but iptables denies all connections to it so it is extremely unlikely that any non-ipv4 addresses will ever connect to freeswitch on this server yet my debug output often contains ip addresses 0.0.0.0:0 which I am fairly sure is another symptom of the problem.

themsley-voiceflex commented 1 year ago

This bug can potentially overwrite ANY file that freeswitch opens in read/write mode so it's actually quite serious.

themsley-voiceflex commented 1 year ago

I would attach the full stdout output for the period between 04:17:03 - 04:50:47 (from fd 85 being opened to it being closed) but it contains 650 or so lines and identifies many ip addresses that are attempting to use this service but I do not see any way to mark attachments as private. Once gzipped it's only 3KB. If there is a private way for me to supply this, please let me know, it is saved and available.

themsley-voiceflex commented 1 year ago

@andywolk so this is with the patched version of sqlite - not that that makes any difference at all since that bug is never going to be hit unless freeswitch is magically translating /dev/shm/core.db to

rc = sqlite3_open(":memory:", &db);

without saying. Which I'm pretty sure it's not as the file /dev/shm/core.db exists and is a real sqlite database (most of the time!).

I am now 100% sure this is either a race condition or freeswitch is just broken in this code. I suspect that issues https://github.com/signalwire/freeswitch/issues/1934 and https://github.com/signalwire/freeswitch/issues/2247 may also be symptoms of the same problem. 1934 I am more sure about as we have that all time time. 2247 I am just guessing from the description and the symptoms I am seeing here where it's calling ws_close() all the time with wsh->sock set to invalid socket numbers that point at random things.