Open donnyxray opened 3 weeks ago
The large WAL is most likely a side-effect, not a cause.
Couple of questions:
Thank you for your response.
Took a quick look at the code. Are you calling the Backup APIs at all?
Yes.
I call /db/backup
once a day to make a backup. This runs at 6:25am though. The CAS errors did not start until 2:27pm.
You confirmed that the backup completed?
Wow. You got me there. The backup is too small, so something went wrong.
It's strange. When I manually run the command now, it completes correctly in less than 2 seconds:
curl -s -XGET 127.0.0.1:4001/db/backup -o /home/tmp/rqlite-240619.sqlite3
OK -- few things we're going to do.
-raft-level-log=INFO
going forward?Please examine the logs to see how long snapshotting is taking (it should be clear from the logs what I'm talking about). Also, when your system gets into this state I need the output of both /status
and /debug/vars
from the node that is logging this message.
Check the return code, via -v
to curl
.
Thanks. Will set that up now.
I just tagged v8.26.1 -- you should have release binaries and docker images within the hour.
Follow the steps in this comment with v8.26.1 and get back to me with your findings.
Looks like the using -raft-level-log=INFO
is not supported:
Jun 20 20:43:30 web5 systemd[1]: Stopped rqlite.service.
Jun 20 20:43:30 web5 systemd[1]: Started rqlite.service.
Jun 20 20:43:30 web5 rqlited[4177063]: flag provided but not defined: -raft-level-log
Jun 20 20:43:30 web5 rqlited[4177063]: rqlite is a lightweight, distributed relational database, which uses SQLite as its
Jun 20 20:43:30 web5 rqlited[4177063]: storage engine. It provides an easy-to-use, fault-tolerant store for relational data.
Jun 20 20:43:30 web5 rqlited[4177063]: Visit https://www.rqlite.io to learn more.
...
My full command:
/usr/local/sbin/rqlited -fk -node-id 5 -http-addr 127.0.0.1:4001 -raft-addr 23.x.x.x:4002 -raft-level-log=INFO -bootstrap-expect 3 -join 104.x.x.x:4002,23.x.x.x:4002,23.x.x.x:4002 /var/lib/rqlite
UPDATE:
Changing it to -raft-log-level=INFO
fixed it.
Right, typo on my part.
An occasional "can't snapshot due to CAS" error is OK. It could happen if a backup is in progress, which holds a lock on the database such that Snapshotting can't proceed. The Raft system will just try again later, but at the cost that the WAL file will grow.
rqlite checks for snapshotting every 10 secs. If you have a long-running backup operation it could block multiple cycles of Snapshotting. I'm surprised your WAL got as big as it did, unless you're inserting fairly large rows, or inserting at a high rate (100s inserts per sec).
But once the backup completes, the next snapshot should run fine. In fact backing up (if you requesting a binary mode) actually triggers a snapshot first. INFO
level logging should show you all this. None of this is a bug, but you should monitor how long your backups are taking.
A bug would be something like the backup runs to completion (or fails) but the CAS lock is never released. I don't see any sign in the code this could happen.
I think the logs leading up to the node getting in this place will tell us a lot, and then getting /status.
Also I'd like to see precisely how you are initiating your backup. There are two types of backups -- a binary backup, and a SQL text dump of the database. While they are both called "backups" there execute very different code paths, and only one of those code paths may be relevant to what you're seeing.
I've tagged v8.26.2, which has some extra logging of error conditions. If you could run that version it might help -- binaries available within the hour.
I do really backup using the curl command supplied earlier:
#!/bin/sh
YYMMDD=`date +%Y%m%d`
...
curl -s -XGET 127.0.0.1:4001/db/backup -o /home/tmp/rqlite-${YYMMDD}.sqlite3
...
I have this in a a cron.daily script which then zips the backup and uploads it to an off-site server.
Running the backup currently takes 4.6 seconds. I see no CAS error during or after. At least not today.
Will update to 8.26.2.
I would be passing -v
to curl and logging it. Are you not checking the return code of curl
? You should always do that, to ensure that the backup request was successful.
You could so other things too, like checking that what you get is an actual SQLite file.
Great point. I'll run a PRAGMA integrity_check;
on the resulting file.
The problem was away for a bit - but it's here now.
Also note that I've added a few read-only nodes that were not there previously.
Status:
{
"build": {
"branch": "master",
"build_time": "2024-06-20T13:57:16+0000",
"commit": "897fc2e0bf7e2e3ee4e26e3fa71080d9824d47bd",
"compiler_command": "gcc",
"compiler_toolchain": "gc",
"version": "v8.26.2"
},
"cluster": {
"addr": "1.1.1.1:4002",
"api_addr": "127.0.0.1:4001",
"https": "false"
},
"http": {
"auth": "enabled",
"bind_addr": "127.0.0.1:4001",
"cluster": {
"conn_pool_stats": {
"2.2.2.2:4002": {
"idle": 56,
"max_open_connections": 64,
"open_connections": 4226
},
"3.3.3.3:4002": {
"idle": 1,
"max_open_connections": 64,
"open_connections": 1
}
},
"local_node_addr": "1.1.1.1:4002",
"timeout": "30s"
},
"queue": {
"_default": {
"batch_size": 128,
"max_size": 1024,
"sequence_number": 0,
"timeout": "50ms"
}
},
"tls": {
"enabled": "false"
}
},
"mux": {
"addr": "1.1.1.1:4002",
"handlers": "\u0001\u0002",
"timeout": "30s",
"tls": "disabled"
},
"network": {
"interfaces": {
"eno1": {
"flags": "up|broadcast|multicast|running",
"hardware_address": "3c:ec:ef:c8:57:2c",
"addresses": [
{
"address": "1.1.1.1.1/29"
},
{
"address": "1.1.1.2/29"
},
{
"address": "1.1.1.3/29"
},
{
"address": "1.1.1.4/29"
},
{
"address": "1.1.1.5/29"
},
{
"address": "fe80::3eec:efff:fec8:572c/64"
}
]
},
"eth1": {
"flags": "up|broadcast|multicast|running",
"hardware_address": "3c:ec:ef:c8:57:2d",
"addresses": [
{
"address": "10.65.0.4/24"
},
{
"address": "fe80::3eec:efff:fec8:572d/64"
}
]
},
"lo": {
"flags": "up|loopback|running",
"hardware_address": "",
"addresses": [
{
"address": "127.0.0.1/8"
},
{
"address": "::1/128"
}
]
},
"usb0": {
"flags": "broadcast|multicast",
"hardware_address": "6e:5b:88:8a:74:d5",
"addresses": null
},
"wg0": {
"flags": "up|pointtopoint|running",
"hardware_address": "",
"addresses": [
{
"address": "10.62.0.11/24"
}
]
}
}
},
"node": {
"current_time": "2024-06-24T09:40:35.770097295+08:00",
"start_time": "2024-06-20T22:12:43.192165135+08:00",
"uptime": "83h27m52.577932497s"
},
"os": {
"executable": "/usr/local/sbin/rqlited",
"hostname": "web1",
"page_size": 4096,
"pid": 2295804,
"ppid": 1
},
"runtime": {
"GOARCH": "amd64",
"GOMAXPROCS": 12,
"GOOS": "linux",
"num_cpu": 12,
"num_goroutine": 130,
"version": "go1.22.4"
},
"store": {
"addr": "1.1.1.1:4002",
"apply_timeout": "10s",
"db_applied_index": 9266067,
"db_conf": {
"fk_constraints": true
},
"dir": "/var/lib/rqlite",
"dir_size": 10277224122,
"dir_size_friendly": "10 GB",
"election_timeout": "1s",
"fsm_index": 9266067,
"fsm_update_time": "2024-06-24T09:40:35.766120923+08:00",
"heartbeat_timeout": "1s",
"leader": {
"addr": "1.1.1.1:4002",
"node_id": "1"
},
"leader_appended_at_time": "2024-06-24T09:40:35.754131445+08:00",
"no_freelist_sync": false,
"node_id": "1",
"nodes": [
{
"id": "1",
"addr": "1.1.1.1:4002",
"suffrage": "Voter"
},
{
"id": "2",
"addr": "7.7.7.7:4002",
"suffrage": "Nonvoter"
},
{
"id": "3",
"addr": "3.3.3.3:4002",
"suffrage": "Voter"
},
{
"id": "5",
"addr": "2.2.2.2:4002",
"suffrage": "Voter"
},
{
"id": "6",
"addr": "4.4.4.4:4002",
"suffrage": "Nonvoter"
},
{
"id": "7",
"addr": "5.5.5.5:4002",
"suffrage": "Nonvoter"
},
{
"id": "8",
"addr": "6.6.6.6:4002",
"suffrage": "Nonvoter"
}
],
"observer": {
"dropped": 0,
"observed": 155
},
"open": {},
"raft": {
"applied_index": 9266067,
"bolt": {
"TxStats": {
"PageCount": 30366494,
"PageAlloc": 124381159424,
"CursorCount": 11422850,
"NodeCount": 13497700,
"NodeDeref": 8,
"Rebalance": 1226115,
"RebalanceTime": 1162526944,
"Split": 754726,
"Spill": 13635920,
"SpillTime": 71871743844,
"Write": 19765428,
"WriteTime": 1442862570791
},
"FreePageN": 0,
"PendingPageN": 6,
"FreeAlloc": 24576,
"FreelistInuse": 64,
"TxN": 1031653,
"OpenTxN": 0
},
"commit_index": 9266067,
"fsm_pending": 0,
"last_contact": 0,
"last_log_index": 9266067,
"last_log_term": 647,
"last_snapshot_index": 8537249,
"last_snapshot_term": 647,
"latest_configuration": "[{Suffrage:Voter ID:1 Address:1.1.1.1:4002} {Suffrage:Voter ID:3 Address:3.3.3.3:4002} {Suffrage:Voter ID:5 Address:2.2.2.2:4002} {Suffrage:Nonvoter ID:6 Address:4.4.4.4:4002} {Suffrage:Nonvoter ID:7 Address:5.5.5.5:4002} {Suffrage:Nonvoter ID:8 Address:6.6.6.6:4002} {Suffrage:Nonvoter ID:2 Address:7.7.7.7:4002}]",
"latest_configuration_index": 0,
"log_size": 593436672,
"num_peers": 2,
"protocol_version": 3,
"protocol_version_max": 3,
"protocol_version_min": 0,
"snapshot_version_max": 1,
"snapshot_version_min": 0,
"state": "Leader",
"term": 647,
"transport": {
"command_commit_index": 8272433,
"leader_commit_index": 8272433
},
"voter": true
},
"ready": true,
"reap_read_only_timeout": "0s",
"reap_timeout": "0s",
"request_marshaler": {
"compression_batch": 50,
"compression_size": 1024,
"force_compression": false
},
"snapshot_cas": {
"owner": "backup"
},
"snapshot_interval": "10s",
"snapshot_store": {
"db_path": "/var/lib/rqlite/rsnapshots/647-8537249-1719123901479.db",
"dir": "/var/lib/rqlite/rsnapshots",
"snapshots": [
"647-8537249-1719123901479"
]
},
"snapshot_threshold": 8192,
"sqlite3": {
"compile_options": [
"ATOMIC_INTRINSICS=1",
"COMPILER=gcc-9.4.0",
"DEFAULT_AUTOVACUUM",
"DEFAULT_CACHE_SIZE=-2000",
"DEFAULT_FILE_FORMAT=4",
"DEFAULT_JOURNAL_SIZE_LIMIT=-1",
"DEFAULT_MMAP_SIZE=0",
"DEFAULT_PAGE_SIZE=4096",
"DEFAULT_PCACHE_INITSZ=20",
"DEFAULT_RECURSIVE_TRIGGERS",
"DEFAULT_SECTOR_SIZE=4096",
"DEFAULT_SYNCHRONOUS=2",
"DEFAULT_WAL_AUTOCHECKPOINT=1000",
"DEFAULT_WAL_SYNCHRONOUS=1",
"DEFAULT_WORKER_THREADS=0",
"DIRECT_OVERFLOW_READ",
"ENABLE_DBSTAT_VTAB",
"ENABLE_FTS3",
"ENABLE_FTS3_PARENTHESIS",
"ENABLE_FTS5",
"ENABLE_RTREE",
"ENABLE_UPDATE_DELETE_LIMIT",
"MALLOC_SOFT_LIMIT=1024",
"MAX_ATTACHED=10",
"MAX_COLUMN=2000",
"MAX_COMPOUND_SELECT=500",
"MAX_DEFAULT_PAGE_SIZE=8192",
"MAX_EXPR_DEPTH=1000",
"MAX_FUNCTION_ARG=127",
"MAX_LENGTH=1000000000",
"MAX_LIKE_PATTERN_LENGTH=50000",
"MAX_MMAP_SIZE=0x7fff0000",
"MAX_PAGE_COUNT=0xfffffffe",
"MAX_PAGE_SIZE=65536",
"MAX_SQL_LENGTH=1000000000",
"MAX_TRIGGER_DEPTH=1000",
"MAX_VARIABLE_NUMBER=32766",
"MAX_VDBE_OP=250000000",
"MAX_WORKER_THREADS=8",
"MUTEX_PTHREADS",
"OMIT_DEPRECATED",
"OMIT_LOAD_EXTENSION",
"OMIT_SHARED_CACHE",
"SYSTEM_MALLOC",
"TEMP_STORE=1",
"THREADSAFE=1"
],
"conn_pool_stats": {
"ro": {
"max_open_connections": 0,
"open_connections": 1,
"in_use": 0,
"idle": 1,
"wait_count": 0,
"wait_duration": 0,
"max_idle_closed": 1,
"max_idle_time_closed": 1010,
"max_lifetime_closed": 0
},
"rw": {
"max_open_connections": 1,
"open_connections": 1,
"in_use": 0,
"idle": 1,
"wait_count": 0,
"wait_duration": 0,
"max_idle_closed": 0,
"max_idle_time_closed": 0,
"max_lifetime_closed": 0
}
},
"db_size": 1044926464,
"db_size_friendly": "1.0 GB",
"last_modified": "2024-06-24T09:40:35.751263163+08:00",
"mem_stats": {
"cache_size": -2000,
"freelist_count": 0,
"hard_heap_limit": 0,
"max_page_count": 4294967294,
"page_count": 255109,
"page_size": 4096,
"soft_heap_limit": 0
},
"path": "/var/lib/rqlite/db.sqlite",
"pragmas": {
"ro": {
"busy_timeout": "5000",
"foreign_keys": "1",
"journal_mode": "wal",
"synchronous": "0",
"wal_autocheckpoint": "1000"
},
"rw": {
"busy_timeout": "5000",
"foreign_keys": "1",
"journal_mode": "wal",
"synchronous": "0",
"wal_autocheckpoint": "0"
}
},
"ro_dsn": "file:/var/lib/rqlite/db.sqlite?_fk=true\u0026_journal=WAL\u0026_sync=0\u0026mode=ro",
"rw_dsn": "file:/var/lib/rqlite/db.sqlite?_fk=true\u0026_journal=WAL\u0026_sync=0",
"size": 1054515200,
"version": "3.45.1",
"wal_size": 7560076432
},
"trailing_logs": 10240
}
Debug vars:
{
"cluster": {"num_backup_req": 7, "num_client_execute_retries": 4134, "num_client_load_retries": 0, "num_client_query_retries": 36, "num_client_request_retries": 0, "num_client_retries": 4170, "num_execute_req": 2159778, "num_get_node_api_req": 750917, "num_get_node_api_req_local": 492824, "num_get_node_api_req_retries": 0, "num_get_node_api_resp": 750917, "num_join_req": 5, "num_load_req": 0, "num_notify_req": 0, "num_query_req": 88661, "num_remove_node_req": 0, "num_request_req": 0},
"cmdline": ["/usr/local/sbin/rqlited","-fk","-node-id","1","-http-addr","127.0.0.1:4001","-raft-addr","1.1.1.1:4002","-raft-log-level=INFO","-bootstrap-expect","3","-join","1.1.1.1:4002,3.3.3.3:4002,2.2.2.2:4002","/var/lib/rqlite"],
"db": {"backup_sleeps": 0, "backup_step_dones": 0, "backup_step_errors": 0, "checkpoint_duration_ms": 2, "checkpoint_errors": 0, "checkpointed_moves": 0, "checkpointed_pages": 0, "checkpoints": 1462, "execute_transactions": 0, "execution_errors": 60472, "executions": 3088600, "executions_force_queries": 0, "open_duration_ms": 4, "queries": 173430, "query_errors": 1, "query_transactions": 0, "request_transactions": 0, "requests": 0},
"downloader": {"download_bytes": 0, "num_downloads_fail": 0, "num_downloads_ok": 0},
"http": {"authFail": 0, "authOK": 1418481, "backups": 0, "boot": 0, "execute_stmts_rx": 829548, "executions": 829548, "leader_not_found": 3, "loads": 0, "loads_aborted": 0, "num_readyz": 524362, "num_status": 2, "queries": 64568, "query_stmts_rx": 64568, "queued_executions": 0, "queued_executions_failed": 0, "queued_executions_leadership_lost": 0, "queued_executions_no_leader": 0, "queued_executions_not_leader": 0, "queued_executions_num_stmts_rx": 0, "queued_executions_num_stmts_tx": 0, "queued_executions_ok": 0, "queued_executions_unknown_error": 0, "queued_executions_wait": 0, "remote_backups": 0, "remote_executions": 49432, "remote_executions_failed": 6562, "remote_loads": 0, "remote_queries": 5293, "remote_queries_failed": 45, "remote_remove_node": 0, "remote_requests": 0, "remote_requests_failed": 0, "request_stmts_rx": 0, "requests": 0, "user_snapshots": 0},
"memstats": {"Alloc":15426976,"TotalAlloc":319772522648,"Sys":746533288,"Lookups":0,"Mallocs":1721016498,"Frees":1720978723,"HeapAlloc":15426976,"HeapSys":719429632,"HeapIdle":696197120,"HeapInuse":23232512,"HeapReleased":684335104,"HeapObjects":37775,"StackInuse":1966080,"StackSys":1966080,"MSpanInuse":366080,"MSpanSys":10444800,"MCacheInuse":14400,"MCacheSys":15600,"BuckHashSys":2252499,"GCSys":8918136,"OtherSys":3506541,"NextGC":31553752,"LastGC":1719193306814080093,"PauseTotalNs":3901939335,"PauseNs":[44497,79612,76982,75744,71860,127897,68394,126137,56357,82675,77853,94376,96358,81800,82051,132069,79316,68184,78872,102131,202452,81690,38442,110598,84296,69430,57239,128375,105849,120352,121466,67957,110078,73119,89165,58349,88102,66106,85245,61330,62633,113988,66916,65902,101028,65382,80362,98741,110551,95438,28572,92591,79433,70606,82791,78395,64372,99653,122090,51245,87882,62973,78797,105746,118696,97282,60684,90204,59849,70881,55769,77348,57206,108557,101244,102391,37656,101548,108117,122363,90667,103843,41884,64419,75561,54182,117603,99227,59073,99706,76583,79707,106037,59599,35833,47858,69201,79058,78230,83046,151438,48215,96451,99440,29245,46852,58518,36449,30395,52520,59197,37772,56280,82936,72773,139054,42555,97358,105214,1358665,28105,109153,1331007,67906,242369,50004,114193,127781,126919,132889,112620,79043,50948,50965,62197,146340,85154,121992,188081,197428,50470,52942,76460,125601,126074,91435,143682,126831,201990,119095,194680,76891,51457,96480,94094,133353,74802,123098,197189,38705,82327,72754,85070,62172,60296,46907,126346,67889,66350,57159,74289,39102,58428,59104,81056,65637,68261,55198,86822,73439,56037,141477,108490,43584,74231,68328,153679,30950,108590,76145,61418,75305,5155590,93598,31394,61464,59112,72294,36003,99272,59824,147629,59753,41863,72548,94970,69225,66418,67947,66655,48703,87376,127878,94173,85571,48730,81566,59862,79825,47496,64988,91539,125172,61245,101108,102770,66674,78294,112491,78269,91010,88633,173717,89380,67411,69001,125303,80475,83537,70464,94054,149459,65470,64862,66675,61267,173818,160446,102906,150791,84027,92970,90739,82240,68199,106132],"PauseEnd":[1719193202423207947,1719193222482547611,1719193250690371557,1719193278872957274,1719193306814080093,1719188424145023617,1719188450831779788,1719188488547147953,1719188516612017321,1719188544137121405,1719188582671726357,1719188623186819701,1719188653366385069,1719188688519932906,1719188711780678097,1719188747482737415,1719188770997816506,1719188804730306111,1719188836918769217,1719188873725718831,1719188901011388517,1719188933184142561,1719188957391909705,1719188989949591118,1719189003898069396,1719189033135224936,1719189065033519565,1719189104314720260,1719189125939131638,1719189149525713152,1719189179315694636,1719189201503386041,1719189220807449922,1719189242862731006,1719189244083090576,1719189245606994659,1719189268796034014,1719189297959445046,1719189313793181831,1719189339235646353,1719189371049962526,1719189396125903938,1719189427718947023,1719189459145366189,1719189483003990705,1719189486510283327,1719189489631039408,1719189512377098962,1719189545967518303,1719189584098210996,1719189602491250417,1719189622610430963,1719189655708165944,1719189689330054989,1719189729489689925,1719189770316249444,1719189802999721526,1719189839764488474,1719189869939669647,1719189901689549620,1719189925103478995,1719189950226058574,1719189978435026330,1719190009011681008,1719190031162750877,1719190046777840361,1719190065960108892,1719190091965017470,1719190119332238070,1719190154714088087,1719190186492595689,1719190203407933387,1719190229333303813,1719190262108219632,1719190288017799639,1719190307648322991,1719190327735452929,1719190355738365015,1719190381369299670,1719190403509326527,1719190420586732683,1719190454681975794,1719190488831840112,1719190503805970885,1719190542025207720,1719190566239923991,1719190578238326174,1719190603303881484,1719190629659165465,1719190661168032097,1719190693207003481,1719190723355056216,1719190751383089252,1719190778456516873,1719190802378868904,1719190827868472737,1719190836355963960,1719190845054511427,1719190855845919405,1719190861791834872,1719190862225522732,1719190862616975077,1719190863073146633,1719190863497727475,1719190863900433375,1719190864288749240,1719190864627651796,1719190864962582200,1719190865198185183,1719190865458749643,1719190865677994205,1719190865922945494,1719190866208331023,1719190866478783427,1719190866802314644,1719190867063646675,1719190867369184776,1719190867664923549,1719190867910555932,1719190868275317733,1719190868608222883,1719190868967062635,1719190869249301071,1719190869633354802,1719190870096404200,1719190870516733060,1719190870961206697,1719190871341478484,1719190871828661510,1719190872197862136,1719190872658689746,1719190873170002833,1719190873710419369,1719190874172371046,1719190874845794934,1719190875413863564,1719190875915681615,1719190876534228310,1719190877377621180,1719190878065098640,1719190878897027191,1719190879528851227,1719190880195582173,1719190880808468670,1719190881533102284,1719190882213927054,1719190883080958955,1719190883899798744,1719190884599907557,1719190885383451112,1719190886362115616,1719190887258794355,1719190888157700768,1719190889170469987,1719190890274804410,1719190891395969953,1719190892688453179,1719190894249423070,1719190896153330079,1719190898468291713,1719190900270419828,1719190902632848037,1719190905250345439,1719190908637518009,1719190913590177390,1719190922297553423,1719190943071663995,1719190966606905028,1719191000781416856,1719191024546902657,1719191053480691914,1719191082213452916,1719191102137725912,1719191122466344470,1719191151374443434,1719191183759033136,1719191214816381062,1719191247998827359,1719191273356304810,1719191300665854810,1719191322822882285,1719191351517259071,1719191384833942133,1719191402500419267,1719191427693991092,1719191453612010470,1719191482307245229,1719191512739531846,1719191539144424213,1719191572012054991,1719191602429377910,1719191638395915698,1719191663563019147,1719191694044745036,1719191702400142294,1719191721002799723,1719191740928846884,1719191761055259949,1719191789183073962,1719191816618797286,1719191841183035927,1719191868360028808,1719191887888763273,1719191912255943944,1719191938764933608,1719191959744367577,1719191984978946743,1719192002400841305,1719192022797309586,1719192048445567345,1719192074451544146,1719192102464758695,1719192127977158172,1719192153869274501,1719192176533248360,1719192191535845655,1719192207055250046,1719192226862696698,1719192258105070159,1719192279285833426,1719192301154176117,1719192323397977127,1719192356954161373,1719192387675319005,1719192416964573345,1719192450360752968,1719192486867523523,1719192515715012785,1719192540778528328,1719192566848869738,1719192599259126968,1719192613173907838,1719192640250347872,1719192663436937187,1719192687603459396,1719192715898486171,1719192738855734546,1719192766671764719,1719192796273435168,1719192828412989902,1719192842422747912,1719192843307416589,1719192848830766380,1719192868123391783,1719192888135057260,1719192902012879184,1719192915545381336,1719192944903216260,1719192977651376583,1719193011472262481,1719193046285778444,1719193074749577334,1719193099025831186,1719193127473458138,1719193159770215046,1719193185570589399],"NumGC":48133,"NumForcedGC":0,"GCCPUFraction":0.0000658545687376553,"EnableGC":true,"DebugGC":false,"BySize":[{"Size":0,"Mallocs":0,"Frees":0},{"Size":8,"Mallocs":56895295,"Frees":56893162},{"Size":16,"Mallocs":352727755,"Frees":352718395},{"Size":24,"Mallocs":225128212,"Frees":225126739},{"Size":32,"Mallocs":119810058,"Frees":119809274},{"Size":48,"Mallocs":156699640,"Frees":156696597},{"Size":64,"Mallocs":122940813,"Frees":122939815},{"Size":80,"Mallocs":227328596,"Frees":227325982},{"Size":96,"Mallocs":112253750,"Frees":112251877},{"Size":112,"Mallocs":21768358,"Frees":21768182},{"Size":128,"Mallocs":18293936,"Frees":18293488},{"Size":144,"Mallocs":37468697,"Frees":37458947},{"Size":160,"Mallocs":2149379,"Frees":2149184},{"Size":176,"Mallocs":3793835,"Frees":3793113},{"Size":192,"Mallocs":8919805,"Frees":8919679},{"Size":208,"Mallocs":4892472,"Frees":4892172},{"Size":224,"Mallocs":7515902,"Frees":7515766},{"Size":240,"Mallocs":997913,"Frees":997822},{"Size":256,"Mallocs":1727077,"Frees":1727011},{"Size":288,"Mallocs":5283429,"Frees":5282803},{"Size":320,"Mallocs":4385013,"Frees":4384814},{"Size":352,"Mallocs":6820437,"Frees":6820303},{"Size":384,"Mallocs":1599952,"Frees":1599883},{"Size":416,"Mallocs":572503,"Frees":572475},{"Size":448,"Mallocs":986750,"Frees":985117},{"Size":480,"Mallocs":1287615,"Frees":1287610},{"Size":512,"Mallocs":25398617,"Frees":25398569},{"Size":576,"Mallocs":1959110,"Frees":1959050},{"Size":640,"Mallocs":1715428,"Frees":1715374},{"Size":704,"Mallocs":2774583,"Frees":2774561},{"Size":768,"Mallocs":891436,"Frees":891417},{"Size":896,"Mallocs":1328672,"Frees":1328616},{"Size":1024,"Mallocs":1436175,"Frees":1436145},{"Size":1152,"Mallocs":577400,"Frees":577302},{"Size":1280,"Mallocs":543995,"Frees":543979},{"Size":1408,"Mallocs":471928,"Frees":471914},{"Size":1536,"Mallocs":977943,"Frees":977939},{"Size":1792,"Mallocs":1162475,"Frees":1162441},{"Size":2048,"Mallocs":541948,"Frees":541917},{"Size":2304,"Mallocs":517217,"Frees":517082},{"Size":2688,"Mallocs":388411,"Frees":388408},{"Size":3072,"Mallocs":216081,"Frees":216075},{"Size":3200,"Mallocs":34258,"Frees":34258},{"Size":3456,"Mallocs":35841,"Frees":35839},{"Size":4096,"Mallocs":4269103,"Frees":4269056},{"Size":4864,"Mallocs":79960,"Frees":79936},{"Size":5376,"Mallocs":17501,"Frees":17495},{"Size":6144,"Mallocs":471530,"Frees":471526},{"Size":6528,"Mallocs":273759,"Frees":273759},{"Size":6784,"Mallocs":181465,"Frees":181465},{"Size":6912,"Mallocs":89515,"Frees":89515},{"Size":8192,"Mallocs":1062102,"Frees":1062081},{"Size":9472,"Mallocs":888297,"Frees":888283},{"Size":9728,"Mallocs":178833,"Frees":178833},{"Size":10240,"Mallocs":356088,"Frees":356088},{"Size":10880,"Mallocs":448192,"Frees":448191},{"Size":12288,"Mallocs":155674,"Frees":155671},{"Size":13568,"Mallocs":116593,"Frees":116593},{"Size":14336,"Mallocs":79685,"Frees":79685},{"Size":16384,"Mallocs":427149,"Frees":427145},{"Size":18432,"Mallocs":266757,"Frees":266754}]},
"mux": {"num_connections_handled": 1400, "num_unregistered_handlers": 18},
"proto": {"num_compressed_bytes": 5274, "num_compressed_requests": 6, "num_compression_misses": 0, "num_precompressed_bytes": 1005634538, "num_requests": 3054788, "num_uncompressed_bytes": 1005621497, "num_uncompressed_requests": 3054782},
"queue": {"num_flush": 0, "num_timeout": 0, "objects_rx": 0, "objects_tx": 0},
"snapshot": {"latest_persist_duration": 1, "latest_persist_size": 1322552, "snapshot_create_mrsw_fail": 3, "snapshot_open_mrsw_fail": 0, "snapshots_reaped": 1459, "snapshots_reaped_failed": 0, "upgrade_fail": 0, "upgrade_ok": 0},
"store": {"auto_vacuum_duration": 0, "failed_heartbeat_observed": 148, "leader_changes_dropped": 0, "leader_changes_observed": 0, "nodes_reaped_failed": 0, "nodes_reaped_ok": 0, "num_auto_restores": 0, "num_auto_restores_failed": 0, "num_auto_restores_skipped": 0, "num_auto_vacuums": 0, "num_auto_vacuums_failed": 0, "num_backups": 5, "num_boots": 0, "num_compressed_commands": 6, "num_db_stats_errors": 0, "num_full_checkpoint_failed": 0, "num_ignored_joins": 3, "num_joins": 2, "num_loads": 0, "num_provider_checks": 0, "num_provider_provides": 0, "num_provider_provides_fail": 0, "num_recoveries": 0, "num_removed_before_joins": 0, "num_restores": 1, "num_restores_failed": 0, "num_snapshots": 1462, "num_snapshots_failed": 16, "num_snapshots_full": 0, "num_snapshots_incremental": 1462, "num_uncompressed_commands": 3054782, "num_user_snapshots": 1453, "num_user_snapshots_failed": 4675, "num_wal_checkpoint_truncate_failed": 0, "num_wal_snapshots": 6121, "num_wal_snapshots_failed": 4674, "snapshot_create_chk_truncate_duration": 2, "snapshot_create_duration": 10, "snapshot_create_wal_compact_duration": 7, "snapshot_persist_duration": 1, "snapshot_precompact_wal_size": 2303112, "snapshot_wal_size": 1322552},
"uploader": {"last_upload_bytes": 0, "num_sum_get_fail": 0, "num_uploads_fail": 0, "num_uploads_ok": 0, "num_uploads_skipped": 0, "num_uploads_skipped_id": 0, "total_upload_bytes": 0},
"wal": {"compact_index_duration_ms": 1, "compact_load_duration_ms": 5, "compact_load_page_count": 321}
}
EDIT: removed the bit about VACUUM, as my logs show the issue started prior to running the VACUUM.
Here's the log snippet where the CAS error first appears:
.. many successful snapshots prior to this
Jun 23 14:20:18 web1 rqlited[2295804]: [store] 2024/06/23 14:20:18 incremental snapshot created in 15.37676ms on node ID 1
Jun 23 14:20:18 web1 rqlited[2295804]: [raft] 2024/06/23 14:20:18 [INFO] starting snapshot up to: index=8535919
Jun 23 14:20:18 web1 rqlited[2295804]: [store] 2024/06/23 14:20:18 persisted snapshot 647-8535919-1719123618399 in 2.443578ms
Jun 23 14:20:18 web1 rqlited[2295804]: [snapshot-store] 2024/06/23 14:20:18 reaped snapshot 647-8534307-1719123306654
Jun 23 14:20:18 web1 rqlited[2295804]: [raft] 2024/06/23 14:20:18 [INFO] compacting logs: from=8524068 to=8525679
Jun 23 14:20:18 web1 rqlited[2295804]: [raft] 2024/06/23 14:20:18 [INFO] snapshot complete up to: index=8535919
Jun 23 14:25:01 web1 rqlited[2295804]: [store] 2024/06/23 14:25:01 incremental snapshot created in 10.117626ms on node ID 1
Jun 23 14:25:01 web1 rqlited[2295804]: [raft] 2024/06/23 14:25:01 [INFO] starting snapshot up to: index=8537249
Jun 23 14:25:01 web1 rqlited[2295804]: [store] 2024/06/23 14:25:01 persisted snapshot 647-8537249-1719123901479 in 1.796491ms
Jun 23 14:25:01 web1 rqlited[2295804]: [snapshot-store] 2024/06/23 14:25:01 reaped snapshot 647-8535919-1719123618399
Jun 23 14:25:01 web1 rqlited[2295804]: [raft] 2024/06/23 14:25:01 [INFO] compacting logs: from=8525680 to=8527009
Jun 23 14:25:01 web1 rqlited[2295804]: [raft] 2024/06/23 14:25:01 [INFO] snapshot complete up to: index=8537249
Jun 23 14:30:12 web1 rqlited[2295804]: [raft] 2024/06/23 14:30:12 [ERROR] failed to take snapshot: error="failed to start snapshot: CAS conflict"
Jun 23 14:30:12 web1 rqlited[2295804]: [store] 2024/06/23 14:30:12 failed to generate user-requested snapshot: failed to start snapshot: CAS conflict
Jun 23 14:30:12 web1 rqlited[2295804]: [store] 2024/06/23 14:30:12 failed to snapshot due to WAL threshold: failed to start snapshot: CAS conflict
Jun 23 14:30:27 web1 rqlited[2295804]: [raft] 2024/06/23 14:30:27 [ERROR] failed to take snapshot: error="failed to start snapshot: CAS conflict"
Jun 23 14:30:27 web1 rqlited[2295804]: [store] 2024/06/23 14:30:27 failed to generate user-requested snapshot: failed to start snapshot: CAS conflict
Jun 23 14:30:27 web1 rqlited[2295804]: [store] 2024/06/23 14:30:27 failed to snapshot due to WAL threshold: failed to start snapshot: CAS conflict
Jun 23 14:30:41 web1 rqlited[2295804]: [raft] 2024/06/23 14:30:41 [ERROR] failed to take snapshot: error="failed to start snapshot: CAS conflict"
Jun 23 14:30:41 web1 rqlited[2295804]: [store] 2024/06/23 14:30:41 failed to generate user-requested snapshot: failed to start snapshot: CAS conflict
Jun 23 14:30:41 web1 rqlited[2295804]: [store] 2024/06/23 14:30:41 failed to snapshot due to WAL threshold: failed to start snapshot: CAS conflict
Jun 23 14:30:56 web1 rqlited[2295804]: [raft] 2024/06/23 14:30:56 [ERROR] failed to take snapshot: error="failed to start snapshot: CAS conflict"
Jun 23 14:30:56 web1 rqlited[2295804]: [store] 2024/06/23 14:30:56 failed to generate user-requested snapshot: failed to start snapshot: CAS conflict
Jun 23 14:30:56 web1 rqlited[2295804]: [store] 2024/06/23 14:30:56 failed to snapshot due to WAL threshold: failed to start snapshot: CAS conflict
.. many more CAS errors following this
Also: today's backup - made during CAS errors - failed. The 3 days before that were fine.
After restart rqlited uses 100% CPU and the WAL has grown to 4 times the actual database size. Will let it go a little bit longer, but looks like I'll need to kill it and do a fresh start on this node.
When it gets into this state (CAS conflict) need the output of /status and /debug/vars, as I requested previously.
Sent from Yahoo Mail for iPhone
On Sunday, June 23, 2024, 10:26 PM, donnyxray @.***> wrote:
After restart rqlited uses 100% CPU and the WAL has grown to 4 times the actual database size. Will let it go a little bit longer, but looks like I'll need to kill it and do a fresh start on this node.
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>
Ah, I see you included it. Thanks.
Sent from Yahoo Mail for iPhone
On Saturday, June 29, 2024, 9:30 AM, Philip O'Toole @.***> wrote:
When it gets into this state (CAS conflict) need the output of /status and /debug/vars, as I requested previously.
Sent from Yahoo Mail for iPhone
On Sunday, June 23, 2024, 10:26 PM, donnyxray @.***> wrote:
After restart rqlited uses 100% CPU and the WAL has grown to 4 times the actual database size. Will let it go a little bit longer, but looks like I'll need to kill it and do a fresh start on this node.
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>
It looks like the backup is holding the CAS lock. Are you monitoring your backups and logging how long each takes? Precisely when and how are you doing VACUUM?
Sent from Yahoo Mail for iPhone
On Saturday, June 29, 2024, 9:33 AM, Philip O'Toole @.***> wrote:
Ah, I see you included it. Thanks.
Sent from Yahoo Mail for iPhone
On Saturday, June 29, 2024, 9:30 AM, Philip O'Toole @.***> wrote:
When it gets into this state (CAS conflict) need the output of /status and /debug/vars, as I requested previously.
Sent from Yahoo Mail for iPhone
On Sunday, June 23, 2024, 10:26 PM, donnyxray @.***> wrote:
After restart rqlited uses 100% CPU and the WAL has grown to 4 times the actual database size. Will let it go a little bit longer, but looks like I'll need to kill it and do a fresh start on this node.
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>
The VACUUM was a one off, issued manually using the rqlite cli. The CAS issue started several hours before that - I just didn't notice it until after I did the VACUUM and checked the logs.
After I recovered from this CAS error all backups have been verified correct and completed in mere seconds. The CAS issue has not yet returned. (I must note that I have since reduced the database size by moving some larger tables to a separate per-node plain sqlite database).
Right, but the information you provided implies that some backup took a lot longer than seconds. It’s the backups BEFORE the CAS conflicts starts that concern us. Alternatively a smoking gun for a bug would be the following:
On Saturday, June 29, 2024, 9:45 AM, donnyxray @.***> wrote:
The VACUUM was a one off, issued manually using the rqlite cli. The CAS issue started several hours before that - I just didn't notice it until after I did the VACUUM and checked the logs.
After I recovered from this CAS error all backups have been verified correct and completed in mere seconds. The CAS issue has not yet returned. (I must note that I have since reduced the database size by moving some larger tables to a separate per-node plain sqlite database).
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>
@donnyxray -- if you could upgrade to v8.26.3 we might get more insight into what is going on. Builds should be available within an hour or so (of this message).
Every few days the leader in my 3 node cluster ends up with this error:
I have a hard time figuring out what this means or how to resolve it. (I currently delete the database and have it restore from a different node as a work-around).
The WAL is fairly large when this happens: