Ziedelth opened 3 days ago
Another crash:
{"time":"2024-06-27T15:32:27.471176779Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/14-9651-1719502347471.tmp"}
2024-06-27T15:32:27.674Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T15:35:00.319Z [INFO] raft: starting snapshot up to: index=9657
{"time":"2024-06-27T15:35:00.320042783Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/14-9657-1719502500320.tmp"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0xec2544]
goroutine 111 [running]:
github.com/vinceanalytics/vince/internal/lsm.(*PartStore).Compact.func1()
/home/runner/work/vince/vince/internal/lsm/part.go:114 +0xa4
github.com/vinceanalytics/vince/internal/lsm.(*PartStore).Compact(0xc007544200, 0xc0071dba10)
/home/runner/work/vince/vince/internal/lsm/part.go:146 +0x2e5
github.com/vinceanalytics/vince/internal/lsm.(*Tree).Compact(0xc007176300, 0xc00d745008)
/home/runner/work/vince/vince/internal/lsm/lsm.go:268 +0xd6
github.com/vinceanalytics/vince/internal/session.(*Session).Persist(0xc0072fa0f0, 0xc00d745008)
/home/runner/work/vince/vince/internal/session/session.go:101 +0x30
github.com/vinceanalytics/vince/internal/cluster/snapshots.(*Arrow).Backup(0xc00d64f0b0, {0x76bb44ea0820, 0xc00009ba40})
/home/runner/work/vince/vince/internal/cluster/snapshots/lsm.go:38 +0x13e
github.com/vinceanalytics/vince/internal/cluster/snapshots.(*Arrow).Persist(0xc00d640570?, {0x1d4d390?, 0xc00009ba40?})
/home/runner/work/vince/vince/internal/cluster/snapshots/lsm.go:27 +0x3e
github.com/hashicorp/raft.(*Raft).takeSnapshot(0xc00714c2c8)
/home/runner/go/pkg/mod/github.com/hashicorp/raft@v1.6.1/snapshot.go:190 +0x8c8
github.com/hashicorp/raft.(*Raft).runSnapshots(0xc00714c2c8)
/home/runner/go/pkg/mod/github.com/hashicorp/raft@v1.6.1/snapshot.go:82 +0x3ee
github.com/hashicorp/raft.(*raftState).goFunc.func1()
/home/runner/go/pkg/mod/github.com/hashicorp/raft@v1.6.1/state.go:149 +0x56
created by github.com/hashicorp/raft.(*raftState).goFunc in goroutine 1
/home/runner/go/pkg/mod/github.com/hashicorp/raft@v1.6.1/state.go:147 +0x79
I don't know if it's the same error as previously or not.
Hi, it seems there was a bug during compaction that caused things to spiral.
{"time":"2024-06-27T03:52:12.085446794Z","level":"INFO","msg":"exiting compaction loop","component":"lsm-tree"} panic: arrow/array: field 15 has 18410 rows. want=18412
This message signals that. It probably means a corrupt snapshot was saved. One remedy is to manually delete the offending snapshot; deleting /app/data/snapshots/snapshots/13-9212-1719460105736.tmp might help.
I am still working on a better storage backend for vince using roaring bitmaps. All of these issues will go away with it.
I deleted all the snapshots, but the problem seems to come back quite quickly.
{"time":"2024-06-27T16:57:46.228615155Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/15-9731-1719507466228.tmp"}
2024-06-27T16:57:46.434Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
{"time":"2024-06-27T16:59:02.739344349Z","level":"INFO","msg":"Exiting flushing loop","component":"session"}
{"time":"2024-06-27T16:59:02.739401442Z","level":"INFO","msg":"exiting compaction loop","component":"lsm-tree"}
{"time":"2024-06-27T16:59:02.800452698Z","level":"INFO","msg":"store closed ","component":"store","nodeId":"1","listen_address":"localhost:4002"}
{"time":"2024-06-27T16:59:02.800512641Z","level":"ERROR","msg":"Exited process","err":"http: Server closed"}
{"time":"2024-06-27T16:59:03.599825893Z","level":"INFO","msg":"Opening store","component":"store","nodeId":"1","listening":"localhost:4002"}
{"time":"2024-06-27T16:59:03.608718775Z","level":"INFO","msg":"Setup log info","component":"store","firstIdxOnOpen":1,"lastIdxOnOpen":9731,"lastAppliedIdxOnOpen":0,"lastCommandIdxOnOpen":0}
2024-06-27T16:59:03.724Z [INFO] raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:1 Address:localhost:4002}]"
2024-06-27T16:59:03.724Z [INFO] raft: entering follower state: follower="Node at localhost:4002 [Follower]" leader-address= leader-id=
{"time":"2024-06-27T16:59:03.725361508Z","level":"INFO","msg":"starting server","addr":":8080"}
{"time":"2024-06-27T16:59:03.725644953Z","level":"INFO","msg":"Start compaction loop","component":"lsm-tree","interval":"10m0s","compactSize":"16MiB"}
{"time":"2024-06-27T16:59:03.725798869Z","level":"INFO","msg":"Starting session flushing loop","component":"session","interval":"1m0s"}
{"time":"2024-06-27T16:59:03.725952546Z","level":"INFO","msg":"Starting grpc server","addr":"127.0.0.1:4002"}
2024-06-27T16:59:05.217Z [WARN] raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2024-06-27T16:59:05.217Z [INFO] raft: entering candidate state: node="Node at localhost:4002 [Candidate]" term=16
2024-06-27T16:59:05.226Z [DEBUG] raft: voting for self: term=16 id=1
2024-06-27T16:59:05.244Z [DEBUG] raft: calculated votes needed: needed=1 term=16
2024-06-27T16:59:05.244Z [DEBUG] raft: vote granted: from=1 term=16 tally=1
2024-06-27T16:59:05.244Z [INFO] raft: election won: term=16 tally=1
2024-06-27T16:59:05.244Z [INFO] raft: entering leader state: leader="Node at localhost:4002 [Leader]"
{"time":"2024-06-27T16:59:05.256401611Z","level":"INFO","msg":"first log applied since node start, log","component":"store","index":3}
2024-06-27T17:01:20.799Z [INFO] raft: starting snapshot up to: index=9734
{"time":"2024-06-27T17:01:20.79966202Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9734-1719507680799.tmp"}
2024-06-27T17:01:21.014Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:03:36.078Z [INFO] raft: starting snapshot up to: index=9736
{"time":"2024-06-27T17:03:36.078771019Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9736-1719507816078.tmp"}
2024-06-27T17:03:36.285Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:06:43.516Z [INFO] raft: starting snapshot up to: index=9737
{"time":"2024-06-27T17:06:43.516679031Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9737-1719508003516.tmp"}
2024-06-27T17:06:43.729Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:10:04.887Z [INFO] raft: starting snapshot up to: index=9739
{"time":"2024-06-27T17:10:04.887940415Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9739-1719508204887.tmp"}
2024-06-27T17:10:05.098Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:12:07.690Z [INFO] raft: starting snapshot up to: index=9742
{"time":"2024-06-27T17:12:07.690333977Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9742-1719508327690.tmp"}
2024-06-27T17:12:07.892Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:14:21.748Z [INFO] raft: starting snapshot up to: index=9743
{"time":"2024-06-27T17:14:21.749010333Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9743-1719508461748.tmp"}
2024-06-27T17:14:21.955Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
The crash still occurs, even after all snapshots have been deleted.
I am investigating this and working on a possible fix. I am also looking into the timeseries numbers issue.
The new storage I mentioned I was working on (based on roaring bitmaps) is ready. I will clean up and implement a migration path to the new backend; you will not lose any data.
I want the migration to be automatic, so you won't need to do anything. I hope to have a release candidate ready sometime next week.
The new storage is a very big upgrade: you will save a lot of storage cost and resources, and it will ensure correct and consistent results.
@Ziedelth what is the size of your data/raftdb file?
$ ls -lash data/
total 6.5M
4.0K drwxr-xr-x 3 root root 4.0K Jun 30 10:32 .
4.0K drwxrwxr-x 3 ziedelth ziedelth 4.0K Mar 13 13:45 ..
6.5M -rw------- 1 root root 8.0M Jun 30 10:31 raftdb
4.0K drwxr-xr-x 3 root root 4.0K Jun 27 18:59 snapshots
0 -rw------- 1 root root 0 Jun 29 02:49 vince-snapshot-228007530
0 -rw------- 1 root root 0 Jun 29 10:39 vince-snapshot-2757039809
0 -rw------- 1 root root 0 Jun 27 05:52 vince-snapshot-532277653
Hi there! I've just seen that the Docker container has crashed. Here is the error I get when I run docker logs: