Docker container crash - Githubissues

Ziedelth commented 3 days ago

Hi there! I've just seen that the docker container has crashed. Here is the error I got when I do a docker logs :

{"time":"2024-06-27T03:43:28.210460272Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/13-9206-1719459808210.tmp"}
2024-06-27T03:43:28.404Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T03:46:14.454Z [INFO]  raft: starting snapshot up to: index=9210
{"time":"2024-06-27T03:46:14.454634112Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/13-9210-1719459974454.tmp"}
2024-06-27T03:46:14.646Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T03:48:25.736Z [INFO]  raft: starting snapshot up to: index=9212
{"time":"2024-06-27T03:48:25.736896922Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/13-9212-1719460105736.tmp"}
2024-06-27T03:48:25.906Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T03:52:12.032Z [INFO]  raft: starting snapshot up to: index=9219
{"time":"2024-06-27T03:52:12.032933328Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/13-9219-1719460332032.tmp"}
{"time":"2024-06-27T03:52:12.085446794Z","level":"INFO","msg":"exiting compaction loop","component":"lsm-tree"}
panic: arrow/array: field 15 has 18410 rows. want=18412
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0xec2544]

goroutine 105 [running]:
github.com/vinceanalytics/vince/internal/lsm.(*PartStore).Compact.func1()
    /home/runner/work/vince/vince/internal/lsm/part.go:114 +0xa4
panic({0x18a4600?, 0xc01402a550?})
    /opt/hostedtoolcache/go/1.22.0/x64/src/runtime/panic.go:770 +0x132
github.com/apache/arrow/go/v15/arrow/array.(*RecordBuilder).NewRecord(0xc007530180)
    /home/runner/go/pkg/mod/github.com/apache/arrow/go/v15@v15.0.0/arrow/array/record.go:349 +0x28e
github.com/vinceanalytics/vince/internal/staples.(*Merger).NewRecord(0xc0075565e8?, {{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}})
    /home/runner/work/vince/vince/internal/staples/arrow.go:176 +0x67
github.com/vinceanalytics/vince/internal/lsm.(*PartStore).Compact(0xc0075301c0, 0xc00d579e58)
    /home/runner/work/vince/vince/internal/lsm/part.go:137 +0x145
github.com/vinceanalytics/vince/internal/lsm.(*Tree).Compact(0xc0000ea100, 0x0)
    /home/runner/work/vince/vince/internal/lsm/lsm.go:268 +0xd6
github.com/vinceanalytics/vince/internal/lsm.(*Tree).Start(0xc0000ea100, {0x1d4d048, 0xc007530c80})
    /home/runner/work/vince/vince/internal/lsm/lsm.go:240 +0x288
created by github.com/vinceanalytics/vince/internal/session.(*Session).Start in goroutine 1
    /home/runner/work/vince/vince/internal/session/session.go:139 +0x85
{"time":"2024-06-27T11:54:59.758964586Z","level":"INFO","msg":"Opening store","component":"store","nodeId":"1","listening":"localhost:4002"}
{"time":"2024-06-27T11:54:59.773558502Z","level":"INFO","msg":"Setup log info","component":"store","firstIdxOnOpen":1,"lastIdxOnOpen":9219,"lastAppliedIdxOnOpen":0,"lastCommandIdxOnOpen":0}
{"time":"2024-06-27T11:54:59.782291961Z","level":"WARN","msg":"found temporary snapshot","component":"snapshot","name":"13-9219-1719460332032.tmp"}
2024-06-27T11:55:00.400Z [INFO]  raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:1 Address:localhost:4002}]"
{"time":"2024-06-27T11:55:00.40124848Z","level":"INFO","msg":"starting server","addr":":8080"}

Ziedelth commented 3 days ago

Another crash:

{"time":"2024-06-27T15:32:27.471176779Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/14-9651-1719502347471.tmp"}
2024-06-27T15:32:27.674Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T15:35:00.319Z [INFO]  raft: starting snapshot up to: index=9657
{"time":"2024-06-27T15:35:00.320042783Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/14-9657-1719502500320.tmp"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0xec2544]

goroutine 111 [running]:
github.com/vinceanalytics/vince/internal/lsm.(*PartStore).Compact.func1()
    /home/runner/work/vince/vince/internal/lsm/part.go:114 +0xa4
github.com/vinceanalytics/vince/internal/lsm.(*PartStore).Compact(0xc007544200, 0xc0071dba10)
    /home/runner/work/vince/vince/internal/lsm/part.go:146 +0x2e5
github.com/vinceanalytics/vince/internal/lsm.(*Tree).Compact(0xc007176300, 0xc00d745008)
    /home/runner/work/vince/vince/internal/lsm/lsm.go:268 +0xd6
github.com/vinceanalytics/vince/internal/session.(*Session).Persist(0xc0072fa0f0, 0xc00d745008)
    /home/runner/work/vince/vince/internal/session/session.go:101 +0x30
github.com/vinceanalytics/vince/internal/cluster/snapshots.(*Arrow).Backup(0xc00d64f0b0, {0x76bb44ea0820, 0xc00009ba40})
    /home/runner/work/vince/vince/internal/cluster/snapshots/lsm.go:38 +0x13e
github.com/vinceanalytics/vince/internal/cluster/snapshots.(*Arrow).Persist(0xc00d640570?, {0x1d4d390?, 0xc00009ba40?})
    /home/runner/work/vince/vince/internal/cluster/snapshots/lsm.go:27 +0x3e
github.com/hashicorp/raft.(*Raft).takeSnapshot(0xc00714c2c8)
    /home/runner/go/pkg/mod/github.com/hashicorp/raft@v1.6.1/snapshot.go:190 +0x8c8
github.com/hashicorp/raft.(*Raft).runSnapshots(0xc00714c2c8)
    /home/runner/go/pkg/mod/github.com/hashicorp/raft@v1.6.1/snapshot.go:82 +0x3ee
github.com/hashicorp/raft.(*raftState).goFunc.func1()
    /home/runner/go/pkg/mod/github.com/hashicorp/raft@v1.6.1/state.go:149 +0x56
created by github.com/hashicorp/raft.(*raftState).goFunc in goroutine 1
    /home/runner/go/pkg/mod/github.com/hashicorp/raft@v1.6.1/state.go:147 +0x79

I don't know if it's the same error has previously or not

gernest commented 3 days ago

Hi, It seems there was a bug during compaction causing things to spiral

{"time":"2024-06-27T03:52:12.085446794Z","level":"INFO","msg":"exiting compaction loop","component":"lsm-tree"} panic: arrow/array: field 15 has 18410 rows. want=18412

This message signals that. This probably means a corrupt snapshot was saved . One remedy is to manually delete the offending snapshot. Deleting /app/data/snapshots/snapshots/13-9212-1719460105736.tmp might help.

I am still working on a better storage for vince using roaring bitmaps. All of these issues will go away.

Ziedelth commented 3 days ago

I deleted all the snapshots, but the problem seems to come back quite quickly.

{"time":"2024-06-27T16:57:46.228615155Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/15-9731-1719507466228.tmp"}
2024-06-27T16:57:46.434Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
{"time":"2024-06-27T16:59:02.739344349Z","level":"INFO","msg":"Exiting flushing loop","component":"session"}
{"time":"2024-06-27T16:59:02.739401442Z","level":"INFO","msg":"exiting compaction loop","component":"lsm-tree"}
{"time":"2024-06-27T16:59:02.800452698Z","level":"INFO","msg":"store closed ","component":"store","nodeId":"1","listen_address":"localhost:4002"}
{"time":"2024-06-27T16:59:02.800512641Z","level":"ERROR","msg":"Exited process","err":"http: Server closed"}
{"time":"2024-06-27T16:59:03.599825893Z","level":"INFO","msg":"Opening store","component":"store","nodeId":"1","listening":"localhost:4002"}
{"time":"2024-06-27T16:59:03.608718775Z","level":"INFO","msg":"Setup log info","component":"store","firstIdxOnOpen":1,"lastIdxOnOpen":9731,"lastAppliedIdxOnOpen":0,"lastCommandIdxOnOpen":0}
2024-06-27T16:59:03.724Z [INFO]  raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:1 Address:localhost:4002}]"
2024-06-27T16:59:03.724Z [INFO]  raft: entering follower state: follower="Node at localhost:4002 [Follower]" leader-address= leader-id=
{"time":"2024-06-27T16:59:03.725361508Z","level":"INFO","msg":"starting server","addr":":8080"}
{"time":"2024-06-27T16:59:03.725644953Z","level":"INFO","msg":"Start compaction loop","component":"lsm-tree","interval":"10m0s","compactSize":"16MiB"}
{"time":"2024-06-27T16:59:03.725798869Z","level":"INFO","msg":"Starting session flushing loop","component":"session","interval":"1m0s"}
{"time":"2024-06-27T16:59:03.725952546Z","level":"INFO","msg":"Starting grpc server","addr":"127.0.0.1:4002"}
2024-06-27T16:59:05.217Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
2024-06-27T16:59:05.217Z [INFO]  raft: entering candidate state: node="Node at localhost:4002 [Candidate]" term=16
2024-06-27T16:59:05.226Z [DEBUG] raft: voting for self: term=16 id=1
2024-06-27T16:59:05.244Z [DEBUG] raft: calculated votes needed: needed=1 term=16
2024-06-27T16:59:05.244Z [DEBUG] raft: vote granted: from=1 term=16 tally=1
2024-06-27T16:59:05.244Z [INFO]  raft: election won: term=16 tally=1
2024-06-27T16:59:05.244Z [INFO]  raft: entering leader state: leader="Node at localhost:4002 [Leader]"
{"time":"2024-06-27T16:59:05.256401611Z","level":"INFO","msg":"first log applied since node start, log","component":"store","index":3}
2024-06-27T17:01:20.799Z [INFO]  raft: starting snapshot up to: index=9734
{"time":"2024-06-27T17:01:20.79966202Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9734-1719507680799.tmp"}
2024-06-27T17:01:21.014Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:03:36.078Z [INFO]  raft: starting snapshot up to: index=9736
{"time":"2024-06-27T17:03:36.078771019Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9736-1719507816078.tmp"}
2024-06-27T17:03:36.285Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:06:43.516Z [INFO]  raft: starting snapshot up to: index=9737
{"time":"2024-06-27T17:06:43.516679031Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9737-1719508003516.tmp"}
2024-06-27T17:06:43.729Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:10:04.887Z [INFO]  raft: starting snapshot up to: index=9739
{"time":"2024-06-27T17:10:04.887940415Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9739-1719508204887.tmp"}
2024-06-27T17:10:05.098Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:12:07.690Z [INFO]  raft: starting snapshot up to: index=9742
{"time":"2024-06-27T17:12:07.690333977Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9742-1719508327690.tmp"}
2024-06-27T17:12:07.892Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"
2024-06-27T17:14:21.748Z [INFO]  raft: starting snapshot up to: index=9743
{"time":"2024-06-27T17:14:21.749010333Z","level":"INFO","msg":"creating new snapshot","component":"snapshot","path":"/app/data/snapshots/snapshots/16-9743-1719508461748.tmp"}
2024-06-27T17:14:21.955Z [ERROR] raft: failed to take snapshot: error="failed to persist snapshot: arrow/ipc: tried to write record batch with different schema"

Ziedelth commented 2 days ago

The crash still occurs, even after all snapshots have been deleted.

gernest commented 2 days ago

I am investigating this and working on a possible fix. I am also looking into the timeseries numbers issue.

The new storage I mentioned I was working on (based on roaring bitmaps is ready). So I will cleanup and implement migration path to the new backend. You will not lose any data.

I want migration to be automatic so you won't need to do anything. I hope to have a release candidate ready sometime next week.

The new storage is very big upgrade, so you will save a lot of storage cost and resources. Also it will ensure we get correct and consistent results.

gernest commented 1 day ago

@Ziedelth what is the size of your data/raftdb file ?

Ziedelth commented 22 hours ago

$ ls -lash data/
total 6,5M
4,0K drwxr-xr-x 3 root     root     4,0K juin  30 10:32 .
4,0K drwxrwxr-x 3 ziedelth ziedelth 4,0K mars  13 13:45 ..
6,5M -rw------- 1 root     root     8,0M juin  30 10:31 raftdb
4,0K drwxr-xr-x 3 root     root     4,0K juin  27 18:59 snapshots
   0 -rw------- 1 root     root        0 juin  29 02:49 vince-snapshot-228007530
   0 -rw------- 1 root     root        0 juin  29 10:39 vince-snapshot-2757039809
   0 -rw------- 1 root     root        0 juin  27 05:52 vince-snapshot-532277653

vinceanalytics / vince

Docker container crash #10