Found an issue today where there is an abort of compaction due to the presence of some data corruption at the end of the history file. Here's a dump of the last records being read:
[2023-03-31T06:28:50.662Z INFO streambed_logged] record StorableRecord { version: 0, headers: [], timestamp: None, key: 99079191802151056, value: [...], offset: 940165 }, remaining 469
[2023-03-31T06:28:50.664Z INFO streambed_logged] record StorableRecord { version: 0, headers: [], timestamp: None, key: 99079191802151056, value: [...], offset: 30 }, remaining 86 <-- THAT'S A WEIRD OFFSET!!!
[2023-03-31T06:28:50.665Z ERROR streambed_logged::compaction] Could not age the active file/locate end offset. Aborting compaction. Err(IoError(Custom { kind: Other, error: "bytes remaining on stream" }))
The error occurs on the last records of the history file. There shouldn't be an erroneous number of bytes at the end (86). The bytes don't make sense either as they don't start with 00 00 00 (which we expect). Here are the actual bytes on disk for that last record that can be read, and then the remaining 86 bytes:
00010000 00 00 00 cb 80 80 80 80 80 80 38 ed 02 8c ea 7d |..........8....}|
00010010 cd e3 ce c9 7a 9d ce eb 2c cd 73 56 cc ec d9 d9 |....z...,.sV....|
00010020 ec 5e 84 6b 6a 3f 52 dc 59 af 41 8e 85 06 d7 d0 |.^.kj?R.Y.A.....|
00010030 51 88 08 32 e7 75 2f e6 81 0a f7 2c a3 0d ba 79 |Q..2.u/....,...y|
00010040 f7 ce 11 53 03 56 41 dd ff 2f 38 33 08 05 60 a9 |...S.VA../83..`.|
00010050 29 3a 7a 1a 8c a8 c8 94 7c 50 5d 00 26 43 7e 77 |):z.....|P].&C~w|
00010060 69 88 a1 ce 56 19 8e b3 f2 f2 03 1a 42 09 ea 59 |i...V.......B..Y|
00010070 b0 cf 89 8d 3d 28 62 60 c4 2c 77 1b e4 90 17 99 |....=(b`.,w.....|
00010080 58 9d 25 55 20 fa 77 5d 76 05 8a f0 2e 95 cf 3a |X.%U .w]v......:|
00010090 b6 6c 94 52 7a 8b df 0a 17 27 3a e3 24 e3 54 60 |.l.Rz....':.$.T`|
000100a0 0a a3 11 98 26 a7 bf eb 48 73 11 48 15 85 e7 22 |....&...Hs.H..."|
000100b0 e8 3c e8 10 2e 71 af ce d2 af 19 c4 cf 3c 47 a9 |.<...q.......<G.|
000100c0 0e bc 5d 42 03 9b 4e 1e 21 e9 f1 41 6e bc 4f 7d |..]B..N.!..An.O}|
000100d0 b1 ec a9 d5 f9 57 6d 37 12 cf 4c 1a 29 fd da c4 |.....Wm7..L.)...|
000100e0 a2 ee f5 1e 6d 6b 80 8b 76 da 6d 15 05 e1 bf 7a |....mk..v.m....z|
000100f0 68 f9 58 97 86 ba c1 4c 11 68 ac ba a4 44 83 bd |h.X....L.h...D..|
00010100 d0 83 7f 70 ce 56 bf b5 c5 1c 57 17 cc 30 8e 11 |...p.V....W..0..|
00010110 3f 7f 2d fd 16 70 0d 87 e6 77 13 37 6e 92 35 6d |?.-..p...w.7n.5m|
00010120 ec 18 9d b3 1e
fd 3d 15 85 06 cc 6a 29 25 5c 47 |......=....j)%\G|
00010130 79 f2 3b be 05 2f b1 43 6d 60 4b 6b e5 46 32 c6 |y.;../.Cm`Kk.F2.|
00010140 5f 2f 5c 61 17 45 77 a0 60 5e 86 58 3a e8 4d ac |_/\a.Ew.`^.X:.M.|
00010150 76 64 30 ff 7e 8e b5 75 7e 4a ce 43 9d 4b d3 ac |vd0.~..u~J.C.K..|
00010160 2d 9d e6 98 49 f0 03 e6 52 44 4e fe 89 7e 0a e6 |-...I...RDN..~..|
00010170 68 9e 95 14 c7 97 de 8b b8 f6 00 |h..........|
fd 3d 15 isn't a valid start of a record (records start at least with 00 which is the version of storable record format in Logged.
Given the presence of the history file and the ancient history file, it appears as though compaction didn't complete fully. My guess is that the process of replacing the work file with the history file didn't complete entirely. It doesn't fully explain the corruption of the last record. What I'm wondering there is whether the last 86 bytes are in fact part of the previous record, noting that the last record also has an offset of 30 (offsets should always increase in the commit log).
I think we should rename the ancient history file to the history file on detection of it during startup. The presence of ancient history indicates that compaction didn't complete as it is the last file to be renamed. Renaming ancient history will force compaction to start again.
We might also consider truncating a file if a record cannot be read.
Found an issue today where there is an abort of compaction due to the presence of some data corruption at the end of the history file. Here's a dump of the last records being read:
The error occurs on the last records of the history file. There shouldn't be an erroneous number of bytes at the end (86). The bytes don't make sense either as they don't start with
00 00 00
(which we expect). Here are the actual bytes on disk for that last record that can be read, and then the remaining 86 bytes:fd 3d 15
isn't a valid start of a record (records start at least with00
which is the version of storable record format in Logged.Given the presence of the history file and the ancient history file, it appears as though compaction didn't complete fully. My guess is that the process of replacing the work file with the history file didn't complete entirely. It doesn't fully explain the corruption of the last record. What I'm wondering there is whether the last 86 bytes are in fact part of the previous record, noting that the last record also has an offset of 30 (offsets should always increase in the commit log).
I think we should rename the ancient history file to the history file on detection of it during startup. The presence of ancient history indicates that compaction didn't complete as it is the last file to be renamed. Renaming ancient history will force compaction to start again.
We might also consider truncating a file if a record cannot be read.