mirage / irmin

Irmin is a distributed database that follows the same design principles as Git
https://irmin.org
ISC License
1.83k stars 154 forks source link

irmin-pack: error `unknown inode key` #2276

Open metanivek opened 11 months ago

metanivek commented 11 months ago

Tezos is experiencing a periodic issue on their testnet (has not been observed on mainnet):

Sep 25 20:38:10.105:   Error:
Sep 25 20:38:10.105:     {"Direct":["CoUnviMUq75Qy4i43Y18zSYf9TTTrCyrMHRyif4hWQLwA3EjD5kz",22250291624,360]}: unknown inode key (find_value)
Sep 25 20:38:10.105: 
Sep 25 20:38:10.106: worker crashed [validator-peer]:
Sep 25 20:38:10.106:   Error:
Sep 25 20:38:10.106:     {"Direct":["CoUnviMUq75Qy4i43Y18zSYf9TTTrCyrMHRyif4hWQLwA3EjD5kz",22250291624,360]}: unknown inode key (find_value)
Sep 25 20:38:10.106: 
Sep 25 20:38:10.120: fetching branch of 3 blocks from peer idtw8Zm1g1bUJEPyqDP6tC8EvjFv4u
Sep 25 20:38:10.120: validation of block BLEzAQFH6kzPu3Q9wKpCK68UscAhfMQwGZHLrK7e3NSuzdwaHDr
Sep 25 20:38:10.120:   failed but precheck succeeded,
Sep 25 20:38:10.120:   Request pushed on 2023-09-25T20:38:10.120-00:00, treated in 4.988us, completed in 77.456us:
Sep 25 20:38:10.120:   {"Direct":["CoUnviMUq75Qy4i43Y18zSYf9TTTrCyrMHRyif4hWQLwA3EjD5kz",22250291624,360]}: unknown inode key (find_value)
Sep 25 20:38:10.120: unexpected error while fetching headers:
Sep 25 20:38:10.120:   {"Direct":["CoUnviMUq75Qy4i43Y18zSYf9TTTrCyrMHRyif4hWQLwA3EjD5kz",22250291624,360]}: unknown inode key (find_value)
Sep 25 20:38:10.122: request
Sep 25 20:38:10.122:   New branch BLm8Dx1TwWBN7DA2YfiULePTR7z5BaVAFDnvcqvLshe3GySoAeZ, locator length 423475
Sep 25 20:38:10.122:   failed
Sep 25 20:38:10.122:   (Request pushed on 2023-09-25T20:38:10.119-00:00, treated in 253us, completed in 2.519ms):

We do not yet have reproduction steps, but here is what we know:

  1. Has only been observed on testnet
  2. Happens roughly at the same time across nodes that are run by the foundation
  3. These nodes regularly export snapshots for network backup
  4. It is unclear whether the error is coming from the read-write instance or the read-only instance
  5. Running irmin 3.7.2

When this occurred earlier this year, we guessed it might be related to GC and read-only instances not properly clearing the LRU. This was fixed in https://github.com/mirage/irmin/pull/2200.

We think there could be a bug in reading data that hasn't been flushed to disk since the append-only file does not allow reading beyond the last flushed offset, but this should never happen given higher level code in pack_store prevents this with the staging hash table (although this could use some inspection). Having a control file for the node at the time of this error would help us disprove this idea since we could compare the last written offset in the control file to the offset in the key that is logged.