mirage / irmin

Irmin is a distributed database that follows the same design principles as Git
https://irmin.org
ISC License
1.85k stars 155 forks source link

Kind.of_magic: unexpected magic char '\000' #1861

Open tomjridge opened 2 years ago

tomjridge commented 2 years ago

https://app.slack.com/client/T59LZHQ11/C596FGNUR/thread/C596FGNUR-1654186374.376829

From that link:

Looks like we also have one v13.0 node with a strange error that I haven't seen before:

Jun 4 11:29:08.585 - node.validator.bootstrap_pipeline: in 79 steps
Jun 4 11:29:08.585 - node.validator.bootstrap_pipeline: fetching branch of about 2206 blocks from peer idqZGXnf4QdF461yEV4LJ2GHkfJAKp
Jun 4 11:29:18.230 - validator.peer: Worker crashed [node_peer_validator]:
Jun 4 11:29:18.230 - validator.peer: Kind.of_magic: unexpected magic char '\000'
Jun 4 11:29:18.230 - validator.peer: Request pushed on 2022-06-04T11:29:08.568-00:00, treated in 525us, completed in 9.661s ,
Jun 4 11:29:18.230 - validator.peer: Kind.of_magic: unexpected magic char '\000'
Jun 4 11:29:18.230 - validator.peer: New branch BKtqy7ZFfR3nPz6Xz2a3jLLzgR4oGbbxF6H8WkMjGdyhQxxPDFo, locator length 2631086
Jun 4 11:29:18.230 - node.validator.bootstrap_pipeline: Kind.of_magic: unexpected magic char '\000'
Jun 4 11:29:18.230 - node.validator.bootstrap_pipeline: unexpected error while fetching headers:
Jun 4 11:29:18.230 - validator.block: Kind.of_magic: unexpected magic char '\000'
Jun 4 11:29:18.230 - validator.block: Request pushed on 2022-06-04T11:29:09.162-00:00, treated in 17.47us, completed in 9.67s ,
Jun 4 11:29:18.230 - validator.block: validation of block BLCtcD31JJpFbLEDY6oiVK9UXkHtxDYqcHRBGG4b5GKW3BuuPPc failed Error: Kind.of_magic: unexpected magic char '\000'

The node was stuck for several hours. The problem did not go away after a simple restart. I then deleted peers.json and even the identity file, but after reconnect to one of our other nodes the problem did not go away.

Ngoguey42 commented 2 years ago

@tomjridge any news on that ?

tomjridge commented 2 years ago

I think @icristescu interacted with them. I don't think I saw anything further on the slack channels I'm signed up to. I guess if it is a problem they will raise an issue on tezos issues. So perhaps we can close this for now.

tomjridge commented 2 years ago

After checking the latest tezos issues, this may be related: https://gitlab.com/tezos/tezos/-/issues/3249

There is an initial report by @vishakh on tezos slack #general on Jun 2. This does not include the "Kind.of_magic" log line. In the thread conversation Alexander Eichhorn posts a similar report about a node "stuck for several hours" which does contain the "Kind.of_magic" error. That is the message I copied in the initial issue description above. User vishakh then opened tezos issue 3249.

tomjridge commented 2 years ago

Just a quick comment: the Kind.of_magic error fails because it reads a "0" byte, and this does not correspond to any of the object tags we use. One way this can happen is if there is a crash whilst appending data. A weak model of writing to the end of a file is that the file is first extended to the correct length, and filled with 0 bytes. Then the actual data is written over the 0 bytes. If there is a crash between the point the file is extended, and the point the data actually makes it to disk, there can be 0 bytes at the end of the file.

This model is described here: https://tom-ridge.blogspot.com/2022/07/a-model-of-file-update.html

(This isn't just a made up ad-hoc model - this is the "standard" mental model people use I believe when working with files.)

Of course, it could also be due to some other reason, e.g. some bad code writes a zero byte at the wrong offset for example.