Open ancibanci opened 1 year ago
This validator was issuing quite a few disputes; we can see two reasons in the logs:
Failed to validate candidate para_id=Id(2000) error=InvalidCandidate(PrepareError("panic: called `Option::unwrap()` on a `None` value"))
Location: /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/cranelift-codegen-0.88.0/src/isa/x64/encoding/rex.rs:478
Failed to validate candidate para_id=Id(2090) error=InvalidCandidate(AmbiguousWorkerDeath)
The first error is really weird: Prepare means it happened during artifact preparation. So how can there be an `Option::unwrap()` error that is only happening on this validator? Do we have any clue what option that might be? @mrcnski?
AmbiguousWorkerDeath is also kind of unexpected; with 64 GiB of RAM, OOM seems an unlikely reason.
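For reference, that message is the stock panic Rust emits whenever `Option::unwrap()` is called on a `None`, so it says nothing about which option was involved; a two-line program reproduces it exactly:

```rust
fn main() {
    // Panics with: called `Option::unwrap()` on a `None` value
    let missing: Option<u32> = None;
    missing.unwrap();
}
```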
> Do we have any clue what option that might be?
I assume the backtrace points to that unwrap:
Dec 08 23:28:52 nd21b22 polkadot[397866]: The application panicked (crashed).
Dec 08 23:28:52 nd21b22 polkadot[397866]: Message: called `Option::unwrap()` on a `None` value
Dec 08 23:28:52 nd21b22 polkadot[397866]: Location: /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/cranelift-codegen-0.88.0/src/isa/x64/encoding/rex.rs:478
Following the links:
> The first error is really weird: Prepare means it happened during artifact preparation. So how can there be an `Option::unwrap()` error that is only happening on this validator? Do we have any clue what option that might be? @mrcnski?
I saw this but didn't have any ideas; I'd have to sit down and do a deep dive.
Edit: it's a panic error, which most likely happened during the compilation itself.
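To illustrate the mechanism (a simplified sketch, not the actual PVF host code; `prepare_artifact` is a made-up name): a panic raised while compiling can be caught on the host side and its payload stringified, which is how it would surface as a `PrepareError("panic: ...")` like the one in the logs:

```rust
use std::panic::{self, AssertUnwindSafe};

// Simplified sketch: run a compilation closure, catch any panic, and turn
// the panic payload into a string error. The real host code differs; this
// only shows how a panic message can end up inside a prepare error.
fn prepare_artifact(compile: impl FnOnce()) -> Result<(), String> {
    panic::catch_unwind(AssertUnwindSafe(compile)).map_err(|payload| {
        // Panic payloads are usually a `&str` or a `String`.
        let msg = payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "unknown panic".to_string());
        format!("panic: {msg}")
    })
}

fn main() {
    let res = prepare_artifact(|| panic!("called `Option::unwrap()` on a `None` value"));
    println!("{res:?}"); // Err("panic: called `Option::unwrap()` on a `None` value")
}
```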
> AmbiguousWorkerDeath is also kind of unexpected; with 64 GiB of RAM, OOM seems an unlikely reason.
Is the "retry on AWD" (AmbiguousWorkerDeath) change in this version? I noticed in the code that we don't log when we retry, but that would be nice to have.
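Something like the following would be enough (a hypothetical sketch: `ValidationError` and `validate_with_retry` are stand-in names, not the actual candidate-validation code):

```rust
use log::warn;

// Hypothetical sketch of logging before retrying on AmbiguousWorkerDeath.
// The types, names, and single-retry policy are stand-ins for whatever the
// candidate-validation code actually does.
#[derive(Debug)]
enum ValidationError {
    AmbiguousWorkerDeath,
    Other(String), // all other failure modes, not retried here
}

fn validate_with_retry(
    mut execute_candidate: impl FnMut() -> Result<(), ValidationError>,
) -> Result<(), ValidationError> {
    match execute_candidate() {
        Err(ValidationError::AmbiguousWorkerDeath) => {
            // The log line this thread is asking for.
            warn!("candidate validation failed with AmbiguousWorkerDeath, retrying once");
            execute_candidate()
        }
        other => other,
    }
}
```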
Do you want me to perform some tests maybe? Should I rebuild my validator or just restart and monitor?
Do you use ECC memory in your setup?
`reg.to_real_reg()` ... some configuration mismatch in cranelift for that particular processor? In general, from my perspective this is most likely either a cranelift bug or some weird hardware fault, the former being more realistic.
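To make that failure mode concrete (a toy illustration with made-up types; only the `to_real_reg().unwrap()` shape mirrors the cranelift call site): the x64 encoder assumes register allocation has already replaced every virtual register with a physical one, so anything else reaching it, a leftover virtual register or a bit-flipped value, turns the `Option` into `None` right at that `unwrap()`:

```rust
// Toy types mimicking the pattern behind the panic; cranelift's real types
// are different, only the `to_real_reg().unwrap()` shape is the same.
#[derive(Clone, Copy)]
struct RealReg(u8);

#[derive(Clone, Copy)]
enum Reg {
    Real(RealReg),
    // Virtual registers must not survive past register allocation.
    Virtual(u32),
}

impl Reg {
    fn to_real_reg(self) -> Option<RealReg> {
        match self {
            Reg::Real(r) => Some(r),
            Reg::Virtual(_) => None,
        }
    }
}

fn encode_reg(reg: Reg) -> u8 {
    // If a virtual (or corrupted) register reaches encoding, this line
    // panics with: called `Option::unwrap()` on a `None` value
    reg.to_real_reg().unwrap().0
}
```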
@ancibanci what version of Polkadot are you running? And yes, rebuilding or using a differently provided binary and restarting might shed some more light.
From the logs: version 0.9.36-dc25abc712e.
From the specs it looks like https://www.hetzner.com/dedicated-rootserver/ax41-nvme/ probably with Non-ECC RAM, is that correct?
From the previous reports it does feel like either corrupted storage or memory, but I'm hesitant to say that just because I don't have any other explanation.
@vstakhov No, I don't use ECC memory. @eskimor The version is 0.9.36. @ordian The server is not at Hetzner but at Mevspace, in a datacenter in Warsaw.
I will try to rebuild and see what happens, and I will write an update tomorrow.
Thank you!
Using non-ECC memory to run server software is a risky choice, since bit errors are bound to occur eventually and non-ECC memory cannot detect or correct them.
After downloading the binary and restarting, I again started getting "buffer is full" errors:
Jan 04 18:46:25 nd21b22 polkadot[1187609]: 2023-01-04 18:46:25 💤 Idle (40 peers), best: #16047168 (0xb747…6257), finalized #16047165 (0x769d…0f01), ⬇ 2.2MiB/s ⬆ 2.0M>
Jan 04 18:46:30 nd21b22 polkadot[1187609]: 2023-01-04 18:46:30 ✨ Imported #16047169 (0xd50c…7bfa)
Jan 04 18:46:30 nd21b22 polkadot[1187609]: 2023-01-04 18:46:30 ♻️ Reorg on #16047169,0xd50c…7bfa to #16047169,0xcb01…d34c, common ancestor #16047168,0xb747…6257
Jan 04 18:46:30 nd21b22 polkadot[1187609]: 2023-01-04 18:46:30 ✨ Imported #16047169 (0xcb01…d34c)
Jan 04 18:46:30 nd21b22 polkadot[1187609]: 2023-01-04 18:46:30 💤 Idle (40 peers), best: #16047169 (0xcb01…d34c), finalized #16047165 (0x769d…0f01), ⬇ 1.6MiB/s ⬆ 1.4M>
Jan 04 18:46:34 nd21b22 polkadot[1187609]: 2023-01-04 18:46:34 dropping (Stream 643314fa/67) because buffer is full
Jan 04 18:46:34 nd21b22 polkadot[1187609]: 2023-01-04 18:46:34 dropping (Stream 643314fa/69) because buffer is full
Jan 04 18:46:34 nd21b22 polkadot[1187609]: 2023-01-04 18:46:34 dropping (Stream 643314fa/71) because buffer is full
Jan 04 18:46:34 nd21b22 polkadot[1187609]: 2023-01-04 18:46:34 dropping (Stream 643314fa/73) because buffer is full
7 hours ago I got "bad assignment from peer":
Jan 05 05:11:40 nd21b22 polkadot[1221292]: 2023-01-05 05:11:40 💤 Idle (40 peers), best: #16053402 (0xaae7…2024), finalized #16053398 (0x84a4…5a6a), ⬇ 2.0MiB/s ⬆ 1.2M>
Jan 05 05:11:42 nd21b22 polkadot[1221292]: 2023-01-05 05:11:42 ✨ Imported #16053403 (0x97ca…5806)
Jan 05 05:11:45 nd21b22 polkadot[1221292]: 2023-01-05 05:11:45 💤 Idle (40 peers), best: #16053403 (0x97ca…5806), finalized #16053400 (0xbdd7…01ec), ⬇ 2.0MiB/s ⬆ 1.1M>
Jan 05 05:11:45 nd21b22 polkadot[1221292]: 2023-01-05 05:11:45 Got a bad assignment from peer hash=0x443636cf57f02b647a77718472b34d2ddf68ac0636322f97455d877fb6c2f26e >
Jan 05 05:11:45 nd21b22 polkadot[1221292]: 2023-01-05 05:11:45 Chain between (0x948c…be16, 16053401) and 16053398 not fully known. Forcing vote on 16053398 unknown_nu>
Jan 05 05:11:45 nd21b22 polkadot[1221292]: 2023-01-05 05:11:45 Got a bad assignment from peer hash=0xc0a46dd9573bfba881d1b11b19b4b2abf20616ac599e429c63e171a7b36c03a3 >
Jan 05 05:11:45 nd21b22 polkadot[1221292]: 2023-01-05 05:11:45 Got a bad assignment from peer hash=0xc0a46dd9573bfba881d1b11b19b4b2abf20616ac599e429c63e171a7b36c03a3 >
And my node got stuck at block 16055683 and couldn't produce any blocks. It was chilled again very soon after that.
Jan 05 12:09:11 nd21b22 polkadot[1230308]: 2023-01-05 12:09:11 ♻️ Reorg on #16055683,0x6d7b…8a7f to #16055684,0x01a1…001d, common ancestor #16055682,0x255a…0396
Jan 05 12:09:11 nd21b22 polkadot[1230308]: 2023-01-05 12:09:11 ✨ Imported #16055684 (0x01a1…001d)
Jan 05 12:09:11 nd21b22 polkadot[1230308]: 2023-01-05 12:09:11 Candidate included without being backed? candidate_hash=0xe3a77990cde0032b0ca70dcfc10c62c87159809550e18>
Jan 05 12:09:11 nd21b22 polkadot[1230308]: 2023-01-05 12:09:11 Candidate included without being backed? candidate_hash=0x1556ed7038a7502d69107ccb5f36217ead41fd6185180>
Jan 05 12:09:11 nd21b22 polkadot[1230308]: 2023-01-05 12:09:11 Candidate included without being backed? candidate_hash=0xd7f1b4db5731f8d216f8023dbd3b074d748d0356a0063>
Jan 05 12:09:11 nd21b22 polkadot[1230308]: 2023-01-05 12:09:11 Candidate included without being backed? candidate_hash=0xb23583dcdcf82a651c375a44eb2362f37ab1745542742>
The logs themselves tell us very little, unfortunately. None of the logs after the restart should result in chilling. Could it be that the validator was already chilled?
Hi, I am running a validator on Kusama and I got some weird errors in the last few weeks, appearing occasionally (three times in the last three weeks).
My configuration: AMD Ryzen 5 3600, 6c/12t, 3.60 GHz, 480 GB NVMe, 64 GB RAM, 1 Gbps.
Then today I got a different error, which resulted in my node being chilled.