n-y-z-o / nyzoVerifier

Verifier for the Nyzo cryptocurrency
The Unlicense

601 sentinel isn't working #27

Closed. orisica closed this issue 2 years ago.

orisica commented 4 years ago

Numerous people reported that latest sentinel versions aren't working properly, which could be the cause of recent cycle dropouts.

I had the same tracking problem with sentinels on the newest version. All is good after a downgrade. Plenty of power and multiple always-track verifiers per sentinel.

I can second that sentinels with v601 fall behind in some cases. I don't know the reason yet, because similar setups have different results.

I found my sentinel had stopped freezing blocks after I upgraded it to v601. Unfortunately it was supposed to be protecting one of my friend's verifiers and it dropped out of the cycle. I downgraded my Sentinel to v598 and it's been working fine so far.

I have just verified this myself on a fresh sentinel installation I created yesterday, sentinel gets stuck on a certain block and doesn't move. Downgrading to v587 fixes the issue.

Here is a log:

```
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
waiting for message queue to clear from thread [Thread-8], size is 1
requesting block with votes for height 8947804
[1599462571.662 (2020-09-07 07:09:31.662 UTC)]: trying to fetch BlockWithVotesRequest37 from e168...3ea5
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
[1599462571.751 (2020-09-07 07:09:31.751 UTC)]: block-with-votes response is [BlockWithVotesResponse(block=[Block: v=2, height=8947804, hash=bf18...b752, id=a132...726e], votes=0)]
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
requesting block with votes for height 8947804
[1599462574.716 (2020-09-07 07:09:34.716 UTC)]: trying to fetch BlockWithVotesRequest37 from acc0...f75a
[1599462574.861 (2020-09-07 07:09:34.861 UTC)]: block-with-votes response is [BlockWithVotesResponse(block=[Block: v=2, height=8947804, hash=bf18...b752, id=a132...726e], votes=0)]
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
froze block [Block: v=2, height=8947803, hash=7d87...dace, id=5228...5e93], efficiency: 11.0%
```
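For anyone who wants to confirm the same pattern in their own logs, a minimal Python sketch along these lines could work. It only counts how often each height is re-reported as frozen and how many votes arrive for the next height; the regexes follow the log lines quoted above, and the log file path is just a placeholder.

```python
# Sketch: summarize a sentinel log to spot the stall pattern shown above
# (one height frozen over and over, block-with-votes responses with votes=0).
import re
from collections import Counter

FROZE = re.compile(r"froze block \[Block: v=\d+, height=(\d+)")
VOTES = re.compile(r"BlockWithVotesResponse\(block=\[Block: v=\d+, height=(\d+).*?votes=(\d+)\)")

def summarize(log_path="sentinel.log"):       # placeholder path
    froze_counts = Counter()                  # how often each height was re-reported as frozen
    vote_totals = Counter()                   # total votes seen per requested height
    with open(log_path, "r", errors="replace") as log:
        for line in log:
            m = FROZE.search(line)
            if m:
                froze_counts[int(m.group(1))] += 1
            m = VOTES.search(line)
            if m:
                vote_totals[int(m.group(1))] += int(m.group(2))
    if froze_counts:
        height, repeats = froze_counts.most_common(1)[0]
        print(f"most repeated frozen height: {height} ({repeats} log lines)")
        print(f"votes seen for height {height + 1}: {vote_totals.get(height + 1, 0)}")

if __name__ == "__main__":
    summarize()
```

A stuck sentinel shows up as a single height with a very large repeat count and zero votes for the height after it.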

EggPool commented 4 years ago

I can confirm several users reported newest sentinels "freezing".

I did experience that myself.
v595 sentinel fed by several in-cycle verifiers and in-queue verifiers with "always_track_blockchain = 1". Plenty of disk space, plenty of CPU and RAM left.

This sentinel regularly ends up stuck with
"Protecting verifiers: uncertain"
When frozen, all out-of-cycle verifiers show "0" and all in-cycle verifiers show "10". Average TTL: 1.5 days.

Logs look like:

```
froze block [Block: v=2, height=8831679, hash=82c0...fa49, id=e503...af39], efficiency: 10,4%
froze block [Block: v=2, height=8831679, hash=82c0...fa49, id=e503...af39], efficiency: 10,4%
[..repeat..]
froze block [Block: v=2, height=8831679, hash=82c0...fa49, id=e503...af39], efficiency: 10,4%
requesting block with votes for height 8831680
trying to fetch BlockWithVotesRequest37 from 34a7...04c9
froze block [Block: v=2, height=8831679, hash=82c0...fa49, id=e503...af39], efficiency: 10,4%
waiting for message queue to clear from thread [Thread-29], size is 1
waiting for message queue to clear from thread [Thread-19], size is 1
[..repeat..]
waiting for message queue to clear from thread [Thread-26], size is 1
block-with-votes response is [BlockWithVotesResponse(block=[Block: v=2, height=8831680, hash=0acb...1d6c, id=eb4c...baa9], votes=0)]
froze block [Block: v=2, height=8831679, hash=82c0...fa49, id=e503...af39], efficiency: 10,4%
[..repeat..]
```

The frozen block is not moving anymore (the managed verifiers, however, do work and track blocks). I have to resync blocks from a healthy sentinel and restart.
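A rough sketch of automating that resync-and-restart workaround is below. The data directory, the healthy-sentinel host, and the systemd unit name are assumptions and have to be adapted to the actual installation.

```python
# Sketch of the "resync blocks from a healthy sentinel and restart" workaround.
# All paths and names below are assumptions; adjust them to your own setup.
import subprocess

BLOCK_DIR = "/var/lib/nyzo/production/blocks/"   # assumed block directory
HEALTHY_HOST = "user@healthy-sentinel"           # hypothetical healthy sentinel host
SERVICE = "nyzo_sentinel"                        # assumed systemd unit name

def resync_and_restart():
    # Stop the stuck sentinel before touching its block files.
    subprocess.run(["systemctl", "stop", SERVICE], check=True)
    # Pull the block files from the sentinel that is still tracking.
    subprocess.run(
        ["rsync", "-a", "--delete", f"{HEALTHY_HOST}:{BLOCK_DIR}", BLOCK_DIR],
        check=True,
    )
    subprocess.run(["systemctl", "start", SERVICE], check=True)

if __name__ == "__main__":
    resync_and_restart()
```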

EggPool commented 4 years ago

Update with more tests, since this is happening again and again, including with the newest version.

Context:

The issue:

Looking at the logs, they are always the same kind, be it when stuck or after a restart and stuck again. Roughly:

I first thought of a resource issue, since the main failing sentinel was on an HDD machine. I duplicated that always-failing sentinel on a fast SSD machine with a powerful CPU, 8 cores, and 32 GB of RAM. The morning after, both the original (v595) and the clone (v605, same managed_verifiers) were stuck, both with the above log pattern.

So, this seems to point to something verifier-related. However, while the stuck sentinel only got "votes=0" answers, the very same in-cycle verifiers were properly answering other, working and tracking sentinels, sending votes, and tracking the chain.

At this stage, I'd look at the network/message level to see whether the block-with-votes response messages could be truncated or similar.

What more can I do to help troubleshoot? I have real logs if needed.

razzle-san commented 4 years ago

As a case study, here's what happens. I have 3 sentinels watching some queued and some in-cycle verifiers. The first two are v597 and v587, both not working despite everything I've tried. I've been tinkering with them for days without any luck. The only one that's working is a v605 tracking only in-cycle verifiers (no queued ones).

[screenshot: sentinels]

Later edit: I've removed all of the queued verifiers from the v597 sentinel, and although it took a while to catch up, it's now tracking just fine. It seems to be an issue related to queued verifiers. The v587 one has been restarted several times, and now it's showing this: [screenshot]

NeoDimi commented 3 years ago

I'm adding my experience with the same issue in case it helps:

I couldn't get my sentinels to start protecting verifiers so I wiped the VPS of a sentinel clean (new install of ubuntu 18.04) and reinstalled the sentinel completely. Everything started off fine until the sentinel stopped tracking after a short while (see below)

[screenshot]

Now I'm back to seeing 10s in the web listener for the managed verifiers that are tracking the blockchain and 0s for the others.

[screenshot]

Additional info: the managed verifiers tracking the blockchain do seem to actually be tracking. Extract from the log:

```
storing new vote, height=9536524, hash=fc05...50b8
freezing block [Block: v=2, height=9536524, hash=fc05...50b8, id=2614...5bd3] with standard mechanism
after registering block [Block: v=2, height=9536524, hash=fc05...50b8, id=2614...5bd3] in BlockchainMetricsManager(), cycleTransactionSum=∩10.179444, cycleLength=2319
BlockVoteManager: removing vote map of size 2301
cleaning up because block 9536524 was frozen
casting late vote for height 9536524
^^^^^^^^^^^^^^^^^^^^^
casting vote for height 9536524
sending verifier-removal vote to node at IP: 95.216.195.20
maximum cycle transaction amount=∩100000.000000, balance=∩72709327.076724
Unable to determine top verifier. This is normal; out-of-cycle verifiers are not typically notified about new-verifier votes.
^^^^^^^^^^^^^^^^^^^^^
casting vote for height 9536525
storing new vote, height=9536525, hash=fa0b...8b69
freezing block [Block: v=2, height=9536525, hash=fa0b...8b69, id=da0b...90e7] with standard mechanism
after registering block [Block: v=2, height=9536525, hash=fa0b...8b69, id=da0b...90e7] in BlockchainMetricsManager(), cycleTransactionSum=∩10.179444, cycleLength=2319
BlockVoteManager: removing vote map of size 2315
cleaning up because block 9536525 was frozen
casting late vote for height 9536525
^^^^^^^^^^^^^^^^^^^^^
casting vote for height 9536525
sending verifier-removal vote to node at IP: 159.69.105.183
maximum cycle transaction amount=∩100000.000000, balance=∩72709327.076724
```

Neither the sentinel nor the verifiers tracking the blockchain have a full disk (3%, 63%, and 72% usage respectively), and they seem to have plenty of CPU and RAM capacity available. The sentinel has a total of 25 verifiers in its care. The sentinel version is v606 and the verifiers are a mix of v600 and v601.

I will try to downgrade the sentinel version to see if it works. I have an in-cycle verifier (not managed by the sentinel shown above) that is running naked (unprotected) at the minute. I really need to get some sentinels to work and to protect verifiers.

orisica commented 3 years ago

Numerous reports of sentinels not working keep coming in on Discord. The sentinel works for a bit, then randomly freezes on a specific block and stops protecting verifiers. Deleting the blocks and restarting fixes it temporarily, until it gets stuck again after a few hours.
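Since the stall comes back after a few hours, a rough sketch of automating that delete-and-restart workaround from cron could look like the following. It restarts only when the last frozen height in the log has stopped advancing; the log path, block directory, state file, and service name are assumptions to adapt.

```python
# Sketch: run from cron every few minutes. If the latest "froze block" height
# has not advanced since the previous run, clear the block directory and
# restart the sentinel, mirroring the manual workaround described above.
# All paths and the service name are assumptions for illustration.
import os
import re
import shutil
import subprocess

LOG_PATH = "/var/log/nyzo-sentinel.log"          # assumed log location
BLOCK_DIR = "/var/lib/nyzo/production/blocks"    # assumed block directory
STATE_FILE = "/tmp/sentinel_last_height"         # remembers the last observed height
SERVICE = "nyzo_sentinel"                        # assumed systemd unit name
FROZE = re.compile(r"froze block \[Block: v=\d+, height=(\d+)")

def last_frozen_height():
    height = None
    with open(LOG_PATH, "r", errors="replace") as log:
        for line in log:
            m = FROZE.search(line)
            if m:
                height = int(m.group(1))
    return height

def check_and_recover():
    current = last_frozen_height()
    previous = None
    if os.path.exists(STATE_FILE):
        previous = int(open(STATE_FILE).read().strip() or 0)
    with open(STATE_FILE, "w") as f:
        f.write(str(current or 0))
    if current is not None and current == previous:
        # Height has not advanced since the last check: apply the workaround.
        subprocess.run(["systemctl", "stop", SERVICE], check=True)
        shutil.rmtree(BLOCK_DIR, ignore_errors=True)
        subprocess.run(["systemctl", "start", SERVICE], check=True)

if __name__ == "__main__":
    check_and_recover()
```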

I can confirm that the issue is happening on the latest version (607).

MyAltcoins commented 3 years ago

I have tested all sentinel versions from 602 to 607, and all exhibit the same issue. Sentinels get stuck after a few hours to a few days on a particular block and stop protecting verifiers.

It happens even when sentinels are managing in-cycle verifiers only.

A fix would be greatly appreciated.

NeoDimi commented 3 years ago

I'm now running version 587 of the sentinel. It doesn't solve the problem by any means but seems to at least give me several days of uptime at a time.

The behavior is quite strange: I have 3 sentinels, and each of my verifiers is covered by 2 of them, so the sentinels share a significant number of verifiers. They manage both in-cycle and queued verifiers. Typically, over 10 to 15 days, sentinel 1 may work the whole time, sentinel 2 will need a restart once or twice, and sentinel 3 will need a restart every 2 days. The next 10 or 15 days will show the same type of pattern, except that sentinel 3 is now the stable one and sentinel 1 needs almost daily restarts... It seems very random to me. It does its job as long as I check on the sentinels twice a day. I second MyAltcoins: a fix would be greatly appreciated.

EggPool commented 3 years ago

@MyAltcoins @NeoDimi Did you have a look at the sentinel logs once stuck? Is it the same pattern as what I describe in https://github.com/n-y-z-o/nyzoVerifier/issues/27#issuecomment-704728227 ?

EggPool commented 3 years ago

Update: after days of debugging and log analysis, I was able to make significant progress and will propose a fix very soon.

n-y-z-o commented 3 years ago

This issue may be fixed in version 608, if the issue that others are seeing is the same issue we have seen.

Our experience with Nyzo is actually more limited than many others in the cycle. With the joining of Argo 2080, we have only 10 in-cycle verifiers, and we have a total of 5 sentinels (including quark.nyzo.co). We did not experience this problem with any of our sentinels until doing some testing on an early version of 608 a few months ago. That doesn't mean this hasn't been a big problem -- we just haven't been able to experience the problem the way others have because of our limited setup.

We've seen a total of 3 times that a verifier became inactive. All 3 times, an invalid block was frozen. And the invalid blocks all had the same qualities in common:

After adding the signature check to the in-development 608 code, we added logging to see when a block with an invalid signature was received. This happened one time after adding the logging: an invalid block was received on quark.nyzo.co from Argo 752 for height 11509752. That block was verified by verifier 86eb...1898 (nickname A1200O4898). The account for 86eb...1898 had been emptied at block 11448867. Due to the signature check, the block was discarded, fetched again, and the sentinel continued without interruption.

The previous instances of stalls happened at blocks 11373452, 11438953, 11455852. If you look at these, you'll see the same pattern. They all contain meta transactions, and the accounts of the verifiers had been emptied recently before the blocks were verified. So, it makes sense to us that there's some protection causing the transactions to be removed from blocks because the accounts are mistaken to be empty. We added additional logging to try to figure out where the transactions were being stripped -- on the sender or in reassembly on the sentinel -- but we stopped receiving invalid blocks so we were unable to track further. We could have easily tested by deliberately producing invalid blocks, but this would have only further confirmed that the signature checks were performing as intended. It would not have gotten us any closer to understanding why invalid blocks were being received.
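As a conceptual illustration only (not the actual nyzoVerifier code), the safeguard amounts to verifying the block's Ed25519 signature against the verifier identifier before freezing, and re-fetching the block when the check fails. The Block fields and the freeze/refetch callbacks below are hypothetical placeholders; the Ed25519 part reflects the fact that Nyzo identifiers are Ed25519 public keys.

```python
# Conceptual sketch of a pre-freeze signature check; not the project's Java code.
from dataclasses import dataclass

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

@dataclass
class Block:                       # hypothetical stand-in for the sentinel's block object
    height: int
    verifier_identifier: bytes     # Ed25519 public key of the verifier (32 bytes)
    verifier_signature: bytes      # Ed25519 signature over the block contents (64 bytes)
    signed_content: bytes          # serialized block contents covered by the signature

def signature_is_valid(block: Block) -> bool:
    public_key = Ed25519PublicKey.from_public_bytes(block.verifier_identifier)
    try:
        public_key.verify(block.verifier_signature, block.signed_content)
        return True
    except InvalidSignature:
        return False

def freeze_if_valid(block: Block, freeze, refetch):
    if signature_is_valid(block):
        freeze(block)              # signature checks out; safe to freeze
    else:
        refetch(block.height)      # discard the invalid block and request it again
```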

The lack of sentinel signature checks could have also been exploited maliciously, although not very effectively.

Again, we're not sure if we are even seeing the same issue that others are seeing, so this may not be helpful at all. If you are having problems with a particular sentinel, however, upgrading that sentinel to 608 might be worth considering.

EggPool commented 3 years ago

This is coherent with what I traced down myself on sentinels that were stuck very frequently. See https://github.com/n-y-z-o/nyzoVerifier/pull/39 for details.

This explains more precisely why some blocks may be invalid.
I don't think it happens only when the account was recently emptied; this happened way too often to me (every time, however, a meta transaction was missing).

I also believe the fix in 608 solves the issue indirectly, by preventing blocks with missing transactions from being frozen.
The balance list hash check, while now redundant, is an added safety.

MyAltcoins commented 3 years ago

I can confirm that EggPool's changes completely fixed the sentinel issues.

NeoDimi commented 3 years ago

I'm updating one of my sentinels, which often needed restarts, to v608. I'll give feedback in a few days. Thanks for the fix!

NeoDimi commented 3 years ago

> I'm updating one of my sentinels, which often needed restarts, to v608. I'll give feedback in a few days. Thanks for the fix!

It's been almost three weeks and I haven't had to restart a sentinel using this version even once. Kudos and thanks!