Closed jinnatar closed 3 years ago
I cannot reproduce this on 0.5.1.
The logs posted in the issue description above are not triggering a skipped signage point event.
That can be quickly verified by anyone, if you replace the content of the scrambled.txt test case and re-run the unit tests.
For your reference, the test is checking that none of the debug log events in that log are generating any events:
I've gotten many of these notifications today, but when checking the logs, I can find actual skips - not false alarms from chiadog
. Most recent one being:
2021-05-20T19:53:28.925 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 62/64:
2021-05-20T19:53:38.107 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 63/64:
2021-05-20T19:53:59.905 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 1/64:
2021-05-20T19:54:08.305 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 2/64:
I found a case from earlier today when I got notification and also reproduces the issue with the test:
2021-05-20T18:43:34.798 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 8/64:
2021-05-20T18:43:42.919 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 9/64:
2021-05-20T18:43:51.653 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 10/64:
2021-05-20T18:44:10.151 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 12/64:
2021-05-20T18:44:11.085 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 11/64:
2021-05-20T18:44:19.641 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 13/64:
Not sure this should be classified as a bug however. Signage points should arrive every ~9 seconds. SP-11 was supposed to arrive by 18:44:01. It didn't arrive. It was skipped and SP-12 arrived as expected 9 seconds later.
The fact that we also got SP-11 together (or even after SP-12) doesn't mean it wasn't skipped. Question: does the consensus allow processing and responding to signage point challenges out of order? Even if the farmer responds to a challenge for SP-11, it'll be already at least 10 seconds late. So from my point of view it's as good as skipped.
What we can certainly do is to further throttle down the notifications for these occasional events however.
I am also getting a lot of warnings about skipped SP, but I can't find them in the log, so I guess I got the same problem. Where can I find the exact mentioned SP and timestamps for those?
Imho skipped and late SPs are two different things. What's annoying is, that I get multiple messages per hour via telegram about this issue. The warning says "Experiencing networking issues? Skipped 2+ signage points in the last hour. (non_skipped_signage_points.py:53)", but I get it several times per hour, every hour. Monitoring should inform about wrong things you can and should fix, not spam you with things that seem to be normal in the current chia network status. ;)
I propose warning thresholds should be configurable and out of order SPs should not be a warning or at least also configurable if it's a warning or not.
And btw Chia 1.1.6 was released today, which fixes some signate point and caching stuff. https://github.com/Chia-Network/chia-blockchain/commit/1c808b6c2910ed32fdbfdfc576ba1bc5a5adeac9 (But I still get multiple warnings per hour from chiadog.)
Further: great tool, thanks a lot for this!
The fact that we also got SP-11 together (or even after SP-12) doesn't mean it wasn't skipped. Question: does the consensus allow processing and responding to signage point challenges out of order? Even if the farmer responds to a challenge for SP-11, it'll be already at least 10 seconds late. So from my point of view it's as good as skipped.
To my understanding there is nothing that would forbid signing points out of order, since the P2P-Network is not reliable in this way. Also i don't think those have references to the previous SPs, so signing SP 5/64 first and then SP 4/64 should not be a problem consensus-wise. As far as the time you have for the SP goes, to my knowledge you have like 30sec, and since SPs come in around 8-10sec, you should be able to sign them out of order in the appropriate timeslot, but it's less likely that you will make it in time, yeah.
I'm considering removing these checks from chiadog at least for the time being. The harvester monitoring already does 99% of the important checks. Maybe will add a configuration with default state disabled.
I'm considering removing these checks from chiadog at least for the time being. The harvester monitoring already does 99% of the important checks. Maybe will add a configuration with default state disabled.
I would keep the checks and still count the number of SPs missed, but stop notifying if you miss some. This will happen regardless of how good the checks are implemented, since...well...it's the internet and P2P on top of that :) What i think is valuable to keep is the % of missed SPs on the daily report. I hope that stays.
Yes, I totally agree on that approach. Maybe also keep the alert notifications but only for >5% of skips over the past hour and not for individual occurrences 😄
Yes, I totally agree on that approach. Maybe also keep the alert notifications but only for >5% of skips over the past hour and not for individual occurrences
I'd love to see that % value configurable, with the default set to lets say something between 2-5%
Thanks for this not being a warning event anymore!
This just triggered a "skipped" log info in v0.6.0 in my farmer log:
2021-05-26T02:53:41.229 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 34/64:
2021-05-26T02:54:00.457 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 36/64:
2021-05-26T02:54:00.631 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 35/64:
2021-05-26T02:54:09.634 full_node chia.full_node.full_node: INFO ⏲️ Finished signage point 37/64:
Detected 1 skipped signage point.This is expected to happen occasionally and not a reason for concern. (non_skipped_signage_points.py:39)
(A missing space btw. ;))
But why is a cached and instantly catched up signage point still a "skipped" signage point? You could log a more precise info like "Finished (cached) signage point out of order, but nothing missed"?
Great work!
Thanks for the feedback. This is a limitation of the current implementation which only looks at the current state and history. Fixing that would add substantial complexity (and potential bugs) since it requires also looking into the future - and it's not clear how much into the future is sufficient - sometimes it's only 1 SP, sometimes 10+ SPs (during soft-fork). One way would be to verify all signage points in chunks of 64, and waiting to see if all 64/64 SPs were received, though that also introduces interesting edge cases with soft-forks in-between these "chunks".
It's not something I intend to fix because the impact is small and time investment is big. It could be a fun challenge to handle all scenarios properly, so if anyone has the time and willingness to look into it, contributions are welcome :-)
I agree this is not worth the additional effort.
Describe the bug If signage points are logged out of order, skip alerts are triggered.
Chia debug log:
Chiadog log from same time:
Notice signage point 12 that is logged before 11 completes.
Environment: