martomi / chiadog

A watch dog providing a peace in mind that your Chia farm is running smoothly 24/7.
MIT License
457 stars 121 forks source link

Signage points out of order are considered skips #145

Closed jinnatar closed 3 years ago

jinnatar commented 3 years ago

Describe the bug If signage points are logged out of order, skip alerts are triggered.

Chia debug log:

2021-05-19T13:14:51.592 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 9/64: 99dc2878cd36e748f35caf8b48fc3e048dc1d2306c6e5675a51f5a6beaaf8826
2021-05-19T13:14:52.165 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 10/64: fbd8a6cbfd9ceb973842b998f9e221791495bb0b95bb594078ff9022dd0c64c4
2021-05-19T13:15:24.908 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 12/64: 3e0f1d609eef221f21a5bb8f5c5290380563d265da219f49fca47c4cb1b8392d
2021-05-19T13:15:24.912 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 11/64: 2855217036e581b98e22e8776b41e61495751a7b2a9da27976120fef39c65e5a
2021-05-19T13:15:24.914 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 13/64: a5b4b7c0456eea6711776dd7781a828bac626e49f8f5f3083e031e456ffea0c7
2021-05-19T13:15:25.875 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 14/64: 2971d057ac883958a8e873d06d507ff66c13b58b30bc7a20e08014e2fb4dce10
2021-05-19T13:15:34.107 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 15/64: 520b8c1cc85449fc82351528e63e232f8d6bd85dc58a6a1abb11d486ea3b3f8f
2021-05-19T13:15:43.902 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 16/64: a0aa4580f66ee139c222297acd04f618bc1f5cfaa7aab75d76f63237a2ecdb7f

Chiadog log from same time:

[2021-05-19 13:15:24] [    INFO] --- Detected 1 skipped signage point. (non_skipped_signage_points.py:46)
[2021-05-19 13:15:24] [ WARNING] --- Experiencing networking issues? Skipped 2+ signage points in the last hour (non_skipped_signage_points.py:53)
[2021-05-19 13:15:25] [    INFO] --- Last farming event was 32 seconds ago. This is unusual (time_since_last_farm_event.py:47)

Notice signage point 12 that is logged before 11 completes.

Environment:

martomi commented 3 years ago

I cannot reproduce this on 0.5.1.

The logs posted in the issue description above are not triggering a skipped signage point event.

That can be quickly verified by anyone, if you replace the content of the scrambled.txt test case and re-run the unit tests.

For your reference, the test is checking that none of the debug log events in that log are generating any events:

https://github.com/martomi/chiadog/blob/67328d7960608179f8a4d9858a70955de29932f9/tests/chia_log/handlers/test_finished_signage_point_handler.py#L59

I've gotten many of these notifications today, but when checking the logs, I can find actual skips - not false alarms from chiadog. Most recent one being:

2021-05-20T19:53:28.925 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 62/64:
2021-05-20T19:53:38.107 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 63/64: 
2021-05-20T19:53:59.905 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 1/64: 
2021-05-20T19:54:08.305 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 2/64:
martomi commented 3 years ago

I found a case from earlier today when I got notification and also reproduces the issue with the test:

2021-05-20T18:43:34.798 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 8/64:
2021-05-20T18:43:42.919 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 9/64: 
2021-05-20T18:43:51.653 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 10/64: 
2021-05-20T18:44:10.151 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 12/64: 
2021-05-20T18:44:11.085 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 11/64: 
2021-05-20T18:44:19.641 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 13/64: 

Not sure this should be classified as a bug however. Signage points should arrive every ~9 seconds. SP-11 was supposed to arrive by 18:44:01. It didn't arrive. It was skipped and SP-12 arrived as expected 9 seconds later.

The fact that we also got SP-11 together (or even after SP-12) doesn't mean it wasn't skipped. Question: does the consensus allow processing and responding to signage point challenges out of order? Even if the farmer responds to a challenge for SP-11, it'll be already at least 10 seconds late. So from my point of view it's as good as skipped.

What we can certainly do is to further throttle down the notifications for these occasional events however.

djdookie commented 3 years ago

I am also getting a lot of warnings about skipped SP, but I can't find them in the log, so I guess I got the same problem. Where can I find the exact mentioned SP and timestamps for those?

Imho skipped and late SPs are two different things. What's annoying is, that I get multiple messages per hour via telegram about this issue. The warning says "Experiencing networking issues? Skipped 2+ signage points in the last hour. (non_skipped_signage_points.py:53)", but I get it several times per hour, every hour. Monitoring should inform about wrong things you can and should fix, not spam you with things that seem to be normal in the current chia network status. ;)

I propose warning thresholds should be configurable and out of order SPs should not be a warning or at least also configurable if it's a warning or not.

And btw Chia 1.1.6 was released today, which fixes some signate point and caching stuff. https://github.com/Chia-Network/chia-blockchain/commit/1c808b6c2910ed32fdbfdfc576ba1bc5a5adeac9 (But I still get multiple warnings per hour from chiadog.)

Further: great tool, thanks a lot for this!

azaryc2m commented 3 years ago

The fact that we also got SP-11 together (or even after SP-12) doesn't mean it wasn't skipped. Question: does the consensus allow processing and responding to signage point challenges out of order? Even if the farmer responds to a challenge for SP-11, it'll be already at least 10 seconds late. So from my point of view it's as good as skipped.

To my understanding there is nothing that would forbid signing points out of order, since the P2P-Network is not reliable in this way. Also i don't think those have references to the previous SPs, so signing SP 5/64 first and then SP 4/64 should not be a problem consensus-wise. As far as the time you have for the SP goes, to my knowledge you have like 30sec, and since SPs come in around 8-10sec, you should be able to sign them out of order in the appropriate timeslot, but it's less likely that you will make it in time, yeah.

martomi commented 3 years ago

I'm considering removing these checks from chiadog at least for the time being. The harvester monitoring already does 99% of the important checks. Maybe will add a configuration with default state disabled.

azaryc2m commented 3 years ago

I'm considering removing these checks from chiadog at least for the time being. The harvester monitoring already does 99% of the important checks. Maybe will add a configuration with default state disabled.

I would keep the checks and still count the number of SPs missed, but stop notifying if you miss some. This will happen regardless of how good the checks are implemented, since...well...it's the internet and P2P on top of that :) What i think is valuable to keep is the % of missed SPs on the daily report. I hope that stays.

martomi commented 3 years ago

Yes, I totally agree on that approach. Maybe also keep the alert notifications but only for >5% of skips over the past hour and not for individual occurrences 😄

azaryc2m commented 3 years ago

Yes, I totally agree on that approach. Maybe also keep the alert notifications but only for >5% of skips over the past hour and not for individual occurrences

I'd love to see that % value configurable, with the default set to lets say something between 2-5%

djdookie commented 3 years ago

Thanks for this not being a warning event anymore!

This just triggered a "skipped" log info in v0.6.0 in my farmer log:

2021-05-26T02:53:41.229 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 34/64: 
2021-05-26T02:54:00.457 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 36/64: 
2021-05-26T02:54:00.631 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 35/64: 
2021-05-26T02:54:09.634 full_node chia.full_node.full_node: INFO     ⏲️  Finished signage point 37/64: 

Detected 1 skipped signage point.This is expected to happen occasionally and not a reason for concern. (non_skipped_signage_points.py:39) (A missing space btw. ;))

But why is a cached and instantly catched up signage point still a "skipped" signage point? You could log a more precise info like "Finished (cached) signage point out of order, but nothing missed"?

Great work!

martomi commented 3 years ago

Thanks for the feedback. This is a limitation of the current implementation which only looks at the current state and history. Fixing that would add substantial complexity (and potential bugs) since it requires also looking into the future - and it's not clear how much into the future is sufficient - sometimes it's only 1 SP, sometimes 10+ SPs (during soft-fork). One way would be to verify all signage points in chunks of 64, and waiting to see if all 64/64 SPs were received, though that also introduces interesting edge cases with soft-forks in-between these "chunks".

It's not something I intend to fix because the impact is small and time investment is big. It could be a fun challenge to handle all scenarios properly, so if anyone has the time and willingness to look into it, contributions are welcome :-)

djdookie commented 3 years ago

I agree this is not worth the additional effort.