Context
During the last upgrade, the consensus telemetry received neither the CommunicatorConsumer events nor the FinalizationConsumer events.
I tracked this problem back to the TelemetryConsumer not being subscribed to sections of the distributor (see PR #4518 for the fix).
Resulting Problem
Metrika's node agent is ingesting a variety of Telemetry events (at the moment, they do this by scraping the logs for entries with the hotstuff.telemetry keyword):
OnFinalizedBlock
OnBlockIncorporated
OnCurrentViewDetails
OnOwnProposal
OnOwnVote
OnVoteProcessed
We have unit tests that verify that the respective components correctly publish those events by handing them to the injected pub-sub distributor. However, we have no integration tests that verify that these events are correctly propagated to the TelemetryConsumer.
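To illustrate the gap, here is a minimal sketch of the failure mode. All types and names below (FinalizationDistributor, CommunicatorDistributor, recordingConsumer, deliveredEvents) are simplified, hypothetical stand-ins and not the actual flow-go API. The point is that publishing can work perfectly while a consumer that was never subscribed to one section of the distributor silently receives nothing from it:

```go
package main

import "fmt"

// Hypothetical, simplified consumer interfaces. The real interfaces in the
// code base have more methods and different signatures.
type FinalizationConsumer interface {
	OnFinalizedBlock(blockID string)
}

type CommunicatorConsumer interface {
	OnOwnVote(blockID string)
}

// FinalizationDistributor fans finalization events out to its subscribers.
type FinalizationDistributor struct{ subscribers []FinalizationConsumer }

func (d *FinalizationDistributor) AddConsumer(c FinalizationConsumer) {
	d.subscribers = append(d.subscribers, c)
}

func (d *FinalizationDistributor) OnFinalizedBlock(blockID string) {
	for _, s := range d.subscribers {
		s.OnFinalizedBlock(blockID)
	}
}

// CommunicatorDistributor fans communicator events out to its subscribers.
type CommunicatorDistributor struct{ subscribers []CommunicatorConsumer }

func (d *CommunicatorDistributor) AddConsumer(c CommunicatorConsumer) {
	d.subscribers = append(d.subscribers, c)
}

func (d *CommunicatorDistributor) OnOwnVote(blockID string) {
	for _, s := range d.subscribers {
		s.OnOwnVote(blockID)
	}
}

// recordingConsumer plays the role of the TelemetryConsumer: it records the
// name of every event it receives.
type recordingConsumer struct{ events []string }

func (r *recordingConsumer) OnFinalizedBlock(string) { r.events = append(r.events, "OnFinalizedBlock") }
func (r *recordingConsumer) OnOwnVote(string)        { r.events = append(r.events, "OnOwnVote") }

// deliveredEvents wires up both distributor sections, publishes one event on
// each, and returns what the consumer actually received.
func deliveredEvents(subscribeCommunicator bool) []string {
	fin := &FinalizationDistributor{}
	comm := &CommunicatorDistributor{}
	telemetry := &recordingConsumer{}

	fin.AddConsumer(telemetry)
	if subscribeCommunicator {
		comm.AddConsumer(telemetry)
	}

	fin.OnFinalizedBlock("block-1")
	comm.OnOwnVote("block-1")
	return telemetry.events
}

func main() {
	fmt.Println("partially subscribed:", deliveredEvents(false))
	fmt.Println("fully subscribed:    ", deliveredEvents(true))
}
```

A unit test on the publishing component passes in both cases, because the event is handed to the distributor either way. Only a test that inspects what the consumer received can tell the two wirings apart.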
Goal:
We would like to have an integration test that verifies these events are successfully received by the TelemetryConsumer.
In the tests, please explicitly highlight that those events are ingested by Metrika and Metrika should be notified if the events change in their structure or naming.
Ideally, we would do a string matching to make sure the keywords that Metrika scrapes for are found in the log messages:
we expect hotstuff.telemetry to be part of the log string
the event name should be part of the log string (e.g. OnFinalizedBlock)
I think it is enough to verify that those events are emitted at all for a system that has finalized a few blocks. In my opinion, this would already catch the vast majority of bugs.
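The string matching described above could look roughly like the following sketch. The helper name missingTelemetryEvents and the sample log lines are made up for illustration; the real log format may differ, and capturing the log output from a running consensus instance is left out here. The keyword and the event names are the ones listed in this issue:

```go
package main

import (
	"fmt"
	"strings"
)

// NOTE: the keyword and event names below are ingested by Metrika.
// Metrika should be notified before any of them is renamed or restructured.
const telemetryKeyword = "hotstuff.telemetry"

var metrikaEvents = []string{
	"OnFinalizedBlock",
	"OnBlockIncorporated",
	"OnCurrentViewDetails",
	"OnOwnProposal",
	"OnOwnVote",
	"OnVoteProcessed",
}

// missingTelemetryEvents returns every expected event name that does not
// appear in at least one log line that also contains the telemetry keyword.
// An empty result means all events were observed.
func missingTelemetryEvents(logLines []string) []string {
	seen := make(map[string]bool)
	for _, line := range logLines {
		if !strings.Contains(line, telemetryKeyword) {
			continue // not a telemetry entry, so Metrika would not pick it up
		}
		for _, event := range metrikaEvents {
			if strings.Contains(line, event) {
				seen[event] = true
			}
		}
	}

	var missing []string
	for _, event := range metrikaEvents {
		if !seen[event] {
			missing = append(missing, event)
		}
	}
	return missing
}

func main() {
	// Made-up log lines; in the real test these would be captured from a
	// system that has finalized a few blocks.
	logs := []string{
		"INF hotstuff.telemetry OnFinalizedBlock view=12",
		"INF hotstuff.telemetry OnOwnVote view=13",
		"INF compliance OnVoteProcessed view=13", // lacks the keyword, ignored
	}
	fmt.Println("missing events:", missingTelemetryEvents(logs))
}
```

In an integration test, the assertion would simply be that the returned slice is empty once a few blocks have been finalized.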
My thoughts on priority and time investment:
Generally we don't change consensus very often. Hence, my gut feeling is that breaking changes are somewhat rare.
I don't know how complex it will be to implement a test that inspects the logs. Ideally, we desire an end-to-end test. If that is super time consuming, we could postpone implementing this issue for now and hope that there aren't many changes on the consensus parts that could break the telemetry 😅