I've talked with @Swader about how to collect and publicly publish metrics available in beacon node logs. Currently the logs are being collected in our ES cluster and can be queried to collect information we need.

What he proposed is publishing JSON files with the data on a public HTTPS endpoint. Example: https://gist.github.com/Swader/5aecd54c27b46e175c8c1954c48cd30a
What kind of metrics would we want to publish? Can you give me some examples @Swader? Maybe an example query for ES? Based on your example gist you want the metrics to be per-host. Should they also be per beacon node? Or should that be aggregated per-host?
Per node would be best, I assumed nodes would be named like the main key in that example.
So we're not publishing metrics per se - for metrics, we'll develop something more proper later on, and we do have Grafana plugged into the infra so we can play there. What I'm hoping to accomplish with this is to have a regularly updated endpoint that's zero-impact on infra but which I can use as a snapshot of the latest state for each node on the infra, in order to display it on a website as in:
- Testnet Status: Live ✔
- Latest Epoch / Slot / Block: ZZZ / XXXX / YYYY (_What does this mean?_)
- Peers connected to master bootstrap: YY
- Validators at work: ZZZZ
- Committee size: XZY
This can then be turned into badges or whatnot, showing our current status in a prettier way and letting people check what our situation is.
So this is nothing mission critical, more cosmetic, but would be nice to have.
Oh, so this isn't supposed to be an aggregation of state over a period of time, but rather for publishing current state of the nodes. Makes sense. Thanks!
Okay, so let's talk metrics. What specific metrics do you want to collect from the logs we save in ES? You listed some stuff:
- Latest Epoch / Slot / Block: ZZZ / XXXX / YYYY
- Peers connected to master bootstrap: YY
- Validators at work: ZZZZ
- Committee size: XZY
Which values in the logs represent these?
I don't want you to waste your time on that, I'll deal with parsing those values out of the logs. All I would like you to do is create some kind of infra script which would take the last instance of each of a few specified messages, e.g.:

- Fork chosen
- Attestation received
- Slot start
- ...

and write them to a single file like `/output.json`, so that the JSON objects list the last instance of each specified message per node on the server in a single location. Possible?
The final output I'm after is:
```
{
  "node01-on-master-or-some-other-identifier": {
    "Looking for peers": {whatever: data, the: message, has: for_us},
    "Slot start": {whatever: data, the: message, has: for_us},
    ...
  },
  "node02-on-master-or-some-other-identifier": {
    "Looking for peers": {whatever: data, the: message, has: for_us},
    "Slot start": {whatever: data, the: message, has: for_us},
    ...
  }
}
```
The `{whatever: data, the: message, has: for_us}` part would be replaced by the actual log in JSON format, as extracted from ES or wherever it's most convenient to extract from.
On the client side I will then extract these metrics by parsing them - I'll make a simple JS snippet parsing that endpoint and extracting whatever content I need to actually get a badge like "Current slot: 8900" etc.
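To make the badge idea concrete, here's a minimal sketch of that client-side parsing, written in Python simply to keep one language across the examples in this thread (the real snippet would be JS on the website). The endpoint URL is the one that shows up later in the thread, and the flat node → message shape is the one proposed above, so treat both as assumptions:

```python
# Hypothetical badge-value extraction; assumes the flat { node: { message: log } }
# shape proposed above and the nimbus-test-stats endpoint mentioned further down.
import json
import urllib.request

ENDPOINT = 'https://nimbus-test-stats.status.im/'

with urllib.request.urlopen(ENDPOINT) as resp:
    stats = json.load(resp)

# Take the newest slot from the first node that has a "Slot start" entry.
for node, messages in stats.items():
    slot_start = messages.get('Slot start')
    if slot_start:
        print('Current slot: {}'.format(slot_start['slot']))
        break
```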
So I made a simple script in Python to collect these logs for every node - all ~290 of them - and it's looking bad. It takes ~1 minute to query for all of them:

```
./collect.py 0.49s user 0.06s system 0% cpu 59.446 total
```
And the ES hosts start sweating a bit during those queries. I'm going to try to improve this with a multi query search, but we'll see.
Example result from the script: https://gist.github.com/jakubgs/815e9626cc96169b9d90aea208558dab
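For reference, a rough sketch of what such a per-node, per-message query loop might look like. This is not the actual `collect.py`; the ES host, index pattern, and field names are assumptions based on the Kibana query shown further down:

```python
# Hypothetical naive variant: one ES query per (node, message type) pair,
# which is what makes the run take ~1 minute for ~290 combinations.
from elasticsearch import Elasticsearch

es = Elasticsearch(['https://elastic.example.org:9200'])  # placeholder host

def last_message(program, logsource, message):
    # Fetch only the newest hit for this service/host/message combination.
    resp = es.search(
        index='logstash-*',        # assumed index pattern
        size=1,
        sort='@timestamp:desc',
        body={'query': {'bool': {'must': [
            {'match_phrase': {'program':   program}},
            {'match_phrase': {'logsource': logsource}},
            {'match_phrase': {'message':   message}},
        ]}}},
    )
    hits = resp['hits']['hits']
    return hits[0]['_source'] if hits else None
```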
That seems like a lot of nodes. Are we really running that many? How many per testnet?
`1*(2+5) + 9*(10) = 97`
True, that is a bit much. I'll check that after lunch. Also, specifying a timestamp range could improve the query performance.
That's way too much I think, yeah.
You can limit timestamp to the last 5 minutes. All messages should have appeared in a 5 minute timeframe.
Ah I see, `97` was multiplied by `3` message types, which gives `291`. So yeah, that adds up to `291` unique queries, which takes a minute.
Doing `{ 'range': { '@timestamp': { 'gt': 'now-5m' } } }` doesn't seem to change the performance much. It still takes around 40-45 seconds.
I'm starting to think it would make more sense to make one big query to get all logs for `beacon-node-*` services in the past 5 minutes, and then filter that in Python (rough sketch below).
Do you care if it's:

- host1
  - node1
    - msg1
    - msg2
  - node2
  - ...
- host2
  - node1
- ...

or

- node1@host1
  - msg1
  - msg2
- node2@host1
  - msg1
- ...

?
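A hedged sketch of the "one big query, filter in Python" idea mentioned above; the index pattern, the wildcard on `program.keyword`, and the field mappings are all assumptions, so treat it as pseudo-real code rather than the actual script:

```python
# Hypothetical combined query: everything logged by beacon-node-* services in the
# last 5 minutes, fetched once and reduced to the newest hit per node/message pair.
from collections import defaultdict
from elasticsearch import Elasticsearch

es = Elasticsearch(['https://elastic.example.org:9200'])  # placeholder host
MESSAGES = ['Fork chosen', 'Attestation received', 'Slot start']

resp = es.search(
    index='logstash-*',   # assumed index pattern
    size=10000,           # ES caps a single page of results at 10k hits
    body={'query': {'bool': {
        # Assumes the default logstash mapping with a .keyword sub-field.
        'must':   [{'wildcard': {'program.keyword': 'docker/beacon-node-*'}}],
        'should': [{'match_phrase': {'message': m}} for m in MESSAGES],
        'minimum_should_match': 1,
        'filter': [{'range': {'@timestamp': {'gt': 'now-5m'}}}],
    }}},
)

latest = defaultdict(dict)
for hit in resp['hits']['hits']:
    src = hit['_source']
    node = '{}@{}'.format(src['program'], src['logsource'])
    # Work out which of the wanted message types this log line belongs to.
    msg = next((m for m in MESSAGES if m in src.get('message', '')), None)
    if msg is None:
        continue
    # Keep only the newest log per node/message combination
    # (ISO timestamps in the same timezone compare lexicographically).
    if msg not in latest[node] or src['@timestamp'] > latest[node][msg]['@timestamp']:
        latest[node][msg] = src
```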
Okay, I've arrived at a query for all 3 messages that takes 2 seconds, so that's nice, but for some nodes I still don't get any `Attestation received` messages, despite using the `now-10m` timestamp range.
Example:
```
{
  "docker/beacon-node-rlpx-testnet1-0@node-06.do-ams3.nimbus.test": {
    "Fork chosen": {
      "lvl": "INF",
      "ts": "2019-04-17 14:18:58+00:00",
      "msg": "Fork chosen",
      "tid": 1,
      "newHeadBlockRoot": "E3506242",
      "newHeadEpoch": 269,
      "newHeadSlot": 4304
    },
    "Slot start": {
      "lvl": "DBG",
      "ts": "2019-04-17 14:18:58+00:00",
      "msg": "Slot start",
      "tid": 1,
      "lastSlot": 17821,
      "scheduledSlot": 17822,
      "slot": 17956
    }
  }
}
```
For example, look at this query:

```
program:"docker/beacon-node-rlpx-testnet1-1" AND
logsource:"node-09.do-ams3.nimbus.test" AND
message:"Fork chosen"
```
https://kibana.status.im/goto/77c137797574913467fb6d99d4fca3d8 There is not a single result for the past 30 minutes for that query. Is that normal?
There are `Fork chosen` logs for that node if I look at 4 hours, but they stopped showing up about 2 hours ago. Other logs are just fine. Can you explain this behavior to me @Swader?
Not sure what I'm supposed to do with that. Should I just leave the JSON without that message if it hasn't happened in the last `N` minutes? Or should I increase the range and query again? Or should I leave a `null` placeholder under that key?
Hmm. That's a good question, I have no immediate answer. I'll get back to you on that. I do not care if it's node1@host1, whatever is easier, as long as it's one endpoint per testnet.
Note that the three message types are examples - in total there will be about 10 or so messages to filter for.
If you keep this method, which is faster, and increase time range to 4 hours, does it still work fine?
Okay, I have the initial working version: https://github.com/status-im/infra-nimbus/commit/9fcbc601 Result:
```
$ curl -s https://nimbus-test-stats.status.im/ | head -n 40
{
  "node-04.do-ams3.nimbus.test": {
    "docker/beacon-node-rlpx-testnet1-1": {
      "Fork chosen": {
        "lvl": "INF",
        "ts": "2019-04-17 16:55:07+00:00",
        "msg": "Fork chosen",
        "tid": 1,
        "newHeadBlockRoot": "EF7AF756",
        "newHeadEpoch": 1113,
        "newHeadSlot": 17816
      },
      "Attestation received": {
        "lvl": "DBG",
        "ts": "2019-04-17 16:55:21+00:00",
        "msg": "Attestation received",
        "tid": 1,
        "attestationData": {
          "slot": 18269,
          "beacon_block_root": "CDC54F7F",
          "source_epoch": 0,
          "target_root": "CDC54F7F",
          "source_root": "00000000",
          "shard": 13,
          "previous_crosslink_epoch": 0,
          "previous_crosslink_data_root": "00000000",
          "crosslink_data_root": "00000000"
        },
        "signature": "8a5e8ef9"
      },
      "Slot start": {
        "lvl": "DBG",
        "ts": "2019-04-17 16:55:30+00:00",
        "msg": "Slot start",
        "tid": 1,
        "lastSlot": 18269,
        "scheduledSlot": 18270,
        "slot": 18270
      }
    },
```
I'll be polishing this tomorrow.
Excellent, looking good! Now if you could have it read the messages I wish to extract from a file so that I can configure that myself as the messages change and new ones are added, that'd be perfect!
> Now if you could have it read the messages I wish to extract from a file so that I can configure that myself as the messages change and new ones are added, that'd be perfect!
Defaults: https://github.com/status-im/infra-nimbus/blob/9fcbc6014f49997fd03247bfff7b12c9844b8740/ansible/roles/nimbus-stats/files/collect.py#L53-L57 Option: https://github.com/status-im/infra-nimbus/blob/9fcbc6014f49997fd03247bfff7b12c9844b8740/ansible/roles/nimbus-stats/files/collect.py#L74-L75
But I haven't tested it yet. I might have to add a callback to split the provided options.
I've made it so that you can add more types of messages by adding more `-m` options. But the simplest way is to just modify the `DEFAULT_MESSAGES` list.
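For reference, the `-m` handling could look roughly like this; a sketch, not the exact code in `collect.py`:

```python
# Hypothetical argument handling: repeatable -m flags, falling back to DEFAULT_MESSAGES.
import argparse

DEFAULT_MESSAGES = [
    'Fork chosen',
    'Attestation received',
    'Slot start',
]

def parse_args():
    parser = argparse.ArgumentParser(description='Collect the latest beacon node log messages.')
    parser.add_argument('-m', '--message', action='append', dest='messages',
                        help='message type to collect; repeat the flag to add more')
    args = parser.parse_args()
    # Fall back to the built-in list when no -m flags were given.
    if not args.messages:
        args.messages = DEFAULT_MESSAGES
    return args
```

So something along the lines of `./collect.py -m 'Fork chosen' -m 'Block received'` would override the defaults.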
I also improved the query to do one for all types of messages: https://github.com/status-im/infra-nimbus/commit/3035b457b1925117f3f3a7e835604ab91a51a136
But even with that I feel like there is room for improvement. We are pulling 10k results just to use a tiny fraction of that, kinda ugly and not very efficient.
Agreed. Let's see how it performs. I'm noticing only one message from t0 and that's outdated, anything going wrong there? We just restarted it but at this point that one message is a few hours old. Everything else is t1.
I'll check it tomorrow, out today.
what's the plan for this json? anything that grafana can't be used for?
the alternative would be that we intro a separate metrics framework, like prometheus or graphite, should be fairly simple to add to the app
These metrics are already available in Grafana.
> These metrics are already available in Grafana.
yes, though I'm curious why the custom collector script and json output.. point is, we can make life easier for this by changing the application as well, so we have something that's scalable..for example for those queries, perhaps there's some data we can add that will allow ES to index better or whatever.
The point is public read-only output that can serve as flair on posts, docs, READMEs and elsewhere. Bragging badges of the testnet's status, capable of also alerting us live if a testnet is down from the service's perspective. We can use grafana etc, but I don't know how feasible it is to make that into a public endpoint that could basically give us lightweight network stats like these, as per brainstorming above. If you have other ideas on how to handle this, please, I do agree this is clumsy.
So we've seen our ES nodes acting up yesterday due to the queries we've been making in Grafana: https://grafana.status.im/d/ogwyAVemk/nimbus?orgId=1 The issue is that the size of the indices in ES is getting out of hand. For the last 5 days the size of the indices has been ~13 GB, which is not sustainable. Not because of lack of disk space, but because the bigger the indices, the more stuff ES has to load into memory, which requires more and more and more RAM. At this point we'll need 32 GB of RAM on each of the 3 ES nodes we have just to query the last two weeks.
This is not sustainable and we need a better solution for this. The `beacon-node` software needs to have a proper way of exporting metrics so we can collect them in Prometheus, as they should be. For example, the way we do it for `status-go` is through the `geth_exporter` package. It essentially calls the JSON-RPC API of `status-go` and then exposes the metrics it fetches that way through an HTTP `/metrics` endpoint that Prometheus can query.

We need a solution like that for `beacon-node`, because using logs for collecting metrics is bad and will not work at a bigger scale or with wider time ranges.
You can read more about how a `/metrics` endpoint should work in:
https://prometheus.io/docs/instrumenting/writing_exporters/
This really shouldn't be that difficult, whether it's added within `beacon-node` or as separate software we have to run. For any numeric value we want to track, this is how we should do it. And string values can be added to metrics through labels.
For example:
```
nimbus_some_topic_metric{topic="discovery", network="testnet0", protocol="rlpx"} 666
```
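To make the label idea concrete, here's a minimal sketch using the Python `prometheus_client` library; the metric name and labels are just the example above, and the actual exporter for `beacon-node` may of course be implemented very differently:

```python
# Hypothetical illustration of the /metrics model: numeric values as gauges,
# string dimensions as labels, served over HTTP for Prometheus to scrape.
import time

from prometheus_client import Gauge, start_http_server

SOME_TOPIC_METRIC = Gauge(
    'nimbus_some_topic_metric',
    'Example metric with string dimensions expressed as labels',
    ['topic', 'network', 'protocol'],
)

if __name__ == '__main__':
    start_http_server(8000)  # placeholder port
    SOME_TOPIC_METRIC.labels(topic='discovery', network='testnet0', protocol='rlpx').set(666)
    # Block forever so the HTTP endpoint stays up for scraping.
    while True:
        time.sleep(60)
```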
And for anything else that is more verbose, like for example the `Connecting to bootstrap nodes` message, which includes a `bootstrapNodes` field containing an `enode` address, we can collect those via logs just fine. If the volume is reduced by properly exporting metrics as metrics, we can do that too, without having to resort to ridiculous amounts of `DEBUG` logs.
BTW, I've disabled `DEBUG` logs on the `nimbus.test` fleet for now, because this has gotten out of control and Igor was swarmed by alerts about ES hosts failing most of the day due to this. I'm also cleaning up ES indices by removing older Nimbus logs.
hm, yeah, cleanup ok, but there's utility for us to have the debug logs.. ie better to have debug logs that get cleaned up aggressively (weekly?) than have crippled logs forever.
> This is not sustainable and we need a better solution for this.
this would be best tracked in the (public) nimbus repo I think - in particular so we can have a discussion about the shape of the support (push vs pull etc)
I've left most of April's logs, but as I said, there is a difference between logs and metrics. I have no issue with keeping debug logs if they are really logs, not metrics disguised as logs. As I said in my other post:
> And for anything else that is more verbose, like for example the `Connecting to bootstrap nodes` message, which includes a `bootstrapNodes` field containing an `enode` address, we can collect those via logs just fine. If the volume is reduced by properly exporting metrics as metrics, we can do that too, without having to resort to ridiculous amounts of `DEBUG` logs.
If we remove the majority of those logs by moving most of the stuff to a proper metrics setup, then I see no issue with collecting real DEBUG logs. But there needs to be a clear distinction. Most of the stuff right now in the logs is just metrics.
they're logs. we're extracting some from them but they're definitely not metrics. the data is needed to.. er.. debug things after the fact, which is why we want the debug logs too.
What about messages like `Unresolved block` and `Block received`? Could they be turned into just metrics? I see they have string values like `previous_block_root` or `state_root`, which appear to be addresses, but couldn't those be represented, for example, as a number relative to something?

If that were possible we could reduce the number of logs by ~70%.
These are important checkpoints along the block processing pipeline that we use to reconstruct what happened in a tricky scenario - the roots allow us to establish the chain of events and spot reorderings and the like..
eventually we'll go down to INF and better broadcasts, and a lot of it will go away (ie once we run this for real, in production)
Okay, I've added a 4th node to the ElasticSearch cluster: `node-04.do-ams3.log-store.hq`

This should lessen the load on the others when querying and reduce the size of shards:

Changes:

Hopefully this will keep the ES cluster alive while I'm away during the first week of May. If it's not enough we'll need to bump instance types on all 4 from `s-2vcpu-4gb` to `s-4vcpu-8gb`.
It might take a while for shards to start spreading but it should help: https://es.status.im/
With help from @zah, we now have proper Prometheus metrics configured in:
The metrics look like this:
```
% curl -s "http://localhost:9200/metrics" | grep '^beacon_'
beacon_current_live_validators 0.0
beacon_current_live_validators_created 1572608831.0
beacon_previous_live_validators 0.0
beacon_previous_live_validators_created 1572608831.0
beacon_pending_deposits 0.0
beacon_pending_deposits_created 1572608831.0
beacon_processed_deposits_total 192.0
beacon_processed_deposits_total_created 1572608831.0
beacon_finalized_epoch 0.0
beacon_finalized_epoch_created 1572608831.0
beacon_finalized_root 0.0
beacon_finalized_root_created 1572608831.0
beacon_current_justified_epoch 0.0
beacon_current_justified_epoch_created 1572608831.0
beacon_current_justified_root 0.0
beacon_current_justified_root_created 1572608831.0
beacon_previous_justified_epoch 0.0
beacon_previous_justified_epoch_created 1572608831.0
beacon_previous_justified_root 0.0
beacon_previous_justified_root_created 1572608831.0
beacon_current_validators 192.0
beacon_current_validators_created 1572608831.0
beacon_previous_validators 192.0
beacon_previous_validators_created 1572608831.0
beacon_previous_epoch_orphaned_blocks 0.0
beacon_previous_epoch_orphaned_blocks_created 1572608831.0
beacon_reorgs_total_total 0.0
beacon_reorgs_total_created 1572608831.0
beacon_slot 8332.0
beacon_slot_created 1572608831.0
beacon_head_slot 8332.0
beacon_head_slot_created 1572608831.0
beacon_head_root 1.157048430285501e+17
beacon_head_root_created 1572608831.0
beacon_pending_exits 0.0
beacon_pending_exits_created 1572608831.0
beacon_attestations_sent_total 173.0
beacon_attestations_sent_created 1572608831.0
beacon_attestations_received_total 1659.0
beacon_attestations_received_created 1572608831.0
beacon_blocks_proposed_total 12.0
beacon_blocks_proposed_created 1572608831.0
beacon_blocks_received_total 836.0
beacon_blocks_received_created 1572608831.0
```
And now we have a proper metrics dashboard: https://grafana.status.im/d/lXQ3tP0Zz/nimbus-metrics Though it probably needs some work, as I don't really understand these metrics. Some descriptions would be nice too.