status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team

A way to collect metrics from Nimbus beacon nodes #1

Closed: jakubgs closed this issue 5 years ago

jakubgs commented 5 years ago

I've talked with @Swader about how to collect and publicly publish metrics available in beacon node logs. Currently the logs are being collected in our ES cluster and can be queried to collect information we need.

What he proposed is publishing JSON files with the data on a public HTTPS endpoint. Example: https://gist.github.com/Swader/5aecd54c27b46e175c8c1954c48cd30a

jakubgs commented 5 years ago

What kind of metrics would we want to publish? Can you give me some examples, @Swader? Maybe an example query for ES? Based on your example gist you want the metrics to be per-host. Should they also be per beacon node, or should that be aggregated per host?

Swader commented 5 years ago

Per node would be best, I assumed nodes would be named like the main key in that example.

So we're not publishing metrics per se - for metrics, we'll develop something more proper later on, and we do have grafana plugged into the infra so we can play there. What I'm hoping to accomplish with this is to have a regularly updated endpoint that's zero-impact on infra but which I can use as a snapshot of the latest state of each node on the infra, in order to display it on a website as in:

Testnet Status: Live ✔
Latest Epoch / Slot / Block: ZZZ / XXXX / YYYY (_What does this mean?_)
Peers connected to master bootstrap: YY
Validators at work: ZZZZ
Committee size: XZY

This can then be turned into badges or whatnot, showing our current status in a prettier way and letting people check what our situation is.

So this is nothing mission critical, more cosmetic, but would be nice to have.

jakubgs commented 5 years ago

Oh, so this isn't supposed to be an aggregation of state over a period of time, but rather for publishing current state of the nodes. Makes sense. Thanks!

jakubgs commented 5 years ago

Okay, so let's talk metrics. What specific metrics do you want to collect from the logs we save in ES? You listed some stuff:

Latest Epoch / Slot / Block: ZZZ / XXXX / YYYY
Peers connected to master bootstrap: YY
Validators at work: ZZZZ
Committee size: XZY

Which values in the logs represent these?

Swader commented 5 years ago

I don't want you to waste your time on that, I'll deal with parsing those values out of the logs. All I would like you to do is to create some kind of infra script which would:

  1. read a file, e.g. "msgsToKeep.txt", with content like this:
    Fork chosen
    Attestation received
    Slot start
    ...
  2. For each line in that file, find the LAST instance of that message (this is the "message" key of the logs) in each node's logs
  3. Alter some endpoint /output.json so that the JSON objects list the last instance of each specified message per node on the server in a single location
  4. Make it repeat once per minute

Possible?

The final output I'm after is:

{
  "node01-on-master-or-some-other-identifier": {
    "Looking for peers": {whatever: data, the: message, has: for_us},
    "Slot start": {whatever: data, the: message, has: for_us},
    ...
  },
  "node02-on-master-or-some-other-identifier": {
    "Looking for peers": {whatever: data, the: message, has: for_us},
    "Slot start": {whatever: data, the: message, has: for_us},
    ...
  }
}

The {whatever: data, the: message, has: for_us} part would be replaced by the actual log in JSON format as extracted from ES or wherever it's most convenient to extract from.

On the client side I will then extract these metrics by parsing them - I'll make a simple JS snippet parsing that endpoint and extracting whatever content I need to actually get a badge like "Current slot: 8900" etc.
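
Roughly, something along these lines would do (just a sketch to illustrate, assuming the elasticsearch-py client and made-up index/field names like logstash-*, message and logsource; the real script can of course look quite different):

#!/usr/bin/env python3
# Rough sketch of the requested collector. Index/field names are assumptions.
import json
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

def load_messages(path='msgsToKeep.txt'):
    # One log message per line, e.g. "Fork chosen", "Slot start".
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def last_instance(node, message):
    # Newest log entry with this exact message for this node, or None.
    resp = es.search(index='logstash-*', body={
        'size': 1,
        'query': {'bool': {'must': [
            {'match_phrase': {'logsource': node}},
            {'match_phrase': {'message': message}},
        ]}},
        'sort': [{'@timestamp': {'order': 'desc'}}],
    })
    hits = resp['hits']['hits']
    return hits[0]['_source'] if hits else None

if __name__ == '__main__':
    # The node list would come from inventory; run this from cron once per
    # minute and write output.json where the web server can serve it.
    nodes = ['node-01.example', 'node-02.example']  # placeholder
    output = {node: {msg: last_instance(node, msg)
                     for msg in load_messages()}
              for node in nodes}
    with open('output.json', 'w') as f:
        json.dump(output, f, indent=2)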

jakubgs commented 5 years ago

So I made a simple script in Python to collect these logs for every node - all ~290 of them - and it's looking bad. It takes ~1 minute to query for all of them:

./collect.py  0.49s user 0.06s system 0% cpu 59.446 total

And the ES hosts start sweating a bit during those queries. I'm going to try to improve this with a multi query search, but we'll see.

jakubgs commented 5 years ago

Example result from the script: https://gist.github.com/jakubgs/815e9626cc96169b9d90aea208558dab

Swader commented 5 years ago

That seems like a lot of nodes. Are we really running that many? How many per testnet?

jakubgs commented 5 years ago

https://github.com/status-im/infra-nimbus/blob/9ec6d1b4a98a5fef96003cbda79a678aa59065aa/ansible/main.yml#L7-L28 So:

1*(2+5) + 9*(10) = 97

True, that is a bit much. I'll check that after lunch. Also, specifying a timestamp range could improve the query performance.

Swader commented 5 years ago

That's way too much I think, yeah.

You can limit timestamp to the last 5 minutes. All messages should have appeared in a 5 minute timeframe.

jakubgs commented 5 years ago

Ah, I see: the 97 nodes get multiplied by 3 message types, which gives 291. So yeah, that adds up to 291 unique queries, which is what takes a minute.

jakubgs commented 5 years ago

Doing { 'range': { '@timestamp': { 'gt': 'now-5m' } } } doesn't seem to change the performance much. It takes around 40-45 seconds.

I'm starting to think it would make more sense to make one big query to get all logs for beacon-node-* services in the past 5 minutes, and then filter that in Python.
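
Something along these lines (a sketch; the index name and the program/logsource/message field names are assumptions about our mapping): pull everything for beacon-node-* from the last 5 minutes in one query, then keep only the newest hit per node and message in Python.

# Sketch: one broad query for all beacon nodes, then filter in Python.
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])
MESSAGES = ['Fork chosen', 'Attestation received', 'Slot start']

resp = es.search(index='logstash-*', body={
    'size': 10000,
    'query': {'bool': {
        'must': [
            {'match_phrase_prefix': {'program': 'docker/beacon-node-'}},
        ],
        'filter': [
            {'terms': {'message.keyword': MESSAGES}},
            {'range': {'@timestamp': {'gt': 'now-5m'}}},
        ],
    }},
    'sort': [{'@timestamp': {'order': 'asc'}}],
})

# Hits arrive oldest-first, so later ones overwrite earlier ones and each
# slot ends up holding the newest log line for that node/message pair.
latest = {}
for hit in resp['hits']['hits']:
    src = hit['_source']
    key = '{}@{}'.format(src['program'], src['logsource'])
    latest.setdefault(key, {})[src['message']] = src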

jakubgs commented 5 years ago

Do you care if it's:

- host1
   - node1
      - msg1
      - msg2
   - node2
      - ...
- host2
   - node1
...

or

- node1@host1
  - msg1
  - msg2
- node2@host1
  - msg1
...

?

jakubgs commented 5 years ago

Okay, I've arrived at a query for all 3 messages that takes 2 seconds, so that's nice, but for some nodes I still don't get any Attestation received messages, despite using the now-10m timestamp range. Example:

{
  "docker/beacon-node-rlpx-testnet1-0@node-06.do-ams3.nimbus.test": {
    "Fork chosen": {
      "lvl": "INF",
      "ts": "2019-04-17 14:18:58+00:00",
      "msg": "Fork chosen",
      "tid": 1,
      "newHeadBlockRoot": "E3506242",
      "newHeadEpoch": 269,
      "newHeadSlot": 4304
    },
    "Slot start": {
      "lvl": "DBG",
      "ts": "2019-04-17 14:18:58+00:00",
      "msg": "Slot start",
      "tid": 1,
      "lastSlot": 17821,
      "scheduledSlot": 17822,
      "slot": 17956
    }
  }
}

jakubgs commented 5 years ago

For example, look at this query:

program:"docker/beacon-node-rlpx-testnet1-1" AND
logsource:"node-09.do-ams3.nimbus.test" AND
message:"Fork chosen"

https://kibana.status.im/goto/77c137797574913467fb6d99d4fca3d8 There is not a single result for the past 30 minutes for that query. Is that normal?

jakubgs commented 5 years ago

There are Fork chosen logs for that node if I look at the last 4 hours, but they stopped showing up about 2 hours ago. Other logs are just fine. Can you explain this behavior to me, @Swader?

jakubgs commented 5 years ago

Not sure what I'm supposed to do with that. Should I just leave the JSON without that message if it hasn't happened in the last N minutes? Or should I increase the range and query again? Or should I leave a null placeholder under that key?

Swader commented 5 years ago

Hmm. That's a good question, I have no immediate answer. I'll get back to you on that. I do not care if it's node1@host1, whatever is easier, as long as it's one endpoint per testnet.

Note that the three message types are examples - in total there will be about 10 or so messages to filter for.

If you keep this method, which is faster, and increase time range to 4 hours, does it still work fine?

jakubgs commented 5 years ago

Okay, I have the initial working version: https://github.com/status-im/infra-nimbus/commit/9fcbc601 Result:

 $ curl -s https://nimbus-test-stats.status.im/ | head -n 40
{
    "node-04.do-ams3.nimbus.test": {
        "docker/beacon-node-rlpx-testnet1-1": {
            "Fork chosen": {
                "lvl": "INF",
                "ts": "2019-04-17 16:55:07+00:00",
                "msg": "Fork chosen",
                "tid": 1,
                "newHeadBlockRoot": "EF7AF756",
                "newHeadEpoch": 1113,
                "newHeadSlot": 17816
            },
            "Attestation received": {
                "lvl": "DBG",
                "ts": "2019-04-17 16:55:21+00:00",
                "msg": "Attestation received",
                "tid": 1,
                "attestationData": {
                    "slot": 18269,
                    "beacon_block_root": "CDC54F7F",
                    "source_epoch": 0,
                    "target_root": "CDC54F7F",
                    "source_root": "00000000",
                    "shard": 13,
                    "previous_crosslink_epoch": 0,
                    "previous_crosslink_data_root": "00000000",
                    "crosslink_data_root": "00000000"
                },
                "signature": "8a5e8ef9"
            },
            "Slot start": {
                "lvl": "DBG",
                "ts": "2019-04-17 16:55:30+00:00",
                "msg": "Slot start",
                "tid": 1,
                "lastSlot": 18269,
                "scheduledSlot": 18270,
                "slot": 18270
            }
        },

jakubgs commented 5 years ago

I'll be polishing this tomorrow.

Swader commented 5 years ago

Excellent, looking good! Now if you could have it read the messages I wish to extract from a file so that I can configure that myself as the messages change and new ones are added, that'd be perfect!

jakubgs commented 5 years ago

> Now if you could have it read the messages I wish to extract from a file so that I can configure that myself as the messages change and new ones are added, that'd be perfect!

Defaults: https://github.com/status-im/infra-nimbus/blob/9fcbc6014f49997fd03247bfff7b12c9844b8740/ansible/roles/nimbus-stats/files/collect.py#L53-L57
Option: https://github.com/status-im/infra-nimbus/blob/9fcbc6014f49997fd03247bfff7b12c9844b8740/ansible/roles/nimbus-stats/files/collect.py#L74-L75

But I haven't tested it yet. I might have to add a callback to split the provided options.
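
For reference, a sketch of how that option could work with argparse so that no splitting callback is needed (hypothetical names, not necessarily what collect.py ends up doing):

# Sketch: repeated -m flags extend a list, falling back to the defaults.
import argparse

DEFAULT_MESSAGES = ['Fork chosen', 'Attestation received', 'Slot start']

parser = argparse.ArgumentParser(description='Collect last log messages per node.')
parser.add_argument('-m', '--message', dest='messages', action='append',
                    help='log message to look for (can be given multiple times)')
args = parser.parse_args()

# With action='append' each -m adds one entry, and we use the
# defaults if none were given.
messages = args.messages or DEFAULT_MESSAGES
print(messages)

Invoked as e.g. ./collect.py -m "Fork chosen" -m "Block received".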

jakubgs commented 5 years ago

I've made it so that you can add more types of messages by adding more -m options. But the simplest way is to just modify the DEFAULT_MESSAGES list.

I also improved the query to do one for all types of messages: https://github.com/status-im/infra-nimbus/commit/3035b457b1925117f3f3a7e835604ab91a51a136

But even with that I feel like there is room for improvement. We are pulling 10k results just to use a tiny fraction of that, kinda ugly and not very efficient.
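
One possible improvement (just a sketch; the *.keyword field names are assumptions about the mapping) would be to let ES do the grouping itself with a composite aggregation plus top_hits, so only the newest document per node and message comes back instead of 10k raw hits:

# Sketch: ask ES for only the newest hit per (logsource, program, message)
# via a composite aggregation with top_hits, instead of 10k raw documents.
MESSAGES = ['Fork chosen', 'Attestation received', 'Slot start']

query = {
    'size': 0,  # we only care about the aggregation buckets
    'query': {'bool': {
        'must': [{'match_phrase_prefix': {'program': 'docker/beacon-node-'}}],
        'filter': [
            {'terms': {'message.keyword': MESSAGES}},
            {'range': {'@timestamp': {'gt': 'now-5m'}}},
        ],
    }},
    'aggs': {'per_node_message': {
        'composite': {
            'size': 1000,
            'sources': [
                {'logsource': {'terms': {'field': 'logsource.keyword'}}},
                {'program': {'terms': {'field': 'program.keyword'}}},
                {'message': {'terms': {'field': 'message.keyword'}}},
            ],
        },
        'aggs': {'latest': {'top_hits': {
            'size': 1,
            'sort': [{'@timestamp': {'order': 'desc'}}],
        }}},
    }},
}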

Swader commented 5 years ago

Agreed. Let's see how it performs. I'm noticing only one message from t0, and it's outdated; is anything going wrong there? We just restarted it, but at this point that one message is a few hours old. Everything else is t1.

jakubgs commented 5 years ago

I'll check it tomorrow, out today.

arnetheduck commented 5 years ago

what's the plan for this json? anything that grafana can't be used for?

the alternative would be that we intro a separate metrics framework, like prometheus or graphite, should be fairly simple to add to the app

jakubgs commented 5 years ago

These metrics are already available in Grafana.

arnetheduck commented 5 years ago

> These metrics are already available in Grafana.

yes, though I'm curious why the custom collector script and json output.. point is, we can make life easier for this by changing the application as well, so we have something that's scalable..for example for those queries, perhaps there's some data we can add that will allow ES to index better or whatever.

Swader commented 5 years ago

The point is public read-only output that can serve as flair on posts, docs, READMEs and elsewhere. Bragging badges of the testnet's status, capable of also alerting us live if a testnet is down from the service's perspective. We can use grafana etc, but I don't know how feasible it is to make that into a public endpoint that could basically give us lightweight network stats like these, as per brainstorming above. If you have other ideas on how to handle this, please, I do agree this is clumsy.

jakubgs commented 5 years ago

So we've seen our ES nodes acting up yesterday due to the queries we've been making in Grafana: https://grafana.status.im/d/ogwyAVemk/nimbus?orgId=1 The issue is that the size of the indices in ES is getting out of hand. For the last 5 days the size of the indices has been ~13 GB, which is not sustainable. Not because of lack of disk space, but because the bigger the indices, the more stuff ES has to load into memory, which requires more and more RAM. At this point we'll need 32 GB of RAM on each of the 3 ES nodes we have just to query the last two weeks.

This is not sustainable and we need a better solution for this. The beacon-node software needs to have a proper way of exporting metrics so we can collect them in Prometheus as they should be. For example, the way we do it for status-go is through the geth_exporter package. It essentially calls the JSON RPC API of status-go and then exposes the metrics it fetches that way through an HTTP /metrics endpoint that Prometheus can query.

We need a solution like that for beacon-node, because using logs for collecting metrics is bad and will not work at bigger scale or with wider time ranges. You can read more about how a /metrics endpoint should work here: https://prometheus.io/docs/instrumenting/writing_exporters/

This really shouldn't be that difficult, whether it's added within beacon-node or as a separate piece of software we have to run. For any numeric value we want to track, this is how we should do it. And string values can be added to metrics through labels. For example:

nimbus_some_topic_metric{topic="discovery", network="testnet0", protocol="rlpx"} 666
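
For illustration, a metric like that could be exposed with the Python prometheus_client library (a sketch with made-up names and values, not how beacon-node itself would do it):

# Sketch of the exporter pattern using the Python prometheus_client library.
# Metric name, labels and the fetch function are made up for illustration.
import time
from prometheus_client import Gauge, start_http_server

some_topic_metric = Gauge(
    'nimbus_some_topic_metric',
    'Example metric with string values attached as labels',
    ['topic', 'network', 'protocol'],
)

def fetch_value():
    # A real exporter would poll the node's RPC/API here, the way
    # geth_exporter polls status-go over JSON RPC.
    return 666

if __name__ == '__main__':
    start_http_server(9300)  # serves /metrics for Prometheus to scrape
    while True:
        some_topic_metric.labels(
            topic='discovery', network='testnet0', protocol='rlpx',
        ).set(fetch_value())
        time.sleep(15)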

And for anything else that is more verbose, like the Connecting to bootstrap nodes message, which includes a bootstrapNodes field containing an enode address, we can collect those via logs just fine. If the volume is reduced by properly exporting metrics as metrics, we can do that too, without having to resort to ridiculous amounts of DEBUG logs.

jakubgs commented 5 years ago

BTW, I've disabled DEBUG logs on the nimbus.test fleet for now, because this has gotten out of control and Igor was swarmed by alerts about ES hosts failing most of the day due to this. I'm also cleaning up ES indices by removing older Nimbus logs.

arnetheduck commented 5 years ago

hm, yeah, cleanup ok, but there's utility for us to have the debug logs.. ie better to have debug logs that get cleaned up aggressively (weekly?) than have crippled logs forever.

> This is not sustainable and we need a better solution for this.

this would be best tracked in the (public) nimbus repo I think - in particular so we can have a discussion about the shape of the support (push vs pull etc)

jakubgs commented 5 years ago

I've left most of April's logs, but as I said, there is a difference between logs and metrics. I have no issue with keeping debug logs if they are really logs, not metrics disguised as logs. As I said in my other post:

> And for anything else that is more verbose, like the Connecting to bootstrap nodes message, which includes a bootstrapNodes field containing an enode address, we can collect those via logs just fine. If the volume is reduced by properly exporting metrics as metrics, we can do that too, without having to resort to ridiculous amounts of DEBUG logs.

If we remove the majority of those logs by moving most of the stuff to a proper metrics setup, then I see no issue with collecting real DEBUG logs. But there needs to be a clear distinction. Most of the stuff in the logs right now is just metrics.

arnetheduck commented 5 years ago

they're logs. we're extracting some from them but they're definitely not metrics. the data is needed to.. er.. debug things after the fact, which is why we want the debug logs too.

jakubgs commented 5 years ago

What about messages like Unresolved block and Block received? Could they be turned into just metrics? I see they have string values like previous_block_root or state_root which appear to be addresses, but could those not be for example represented as a number relative to something?

If that were possible we could reduce the number of logs by ~70%.
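
Purely as an illustration (not a concrete proposal), a root could be folded into a number, e.g. by interpreting its first few bytes as an integer:

# Speculative sketch: represent a block/state root as a number so it could
# be exported as a metric value instead of logged as a string.
def root_to_number(root_hex: str) -> int:
    # Interpret the first 8 bytes (16 hex chars) of the root as an integer.
    return int(root_hex[:16], 16)

# e.g. the truncated newHeadBlockRoot from the example above
print(root_to_number('E3506242'.ljust(64, '0')))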

arnetheduck commented 5 years ago

These are important checkpoints along the block processing pipeline that we use to reconstruct what happened in a tricky scenario - the roots allow us to establish the chain of events and spot reorderings and the like..

eventually we'll go down to INF and better broadcasts, and a lot of it will go away (ie once we run this for real, in production)

jakubgs commented 5 years ago

Okay, I've added a 4th node to the ElasticSearch cluster: node-04.do-ams3.log-store.hq

This should lessen the load on the others when querying and reduce the size of the shards:

Changes:

Hopefully this will keep the ES cluster alive while I'm away during first week of May. If it's not enough we'll need to bump instance types on all 4 from s-2vcpu-4gb to s-4vcpu-8gb.

It might take a while for shards to start spreading but it should help: https://es.status.im/

jakubgs commented 5 years ago

With help from @zah we now have proper Prometheus metrics configured in:

The metrics look like this:

 % curl -s "http://localhost:9200/metrics" | grep '^beacon_'
beacon_current_live_validators 0.0
beacon_current_live_validators_created 1572608831.0
beacon_previous_live_validators 0.0
beacon_previous_live_validators_created 1572608831.0
beacon_pending_deposits 0.0
beacon_pending_deposits_created 1572608831.0
beacon_processed_deposits_total 192.0
beacon_processed_deposits_total_created 1572608831.0
beacon_finalized_epoch 0.0
beacon_finalized_epoch_created 1572608831.0
beacon_finalized_root 0.0
beacon_finalized_root_created 1572608831.0
beacon_current_justified_epoch 0.0
beacon_current_justified_epoch_created 1572608831.0
beacon_current_justified_root 0.0
beacon_current_justified_root_created 1572608831.0
beacon_previous_justified_epoch 0.0
beacon_previous_justified_epoch_created 1572608831.0
beacon_previous_justified_root 0.0
beacon_previous_justified_root_created 1572608831.0
beacon_current_validators 192.0
beacon_current_validators_created 1572608831.0
beacon_previous_validators 192.0
beacon_previous_validators_created 1572608831.0
beacon_previous_epoch_orphaned_blocks 0.0
beacon_previous_epoch_orphaned_blocks_created 1572608831.0
beacon_reorgs_total_total 0.0
beacon_reorgs_total_created 1572608831.0
beacon_slot 8332.0
beacon_slot_created 1572608831.0
beacon_head_slot 8332.0
beacon_head_slot_created 1572608831.0
beacon_head_root 1.157048430285501e+17
beacon_head_root_created 1572608831.0
beacon_pending_exits 0.0
beacon_pending_exits_created 1572608831.0
beacon_attestations_sent_total 173.0
beacon_attestations_sent_created 1572608831.0
beacon_attestations_received_total 1659.0
beacon_attestations_received_created 1572608831.0
beacon_blocks_proposed_total 12.0
beacon_blocks_proposed_created 1572608831.0
beacon_blocks_received_total 836.0
beacon_blocks_received_created 1572608831.0

And now we have a proper metrics dashboard: https://grafana.status.im/d/lXQ3tP0Zz/nimbus-metrics Though it probably needs some work, as I don't really understand these metrics. Some descriptions would be nice too.