status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team

Missing logs from Nimbus ELK stack #81

Closed jakubgs closed 2 years ago

jakubgs commented 2 years ago

Some logs are missing when querying them in our Nimbus Kibana instance, specifically:

{"lvl":"DBG","ts":"2022-01-28 14:05:54.724+00:00","msg":"Received request","topics":"beacnde","peer":"82.139.21.242:49814","meth":"GET","uri :"/totally_a_test"}
{"lvl":"DBG","ts":"2022-01-28 14:05:54.724+00:00","msg":"Request is not part of API","topics":"beacnde","peer":"82.139.21.242:49814","meth":"GET","uri":"/totally_a_test"}

Querying for these returns nothing: https://nimbus-logs.infra.status.im/goto/05738788ae13e81a579cbcadc06e4cbb
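Before digging further, querying Elasticsearch directly (bypassing Kibana) should tell us whether those documents were ever indexed at all. A minimal sketch, assuming the message ends up in a json.msg field as per our Logstash JSON parsing:

# Search the day's index for the suspect message directly in ES,
# bypassing Kibana; the json.msg field name is an assumption.
curl -s 'localhost:9200/logstash-2022.01.28/_search?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match_phrase": {"json.msg": "Request is not part of API"}}, "size": 1}'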

jakubgs commented 2 years ago

While checking Logstash logs I found errors about inconsistent formatting of the error field, so I opened an issue: https://github.com/status-im/nimbus-eth2/issues/3335

But that's not what I'm looking for.

jakubgs commented 2 years ago

Cluster and indices appear healthy:

admin@node-01.aws-eu-central-1a.log-store.nimbus:~ % curl -s localhost:9200/_cluster/health | jq .status 
"green"
admin@node-01.aws-eu-central-1a.log-store.nimbus:~ % curl -s 'http://localhost:9200/_cat/indices?v&s=index' | tail
green  open   logstash-2022.01.19             vYMNhbaRS9ygpRrhKzLjjA   1   1   19435799            0     16.1gb          8.1gb
green  open   logstash-2022.01.20             s-VpsnK3RS6qaSo6to7rsQ   1   1   18472014            0     15.4gb          7.7gb
green  open   logstash-2022.01.21             q0GAiiq5RIySnRXKjEAYiw   1   1   16695781            0     13.9gb          6.9gb
green  open   logstash-2022.01.22             xoFARO4mT2CjcX7xO9eOEw   1   1   17044434            0     14.2gb          7.1gb
green  open   logstash-2022.01.23             g5CSCh2HRpav1ofEa2e2KA   1   1   19818594            0     16.5gb          8.2gb
green  open   logstash-2022.01.24             Fc7DZ8QTSfmFZ5bCq3rHpA   1   1   18575332            0     15.5gb          7.7gb
green  open   logstash-2022.01.25             zyppPhRKRbGOIJUhi_Y30w   1   1   17500493            0     14.6gb          7.3gb
green  open   logstash-2022.01.26             YG7U7HWeRxuVdQmiza3U2g   3   1    4832581            0      4.5gb          2.2gb
green  open   logstash-2022.01.27             ZRxXXqIHSb2HdPTfXs9Agg   3   1    4991335            0      4.4gb          2.2gb
green  open   logstash-2022.01.28             TupksO-URG-Ihv1MhE5cVw   3   1    3197084            0      3.1gb          1.3gb
jakubgs commented 2 years ago

The nodes look to be connected fine, including the dashboard LB node:

admin@node-01.aws-eu-central-1a.log-store.nimbus:~ % c localhost:9200/_cat/nodes    
10.13.0.38 44 97 2 0.10 0.08 0.08 cdfhilmrstw - node-01.aws-eu-central-1a.log-store.nimbus
10.13.0.45 41 97 1 3.48 3.29 4.16 cdfhilmrstw * node-02.aws-eu-central-1a.log-store.nimbus
10.13.0.13 32 95 1 0.08 0.06 0.01 -           - node-01.aws-eu-central-1a.dash.nimbus
10.13.0.47 54 97 2 4.58 3.50 2.38 cdfhilmrstw - node-03.aws-eu-central-1a.log-store.nimbus
jakubgs commented 2 years ago

In terms of resources the ES cluster looks fine:

image

jakubgs commented 2 years ago

Kibana sees all indices:

image

jakubgs commented 2 years ago

Found the issue:

image

Dark mode was off.

jakubgs commented 2 years ago

Flow of logs appears to be steady:

image

jakubgs commented 2 years ago

Another logging issue found: https://github.com/status-im/nimbus-eth2/issues/3336

jakubgs commented 2 years ago

I've been trying to check what messages arrive in Logstash from metal-01.he-eu-hel1.nimbus.mainnet by adding this:

     if [logsource_ip] == "95.217.87.121" {
       file {
         path => "/usr/share/logstash/config/debug.log"
         codec => rubydebug
       }
     }

https://github.com/status-im/infra-hq/blob/78341683/ansible/roles/logstash/templates/pipe/52-filter-beacon-node.conf#L2

That sits within the if that handles beacon-node log output, but the debug.log file isn't even created. I'm baffled.

jakubgs commented 2 years ago

This definitely should work, since this basic config works:

input {
  file {
    path => "/home/jakubgs/input.log"
  }
}
filter {
  grok {
    match => {
      "message" => "%{WORD:extra}"
    }
  }
}
output { 
  file {
    path => "/home/jakubgs/output.log"
    codec => rubydebug
  }
}
 > cat output.log       
{
       "message" => "wat",
         "extra" => "wat",
          "path" => "/home/jakubgs/input.log",
      "@version" => "1",
    "@timestamp" => 2022-01-28T18:23:48.415Z,
          "host" => "caspair"
}

Maybe the issue is permissions...
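One quick way to check that (a sketch, assuming Logstash runs as the logstash user, e.g. as in the official image; if it runs in a container the equivalent goes through docker exec):

# Try to create the debug file as the user Logstash runs under;
# an EACCES here would confirm the permissions theory.
sudo -u logstash touch /usr/share/logstash/config/debug.log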

jakubgs commented 2 years ago

Yup, found it!

[INFO ][logstash.outputs.file    ] Opening file {:path=>"/usr/share/logstash/config/debug.log"}
[WARN ][logstash.javapipeline    ] Waiting for input plugin to close {:pipeline_id=>"main", :thread=>"#<Thread:0x6231629 run>"}
[ERROR][logstash.javapipeline    ] Pipeline worker error, the pipeline will be stopped {:pipeline_id=>"main", :error=>"(EACCES) Permission denied - /usr/share/logstash/config/debug.log", :exception=>Java::OrgJrubyExceptions::SystemCallError, ...
Permission denied - /usr/share/logstash/config/debug.log
jakubgs commented 2 years ago

What the fuck is happening. I have a clearly defined file output and yet it doesn't show up in initialization logs:

[ERROR][logstash.agent           ] API HTTP Request {:status=>500, :request_method=>"GET", :path_info=>"/_node/stats", :query_string=>"", :http_version=>"HTTP/1.1", :http_accept=>nil}
[INFO ][org.reflections.Reflections] Reflections took 805 ms to scan 1 urls, producing 120 keys and 417 values 
[INFO ][logstash.outputs.elasticsearch] New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["//10.10.0.20:9200", "//10.10.0.21:9200", "//10.10.0.13:9200", "//10.10.0.15:9200"]}
[INFO ][logstash.outputs.elasticsearch] Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://10.10.0.20:9200/, http://10.10.0.21:9200/, http://10.10.0.13:9200/, http://10.10.0.15:9200/]}}
[WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://10.10.0.20:9200/"}
[INFO ][logstash.outputs.elasticsearch] Elasticsearch version determined (7.14.2) {:es_version=>7}
[WARN ][logstash.outputs.elasticsearch] Detected a 6.x and above cluster: the `type` event field won't be used to determine the document _type {:es_version=>7}
[WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://10.10.0.21:9200/"}
[WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://10.10.0.13:9200/"}
[WARN ][logstash.outputs.elasticsearch] Restored connection to ES instance {:url=>"http://10.10.0.15:9200/"}
[INFO ][logstash.outputs.elasticsearch] Using a default mapping template {:es_version=>7, :ecs_compatibility=>:disabled}

This makes no fucking sense.

jakubgs commented 2 years ago

If I drop all other outputs it loads fine:

[INFO ][logstash.outputs.file    ] Opening file {:path=>"/debug/debug.log"}

But why?! If there's a syntax issue it should just fail to start.

jakubgs commented 2 years ago

Ok, so it appears Logstash only creates the output file once the first event matching the if clause comes through.

And because I restarted Logstash, for some reason it stopped receiving logs from our Nimbus hosts unless I also restart rsyslog on them:

[INFO ][logstash.outputs.file    ] Opening file {:path=>"/debug/debug.log"}

And then it works...
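A quick sanity check for this situation is to confirm whether any UDP packets are even arriving on the Logstash input port after a restart, for example:

# Capture a handful of packets on the Logstash UDP input port (5141)
# to see whether rsyslog is still sending anything after the restart.
sudo tcpdump -ni any 'udp port 5141' -c 5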

jakubgs commented 2 years ago

I made the simplest config possible, consisting of one file:

input {
  udp {
    port => 5141
    codec => "json"
    type => "rsyslog"
    buffer_size => 262144
  }
}
output {
  file {
    path => "/debug/wtf.log"
    codec => json
  }
}

And it doesn't show up in the file. What the actual fuck.

This means the issue has to be on the rsyslog side.
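To split the problem in two, a hand-crafted event can be sent straight to the UDP input; if it shows up in /debug/wtf.log then the Logstash side is fine and rsyslog is the culprit (host and port here just mirror the config above):

# Inject a single JSON event into the UDP input by hand; if it lands
# in /debug/wtf.log, Logstash works and rsyslog is to blame.
echo '{"message":"manual test event"}' | nc -u -w1 127.0.0.1 5141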

jakubgs commented 2 years ago

There's a whole stream of forks of rsyslog_exporter that provide Prometheus metrics for rsyslog; the newest one also has a release: https://github.com/aleroyer/rsyslog_exporter/releases/tag/v0.1.0

jakubgs commented 2 years ago

You can get some stuff from it:

admin@metal-01.he-eu-hel1.nimbus.mainnet:~ % curl -s localhost:9104/metrics | grep -E '^rsyslog+.*[1-9]$'
rsyslog_action_processed{action="action-0-mmutf8fix"} 5.212364e+06
rsyslog_action_processed{action="action-2-builtin:omfwd"} 870284
rsyslog_action_processed{action="action-5-builtin:omfile"} 18
rsyslog_action_processed{action="action-6-builtin:omfile"} 5.212346e+06
rsyslog_action_processed{action="to_exporter"} 896
rsyslog_input_submitted{input="imuxsock"} 5.212364e+06
rsyslog_queue_enqueued{queue="main Q"} 5.213264e+06
rsyslog_queue_max_size{queue="main Q"} 751
rsyslog_queue_size{queue="main Q"} 4
rsyslog_resource_inblock{resource="resource-usage"} 176
rsyslog_resource_majflt{resource="resource-usage"} 5
rsyslog_resource_maxrss{resource="resource-usage"} 7396
rsyslog_resource_nivcsw{resource="resource-usage"} 12557
rsyslog_resource_nvcsw{resource="resource-usage"} 9.908851e+06
rsyslog_resource_oublock{resource="resource-usage"} 4.440904e+06
rsyslog_resource_stime{resource="resource-usage"} 5.4581416e+07
rsyslog_resource_utime{resource="resource-usage"} 3.8644998e+07

Most metrics are 0 though. The rsyslog_queue_size is quite small.
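It's probably also worth grepping for failure and discard counters, since a full or suspended queue would explain silently dropped messages; exact metric names differ a bit between rsyslog_exporter forks:

# Look for non-zero failure/discard/suspension counters; metric names
# vary slightly between rsyslog_exporter forks.
curl -s localhost:9104/metrics | grep -E 'failed|discarded|suspended|full'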

jakubgs commented 2 years ago

Based on some stuff I've read about tuning rsyslog outputs, I've tuned the Logstash output action configuration: https://github.com/status-im/infra-role-bootstrap-linux/commit/87954443

  # Tuning
  queue.size="250000"
  queue.dequeueBatchSize="4096"
  queue.workerThreads="4"
  queue.workerThreadMinimumMessages="60000"

https://github.com/status-im/infra-role-bootstrap-linux/blob/18f07297/templates/rsyslog/30-logstash.conf#L27-L39
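After a change like this it's worth validating the rsyslog config before restarting, e.g.:

# Validate the rsyslog configuration, then apply it.
sudo rsyslogd -N1
sudo systemctl restart rsyslog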

jakubgs commented 2 years ago

I started a local instance of Logstash just to debug that one node, but for some reason I can't get it to open a connection to it:

admin@metal-01.he-eu-hel1.nimbus.mainnet:~ % sudo conntrack -L -p udp --dport 5141                  
conntrack v1.4.5 (conntrack-tools): 0 flow entries have been shown.
jakubgs commented 2 years ago

Looks like we're not getting logs from all Nimbus hosts. Not sure what that's about, will figure it out tomorrow.

jakubgs commented 2 years ago

I found out why logs are missing: my own mistake when adjusting the journald config.

Reverted: https://github.com/status-im/infra-role-bootstrap-linux/commit/2da26200

jakubgs commented 2 years ago

For some reason changes to /etc/systemd/journald.conf do not have any effect until reboot; restarting the service does nothing. I also tried systemctl force-reload systemd-journald, but it doesn't work. Makes no sense to me.

stefantalpalaru commented 2 years ago

"making a change to any of journald's configuration files (/etc/systemd/journald.conf, for example) requires subsequently running systemctl restart systemd-journald in order for the changes to take effect. However, restarting the journald service appears to break logging for all other running services until those services are also restarted" - https://github.com/systemd/systemd/issues/2236

:-)

jakubgs commented 2 years ago

Now isn't that lovely.

jakubgs commented 2 years ago

I think we have a much bigger problem than just some logs missing. The MAJORITY of logs are not stored in the ES cluster:

admin@metal-01.he-eu-hel1.nimbus.mainnet:~ % zcat /var/log/service/beacon-node-mainnet-unstable-02/service.20220202_1400.log.gz | wc -l
6185047

That's 6 million entries for just one hour of one node on one host, and here's the size of the index from today:

admin@node-01.aws-eu-central-1a.dash.nimbus:~ % c 'localhost:9200/logstash-2022.02.02/_count' | jq .count
5257709

The index for today's whole day has fewer logs than one hour of one node on one host...
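A per-host breakdown of today's index should show which hosts are underrepresented. A sketch using a terms aggregation; the logsource field name (and its .keyword sub-field) is an assumption based on our rsyslog/Logstash mapping:

# Count documents per source host in today's index; "logsource.keyword"
# is an assumed field name from the rsyslog/Logstash mapping.
curl -s 'localhost:9200/logstash-2022.02.02/_search' \
  -H 'Content-Type: application/json' \
  -d '{"size": 0, "aggs": {"hosts": {"terms": {"field": "logsource.keyword", "size": 50}}}}' | jq .aggregations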

jakubgs commented 2 years ago

Ok, I think I can see the issue:

image

We're simply hitting the 1GBit/s limit of the link installed in our Hetzner hosts: https://www.hetzner.com/dedicated-rootserver/ax41-nvme

admin@metal-01.he-eu-hel1.nimbus.mainnet:~ % sudo ethtool enp35s0 | grep -A2 'Supported link modes'
    Supported link modes:   10baseT/Half 10baseT/Full 
                            100baseT/Half 100baseT/Full 
                            1000baseT/Full 
jakubgs commented 2 years ago

If I look at the traffic sent via rsyslog to node-01.he-eu-hel1.log-aggr.hq using iftop, we can see it's ~70 Mb/s:

image

Other hosts show something similar, which from a rough calculation gives us 16 * 70 = 1120 Mb/s. So we might be over the limit, possibly by quite a bit; either way we need to either get more bandwidth or start compressing.
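For reference, one way to isolate just the log-shipping traffic in iftop is a pcap filter, roughly like this (interface name from the ethtool output above, port from our rsyslog output config):

# Watch only the rsyslog->Logstash traffic; -f takes a pcap filter.
sudo iftop -i enp35s0 -f 'udp and port 5141'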

jakubgs commented 2 years ago

The other option is to try to get a 10Gb/s link for the Hetzner host:

image image

https://docs.hetzner.com/robot/dedicated-server/general-information/root-server-hardware/#Miscellaneous

Though for that to work we'd also have to move our ElasticSearch cluster to Hetzner and put it under the same switch.

jakubgs commented 2 years ago

So I see two options:

Not sure about the viability with the 5-port host; I'd have to ask Hetzner support.

stefantalpalaru commented 2 years ago

Those logs are highly compressible. Can we send them compressed or re-compress them in flight?

jakubgs commented 2 years ago

Yes, that's what I already suggested in https://github.com/status-im/infra-nimbus/issues/81#issuecomment-1028049121, but there's no ready-made solution for compressing logs sent to Logstash: https://www.elastic.co/guide/en/logstash/current/input-plugins.html

So it would require some kind of custom solution: an intermediate service that compresses the traffic, with buffering tuned so the compression is actually effective.

jakubgs commented 2 years ago

Wow, I stopped logging from just one host - metal-06.he-eu-hel1.nimbus.mainnet - and this is the result:

image

I did not expect such a big change from one host, but it does match what iftop says:

image

jakubgs commented 2 years ago

In comparison, the metal-01.he-eu-hel1.nimbus.prater.wg host generates much less traffic:

image

I guess this is related to the number of peers in the network. Or something like that @arnetheduck ?

arnetheduck commented 2 years ago

a combination of validators and peers

jakubgs commented 2 years ago

A very considerable difference; I'd say mainnet is responsible for more than half of the logs:

image

I'm going to leave it like that - with logging to ELK for 4 mainnet hosts off - until I resolve this issue. It would be better to limit these logs based on log level from Nimbus, but that will require additional rsyslog wizardry.
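The "rsyslog wizardry" would roughly amount to dropping DBG-level lines before the forwarding action that ships to Logstash. An untested RainerScript sketch; the program-name match is an assumption about how the beacon-node services are tagged:

# Drop debug-level beacon-node output before the omfwd action runs;
# crude string match on the JSON "lvl" field emitted by Nimbus.
if $programname startswith 'beacon-node' and $msg contains '"lvl":"DBG"' then {
  stop
}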

jakubgs commented 2 years ago

Looks like replacing the cluster with one hosted on Hetzner metal nodes did help:

image

https://nimbus-logs.infra.status.im/goto/ba7674a1469e911b2fafab3573eeee4e

jakubgs commented 2 years ago

Now the Hetzner hosts are receiving ~170 MB/s in logs from the aggregation hosts:

image

Today's index is ALREADY 22 GB in size (11 GB per host):

admin@store-01.he-eu-hel1.logs.nimbus:~ % c localhost:9200/_cat/indices\?s=index | grep logstash
green open logstash-2022.02.03             UkvhMwlPT96fkZdqmyl8uw 3 1 26291030    0  22.2gb  11.2gb

Not sure how sustainable this is... this means 24h of logs is going to be ~240 GB, and those hosts have only ~400 GB of storage.

We will have to add some rules to Logstash to possibly cut out the most verbose and useless logs.

jakubgs commented 2 years ago

The Attestation received message accounts for the vast majority of logs, followed by Aggregate received:

image

If we could cut that one out we'd save a LOT of space and also make queries much faster.

@arnetheduck thoughts?

arnetheduck commented 2 years ago

ok to filter them from elk

jakubgs commented 2 years ago

Yep, the size is ridiculous; I will apply the rules:

admin@store-01.he-eu-hel1.logs.nimbus:~ % c localhost:9200/_cat/indices/logstash\*\?s=index\&v  
health status index               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   logstash-2022.02.03 UkvhMwlPT96fkZdqmyl8uw   3   1   47747875            0     39.5gb         19.7gb
green  open   logstash-2022.02.04 KyE0okHvTR-xwScV673Sjg   3   1  156849062            0    132.9gb         66.5gb
jakubgs commented 2 years ago

The change - https://github.com/status-im/infra-hq/commit/d699e893 - results in about a 96% reduction in log volume:

image

The step in the graph is there because I first deployed to node-01.he-eu-hel1.log-aggr.hq, and then to the rest of the log aggregation hosts.

arnetheduck commented 2 years ago

lgtm, should also speed up searches considerably

jakubgs commented 2 years ago

Yeah, much better:

image

Especially now that the hosts use NVMe drives and have 64 GB of RAM, which does speed up queries and indexing a LOT.

jakubgs commented 2 years ago

Looks like we have a new issue on the ES cluster:

Caused by: org.elasticsearch.common.breaker.CircuitBreakingException:
[parent] Data too large, data for [internal:cluster/nodes/indices/shard/store[n]] would be [36997638608/34.4gb],
which is larger than the limit of [36795685273/34.2gb], real usage: [36997638456/34.4gb], new bytes reserved:
[152/152b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=152/152b, model_inference=0/0b, eql_sequence=0/0b, accounting=3913236/3.7mb]
jakubgs commented 2 years ago

I can't even list existing indices:

admin@store-01.he-eu-hel1.logs.nimbus:~ % c localhost:9200/_cat/indices/logstash\*\?s=index\&v
curl: (22) The requested URL returned error: 429 Too Many Requests
jakubgs commented 2 years ago

I bumped the JVM memory limits to 80% of what the system has: https://github.com/status-im/infra-nimbus/commit/ba2caedb

jakubgs commented 2 years ago

Looks like we are still getting logs that I've excluded with this filter:

    if [json.msg] in ["Attestation resolved", "Attestation received", "Aggregate received"] {
      drop {}
    }

image

So there's something not right with that if condition.

jakubgs commented 2 years ago

Oh, I see it: I used the old format for accessing nested fields. Fixed: https://github.com/status-im/infra-hq/commit/b2505364
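For the record, nested fields need the [parent][child] bracket syntax in Logstash conditionals; the dotted form refers to a literal field named json.msg, which never matches. The corrected filter is presumably along these lines:

# [json][msg] is the correct nested-field reference; [json.msg]
# silently matches nothing, so the drop never fired.
if [json][msg] in ["Attestation resolved", "Attestation received", "Aggregate received"] {
  drop {}
}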

jakubgs commented 2 years ago

I've also lowered the number of indices (days) kept to 10 to avoid running out of space: https://github.com/status-im/infra-nimbus/commit/fb7bde23

jakubgs commented 2 years ago

Interestingly enough, I've noticed a reduction in disk space used after the index for the new day was created:

admin@store-01.he-eu-hel1.logs.nimbus:~ % sudo du -hsc /docker/elasticsearch/data/nodes/0/indices  
17G /docker/elasticsearch/data/nodes/0/indices
17G total
admin@store-01.he-eu-hel1.logs.nimbus:~ % sudo du -hsc /docker/elasticsearch/data/nodes/0/indices
13G /docker/elasticsearch/data/nodes/0/indices
13G total

My best guess is that this is due to ElasticSearch compression taking effect.

Possibly we could get an even better result by adjusting index.codec to best_compression: https://www.elastic.co/blog/save-space-and-money-with-improved-storage-efficiency-in-elasticsearch-7-10

index.codec

The default value compresses stored data with LZ4 compression, but this can be set to best_compression which uses DEFLATE for a higher compression ratio, at the expense of slower stored fields performance. If you are updating the compression type, the new one will be applied after segments are merged. Segment merging can be forced using force merge.

https://www.elastic.co/guide/en/elasticsearch/reference/7.17/index-modules.html
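Since index.codec only affects newly written segments, the simplest way to try it would be to set it in an index template so each new daily logstash-* index picks it up. A sketch using the legacy template API on ES 7.x; the template name is made up:

# Apply best_compression to all future daily indices via a legacy
# index template; existing indices are unaffected unless their codec
# setting is changed too and their segments merge.
curl -s -X PUT 'localhost:9200/_template/logstash-best-compression' \
  -H 'Content-Type: application/json' \
  -d '{"index_patterns": ["logstash-*"], "order": 10, "settings": {"index.codec": "best_compression"}}'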