status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team
9 stars 6 forks source link

Logrotate hourly and keep 72 hours of logs #26

Closed arnetheduck closed 3 years ago

arnetheduck commented 3 years ago

Currently running on daily rotation, 5 days:

jacek@devel-large-02.aws-eu-central-1a.nimbus.pyrmont:/data/log/beacon-node-pyrmont-devel$ ls -lah
total 23G
drwxr-xr-x 2 syslog syslog 4.0K Nov 20 00:03 .
drwxr-xr-x 3 syslog adm    4.0K Nov 17 15:20 ..
-rw-r--r-- 1 syslog adm    5.4G Nov 20 13:15 docker.log
-rw-r--r-- 1 syslog adm     17G Nov 20 00:08 docker.log.1
-rw-r--r-- 1 syslog adm    569M Nov 19 00:02 docker.log.2.gz

Ideally, also get rid of prefix (Nov 20 10:11:30 pyrmont-06.aws-eu-central-1a.nimbus.test docker/beacon-node-pyrmont-devel[600]:), it makes the logs non-json, hard to read and unnecessarily large

arnetheduck commented 3 years ago

Can also disable writing to systemd journal, there's a lot of logs and one copy is enough

jakubgs commented 3 years ago

We apparently also push same logs to Systemd journal, or at least rsyslog seems to.

jakubgs commented 3 years ago

As far as I can tell logs appear in systemd journal because dockerd prints them:

admin@devel-small-01.aws-eu-central-1a.nimbus.pyrmont:~ % sudo journalctl -u docker -n 1 --no-pager
-- Logs begin at Tue 2020-11-24 09:12:29 UTC, end at Tue 2020-11-24 09:54:03 UTC. --
Nov 24 09:54:03 devel-small-01.aws-eu-central-1a.nimbus.pyrmont docker/beacon-node-pyrmont-devel-small[583]: {"lvl":"DBG","ts":"2020-11-24 09:54:03.157+00:00","msg":"Attestation received","tid":1,"file":"eth2_processor.nim:313","delay":{"value":32157052613},"wallSlot":42569,"attestation":{"aggregation_bits":"0x0000000000200000000000000000000001","data":{"slot":42567,"index":22,"beacon_block_root":"16986f5e","source":{"epoch":1329,"root":"7d33851c"},"target":{"epoch":1330,"root":"25a5dd27"}},"signature":"b82b4eb7"},"committeeIndex":5}

So the issue isn't with configuration of rsyslog.

jakubgs commented 3 years ago

I don't really see what can be done to stop it from doing that:

 > dockerd -h | grep log
      --log-driver string                       Default driver for container logs (default "json-file")
  -l, --log-level string                        Set the logging level ("debug"|"info"|"warn"|"error"|"fatal") (default
      --log-opt map                             Default log driver options for containers (default map[])
      --raw-logs                                Full timestamps without ANSI coloring
admin@devel-small-01.aws-eu-central-1a.nimbus.pyrmont:~ % 

Changing log level doesn't seem to change anything.

jakubgs commented 3 years ago

Opened and issue, let's see what they say: https://github.com/docker/for-linux/issues/1151

arnetheduck commented 3 years ago

One more thing - around midnight right now, there's a huge processing spike that disturbs NBC - https://metrics.status.im/d/pgeNfj2Wz23/nimbus-fleet-testnets?orgId=1&from=1606330725473&to=1606362552381&var-instance=devel-large-02.aws-eu-central-1a.nimbus.pyrmont - I'm guessing this might be the log compression - if it's not done already, the compaction should probably be nice:d

jakubgs commented 3 years ago

I have identified why the logs also go to systemd journal: https://github.com/docker/for-linux/issues/1151#issuecomment-734351866 Source: https://www.rsyslog.com/doc/master/configuration/modules/imuxsock.html?highlight=socket#coexistence-with-systemd

Long story short, systemd journal controls /dev/log socket and also pushes logs to /run/systemd/journal/syslog from which Rsyslog reads, which is why logs first appear in the journal.

A solution could be specifying syslog-address to point to unixgram:///run/systemd/journal/syslog.

jakubgs commented 3 years ago

I have implemented sending logs by default to rsyslog in https://github.com/status-im/infra-role-bootstrap/commit/2356b0e4.

I've deployed the change to all Nimbus nodes. This should severely lower the frequency of log compression by systemd. Important thing to note is that for the container to be affected by this Docker config change it needs to be re-created. I have yet to verify if Watchtower image updates will make this take effect, but they probably will.

jakubgs commented 3 years ago

I confirmed on devel-large-01.aws-eu-central-1a.nimbus.pyrmont that Watchtower image update does re-create the container in such a way as is necessary for the syslog-address configuration to take effect.

jakubgs commented 3 years ago

So we currently already have log rotation set to hourly: https://github.com/status-im/infra-nimbus/blob/176d843a4002155c171fe19815252b5a4300d216/ansible/group_vars/all.yml#L19 But this does not seem to have effect. By default logrotate is ran daily via /etc/cron.daily/logrotate:

 > grep /usr/sbin/logrotate /etc/cron.daily/logrotate
if [ ! -x /usr/sbin/logrotate ]; then
/usr/sbin/logrotate /etc/logrotate.conf

Which we copy into /etc/cron.hourly here: https://github.com/status-im/infra-role-bootstrap/blob/2356b0e4/tasks/logging/main.yml#L50-L57 Which does work:

admin@devel-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % ls -l /etc/cron.hourly/logrotate 
-rwxr-xr-x 1 root root 377 Jan 21  2019 /etc/cron.hourly/logrotate
jakubgs commented 3 years ago

And we can see that it should run when tested with run-parts:

admin@devel-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % sudo run-parts --report --test /etc/cron.hourly 
/etc/cron.hourly/logrotate

So for all intents and purposes it should run hourly.

jakubgs commented 3 years ago

Oooooooooooooooooh:

admin@devel-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % head -n6 /etc/cron.hourly/logrotate
#!/bin/sh

# skip in favour of systemd timer
if [ -d /run/systemd/system ]; then
    exit 0
fi
admin@devel-large-01.aws-eu-central-1a.nimbus.pyrmont:~ % ls -l /run/systemd/system
total 4
-rw-r--r-- 1 root root 299 Nov 20 13:22 netplan-ovs-cleanup.service
drwxr-xr-x 2 root root  60 Nov 20 13:22 systemd-networkd.service.wants

So it looks like it gets skipped because this exists:

 % s list-timers -a logrotate
NEXT                        LEFT    LAST                        PASSED  UNIT            ACTIVATES        
Fri 2020-11-27 00:00:00 UTC 7h left Thu 2020-11-26 00:00:03 UTC 16h ago logrotate.timer logrotate.servic
 % s status logrotate.timer
● logrotate.timer - Daily rotation of log files
     Loaded: loaded (/lib/systemd/system/logrotate.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Fri 2020-11-20 12:55:03 UTC; 6 days ago
    Trigger: Fri 2020-11-27 00:00:00 UTC; 7h left
   Triggers: ● logrotate.service
       Docs: man:logrotate(8)
             man:logrotate.conf(5)

And the config is...

[Unit]
Description=Daily rotation of log files
Documentation=man:logrotate(8) man:logrotate.conf(5)

[Timer]
OnCalendar=daily
AccuracySec=12h
Persistent=true

[Install]
WantedBy=timers.target

Daily...

Which means that that copy of script into /etc/cron.hourly has no effect.

jakubgs commented 3 years ago

I made it so that /lib/systemd/system/logrotate.timer is modified if we use a custom log rotation frequency: https://github.com/status-im/infra-role-bootstrap/commit/b9ab08df

jakubgs commented 3 years ago

I also bumped limit of files to keep to 24: https://github.com/status-im/infra-nimbus/commit/796c7ce2

jakubgs commented 3 years ago

I did a test by defining the simplest Rsyslog template possible:

$template rawMsg,"%msg%\n"

But when I used it I'm getting a single space in front of the log:

% tail -n1 /var/log/docker/beacon-node-mainnet-master-small/docker.log
 {"lvl":"DBG","ts":"2020-11-26 17:40:25.988+00:00","msg":"Closed stream","topics":"lpstream","tid":1,"file":"lpstream.nim:257","s":"5fbfe889d1359f4f12fa4a31","objName":"LPChannel","dir":"In"}
jakubgs commented 3 years ago

Oh, that's weird but it needs to be:

$template rawMsg,"%msg:2:2048%\n"

Based on:

it is recommended to use: $template myFormat,"%msg:2:2048%\n" https://stackoverflow.com/a/60174111

And that fixes it:

% tail -n1 /var/log/docker/beacon-node-mainnet-master-small/docker.log
{"lvl":"DBG","ts":"2020-11-26 17:43:37.513+00:00","msg":"Stream created","topics":"lpstream","tid":1,"file":"lpstream.nim:128","s":"5fbfe949d1359f4f12fa5374","objName":"ChronosStream","dir":"Out"}
jakubgs commented 3 years ago

Default rsyslog tempaltes can be found here: https://www.rsyslog.com/doc/v8-stable/configuration/templates.html?highlight=rsyslog_traditionalfileformat#reserved-template-names The one v8 uses by default is RSYSLOG_TraditionalFileFormat.

jakubgs commented 3 years ago

I made format customizable in https://github.com/status-im/infra-role-bootstrap/commit/1de7aaac and set it to %msg:2:2048%\n in https://github.com/status-im/infra-nimbus/commit/996e2df2.

jakubgs commented 3 years ago

Looking good:

admin@master-small-02.aws-eu-central-1a.nimbus.mainnet:~ % tail -n5 /var/log/docker/beacon-node-mainnet-master-small/docker.log
{"lvl":"DBG","ts":"2020-11-26 18:08:31.845+00:00","msg":"Peer disconnected","topics":"networking","tid":1,"file":"eth2_network.nim:925","peer":"16Uiu2HAmQP7HztnkrmRwa28U1T7BNhXRkoHLRitt12y9JuUchjr2","connections":0}
{"lvl":"DBG","ts":"2020-11-26 18:08:31.845+00:00","msg":"unsubscribing pubsub peer","topics":"pubsub","tid":1,"file":"pubsub.nim:80","peerId":"16*Uchjr2"}
{"lvl":"DBG","ts":"2020-11-26 18:08:31.886+00:00","msg":"Peer status","topics":"sync","tid":1,"file":"sync_protocol.nim:208","peer":"16*r3QFJb","statusMsg":{"forkDigest":"b5303f2a","finalizedRoot":"00000000","finalizedEpoch":0,"headRoot":"4d611d5b","headSlot":0}}
{"lvl":"DBG","ts":"2020-11-26 18:08:31.886+00:00","msg":"Closed stream","topics":"lpstream","tid":1,"file":"lpstream.nim:257","s":"5fbfef1fd1359f4f12fa5a0a","objName":"LPChannel","dir":"In"}
{"lvl":"DBG","ts":"2020-11-26 18:08:31.957+00:00","msg":"Stream created","topics":"lpstream","tid":1,"file":"lpstream.nim:128","s":"5fbfef1fd1359f4f12fa5a0d","objName":"SecureConn","dir":"Out"}
arnetheduck commented 3 years ago

nice! look at that storage graph as well, nuts: https://grafana.infra.status.im/d/QCTZ8-Vmk/single-host-dashboard?viewPanel=22&orgId=1&from=now-24h&to=now&refresh=10s&var-host=libp2p-small-01.aws-eu-central-1a.nimbus.pyrmont

jakubgs commented 3 years ago

Yeah, I had no idea this stuff was going through systemd. You learn something new every day and yet you still die stupid.

I guess it's done.

arnetheduck commented 3 years ago

I also bumped limit of files to keep to 24: 796c7ce

can we get 72 here? ie 3 days? assuming it's really hourly.

jakubgs commented 3 years ago

Looks hourly so far:

admin@master-small-02.aws-eu-central-1a.nimbus.mainnet:/docker/log/beacon-node-mainnet-master-small % ll
total 114M
-rw-r--r-- 1 syslog adm  16M Nov 26 20:27 docker.log
-rw-r--r-- 1 syslog adm  39M Nov 26 20:00 docker.log.1
-rw-r--r-- 1 syslog adm 2.5M Nov 26 19:00 docker.log.2.gz
-rw-r--r-- 1 syslog adm 1.5M Nov 26 18:00 docker.log.3.gz
-rw-r--r-- 1 syslog adm  56M Nov 26 17:24 docker.log.4.gz

can we get 72 here?

Sure we can: https://github.com/status-im/infra-nimbus/commit/65f0d944

jakubgs commented 3 years ago

For the sake of posterity: disk_operations_after_log_fix