google / certificate-transparency

Auditing for TLS certificates.
https://certificate.transparency.dev
Apache License 2.0

ct-server stability during election #1415

Closed: grandamp closed this issue 7 years ago

grandamp commented 7 years ago

Hello,

Similar to issue #811, we are seeing ct-server fail on random instances about once a week.

The following entry was in the ct-server.FATAL log:

Log file created at: 2017/09/05 16:43:39
Running on machine: ip-172-31-11-120
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0905 16:43:39.573050  1252 masterelection.cc:362] Check failed: task->status().ok() /root/election/7a1ff88f-7cfb-4105-a657-38f2b1f5df6b: FAILED_PRECONDITION: Invalid JSON: Couldn't find 'node'
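For anyone scanning these logs, the "Log line format" header above can be turned into a small parser. This is a minimal sketch, assuming bash and its built-in regex matching. The function name `parse_glog_line` is hypothetical, and the sample line is an abbreviated copy of the FATAL entry above.

```shell
#!/bin/bash
# Split one glog-style line into fields, following the header's layout:
#   [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
# parse_glog_line is a hypothetical helper name, not part of ct-server.
parse_glog_line() {
    local re='^([IWEF])([0-9]{2})([0-9]{2}) ([0-9:.]+) +([0-9]+) ([^]]+)\] (.*)$'
    # Return non-zero for lines that do not follow the format.
    [[ "$1" =~ $re ]] || return 1
    printf 'severity=%s\n' "${BASH_REMATCH[1]}"
    printf 'date=%s-%s\n' "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
    printf 'time=%s\n' "${BASH_REMATCH[4]}"
    printf 'thread=%s\n' "${BASH_REMATCH[5]}"
    printf 'source=%s\n' "${BASH_REMATCH[6]}"
    printf 'message=%s\n' "${BASH_REMATCH[7]}"
}

# Abbreviated copy of the FATAL entry above.
parse_glog_line "F0905 16:43:39.573050  1252 masterelection.cc:362] Check failed: task->status().ok()"
```

Filtering on the `source=` field makes it easier to collect every entry that originates in `masterelection.cc` across the INFO, WARNING, and FATAL files.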

However, the etcd cluster appears to be completely healthy:

./etcdctl cluster-health
member 14f4d580c02ab5c5 is healthy: got healthy result from http://172.31.0.101:4001
member 40b2ec55845b6dec is healthy: got healthy result from http://172.31.0.103:4001
member dcbb94f558f01d98 is healthy: got healthy result from http://172.31.0.102:4001
cluster is healthy
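Since cluster-health looks fine while the server complains about a missing 'node', it may help to look at what etcd actually returns for the election key. The following is a rough sketch, assuming ct-server talks to the etcd v2 keys API over plain HTTP, where successful replies carry a "node" object and error replies carry "errorCode" and "message" instead. The function names are hypothetical, and the `grep` test is a crude textual check rather than a real JSON parse.

```shell
#!/bin/bash
# Return 0 if the JSON body on stdin appears to contain a "node" key.
# Crude textual check; a JSON parser would be more robust.
has_node_field() {
    grep -q '"node"[[:space:]]*:'
}

# Fetch one key via the etcd v2 keys API and report whether the reply
# contains "node". Both arguments are supplied by the caller, e.g.
#   probe_key 172.31.0.101:4001 /root/election
# probe_key is a hypothetical helper, not an etcd or ct-server tool.
probe_key() {
    local endpoint="$1" key="$2" body
    body="$(curl --silent --max-time 5 "http://${endpoint}/v2/keys${key}")" || return 2
    if printf '%s' "$body" | has_node_field; then
        echo "ok: response has 'node'"
    else
        echo "unexpected response: ${body}"
        return 1
    fi
}
```

Running `probe_key` against each member in a loop around the time of a failure could show whether one member returns an error body, or an empty one, while the others answer normally.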

Version information follows.

etcd & ct-server instance OS & version:

Ubuntu 16.04 LTS (AWS)

etcd version info:

2017-09-05 18:37:18.177416 I | etcdmain: etcd Version: 3.2.2
2017-09-05 18:37:18.177428 I | etcdmain: Git SHA: cb2a496
2017-09-05 18:37:18.177439 I | etcdmain: Go Version: go1.8.3
2017-09-05 18:37:18.177451 I | etcdmain: Go OS/Arch: linux/amd64

ct-server version info:

ct-server version 66796fee67cc7785ea653bda4713496b968c24a4
Debug build (NDEBUG not #defined)

As with issue #811, is the best approach to extend the etcd timeouts?

grandamp commented 7 years ago

Here are the logs. The WARNING and INFO logs are truncated to the last 10k lines of each file.

ct-server_logs_truncated_20170905.tar.gz

grandamp commented 7 years ago

I wanted to follow up and close this issue, as we have been stable for a few months. We updated our ct-server instances using #1412 and merged #1417. In addition, here is the updated startup script for our ct-server instances:

#!/bin/bash
CTLOGHOST="$(hostname -f)"
/bin/echo "Server hostname is ${CTLOGHOST}"
ETCD_SERVERS="etcd1.internal:4001,etcd2.internal:4001,etcd3.internal:4001"
/bin/echo "ETCD Cluster is ${ETCD_SERVERS}"
/bin/echo "Deleting prior log data and logs"
/bin/rm /opt/ct-log/logs/*
/bin/rm /opt/ct-log/data/log.ldb/*
# The key and root files below are relative paths, so stop if the cd fails.
cd /usr/ctlog/opts || exit 1
ulimit -c unlimited
/usr/ctlog/server/ct-server \
        --port=80 \
        --server=${CTLOGHOST} \
        --key=ct-server-key.pem \
        --trusted_cert_file=ca-roots.pem \
        --log_dir=/opt/ct-log/logs \
        --tree_signing_frequency_seconds=30 \
        --guard_window_seconds=10 \
        --leveldb_db=/opt/ct-log/data/log.ldb \
        --etcd_servers=${ETCD_SERVERS} \
        --etcd_delete_concurrency=100 \
        --num_http_server_threads=16 \
        --etcd_connection_timeout_seconds=30 \
        --node_state_ttl_seconds=900 \
        --master_keepalive_interval_seconds=240 \
        --monitoring=prometheus \
        --v=0 &
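One optional addition to a script like this is a pre-flight check that every etcd endpoint answers before ct-server is launched. This is a minimal sketch, assuming the members serve etcd's `/health` URL over plain HTTP on the listed ports. The function names are hypothetical and are not part of ct-server or etcd.

```shell
#!/bin/bash
# Split a comma-separated host:port list (the format passed to
# --etcd_servers) into one endpoint per line.
split_endpoints() {
    local IFS=','
    local -a endpoints
    read -r -a endpoints <<< "$1"
    printf '%s\n' "${endpoints[@]}"
}

# Return 0 if the endpoint answers /health within 5 seconds.
# Assumes plain HTTP, as in the cluster-health output earlier in the thread.
endpoint_is_healthy() {
    curl --silent --fail --max-time 5 "http://$1/health" > /dev/null
}

# Return 0 only if every endpoint in the list is reachable.
all_endpoints_healthy() {
    local endpoint
    while IFS= read -r endpoint; do
        if ! endpoint_is_healthy "$endpoint"; then
            echo "unreachable: ${endpoint}" >&2
            return 1
        fi
    done < <(split_endpoints "$1")
}
```

Calling `all_endpoints_healthy "${ETCD_SERVERS}" || exit 1` just before the ct-server command line would stop the launch early when a member is down, instead of leaving the failure to surface later in the election code.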