peterducai / openshift-etcd-suite

tools to troubleshoot ETCD on Openshift 4
GNU General Public License v3.0

"Server is likely overloaded " filter bug returns false positive #15

Open nstamatelopoulos opened 1 year ago

nstamatelopoulos commented 1 year ago

Hello

I noticed that in the function below, the grep command matches on the word 'overload'.

etcd_overload() {
    OVERLOAD=$(cat $1/etcd/etcd/logs/current.log|grep 'overload'|wc -l)   # <-- matches any line containing 'overload'
    LAST=$(cat $1/etcd/etcd/logs/current.log|grep 'overload'|tail -1)     # <-- same overly broad pattern
    LOGEND=$(cat $1/etcd/etcd/logs/current.log|tail -1)
    if [ "$OVERLOAD" != "0" ]; then
      echo -e "${RED}[WARNING]${NONE} we found $OVERLOAD 'server is likely overloaded' messages in $1"
      echo -e "Last occurrence:"
      echo -e "$LAST"| cut -d " " -f1
      echo -e "Log ends at "
      echo -e "$LOGEND"| cut -d " " -f1
      echo -e ""
      OVRL=$(($OVRL+$OVERLOAD))
    # else
    #   echo -e "${GREEN}[OK]${NONE} zero messages in $1"
    fi
}

In my opinion this is not correct, because the etcd container logs contain another message, which I think was added recently, that concerns network load:

"dropped internal Raft message since sending buffer is full (overloaded network)"

As a result, the check reports a false positive for "server is likely overloaded" when the real error is about network overload.

I think we should change this to something like:

OVERLOAD=$(cat $1/etcd/etcd/logs/current.log|grep 'server is likely overloaded'|wc -l)

We should also introduce a second function with the same logic (I don't think it needs to change) that counts the network overload messages.
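A rough sketch of what that split could look like, in case it helps the discussion. The helper name count_etcd_msg and the two wrapper names are hypothetical and not taken from the repository; the log path and the two message strings are the ones quoted above. This only counts matches and leaves the reporting code from etcd_overload out.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in the repo): count the log lines that contain
# a literal message.
#   $1 = pod directory, same layout as used by etcd_overload
#   $2 = literal message text to look for
count_etcd_msg() {
    local log="$1/etcd/etcd/logs/current.log"
    # Print 0 if the log is missing or unreadable.
    [ -r "$log" ] || { echo 0; return 0; }
    # -F matches the text literally (no regex), -c prints the number of
    # matching lines. grep exits 1 when the count is 0, so mask that.
    grep -cF -- "$2" "$log" || true
}

# Counts only the server overload message.
etcd_server_overload_count() {
    count_etcd_msg "$1" 'server is likely overloaded'
}

# Counts only the network overload message.
etcd_network_overload_count() {
    count_etcd_msg "$1" 'sending buffer is full (overloaded network)'
}
```

Each wrapper prints a number that can be assigned to OVERLOAD (or a new variable for the network case) before the existing reporting block runs.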

Let me know your thoughts.

pbertera commented 1 year ago

Another false positive comes from this message: "leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"
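To illustrate why the bare 'overload' pattern conflates all three cases, here is a small sketch that sorts a line by the more specific text. The function name classify_overload_line and the category labels are made up for illustration; the message fragments are the ones quoted in this thread, and real log lines may carry extra fields around them.

```shell
#!/usr/bin/env bash

# Hypothetical classifier: print which kind of "overload" message a log
# line is, based on the message fragments quoted in this issue.
classify_overload_line() {
    case "$1" in
        *'server is likely overloaded'*)                echo server ;;
        *'overloaded network'*)                         echo network ;;
        *'leader is overloaded likely from slow disk'*) echo disk ;;
        *overload*)                                     echo other ;;  # matches 'overload' but none of the known texts
        *)                                              echo none ;;
    esac
}

classify_overload_line 'dropped internal Raft message since sending buffer is full (overloaded network)'
classify_overload_line 'leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk'
```

Both sample messages contain 'overload', so the current grep counts them as server overload, while the more specific patterns keep them apart.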