ticketmaster / lbapi

Ticketmaster Load Balancer API
MIT License

BUG-fetchall handler fatal error when unable to connect to netscaler #10

Open mrnetops opened 4 years ago

mrnetops commented 4 years ago

time="2020-10-22T05:00:31Z" level=fatal msg="timout connecting to 192.168.48.20" handler=fetchall route=loadbalancer user=foo

lbapi dies when this occurs and has to be restarted.

mrnetops commented 4 years ago

I'm guessing it's the o.Log.Fatal(err) call in sdk_fork's New():

func New(conf *SdkConf) *SdkFork {
        ////////////////////////////////////////////////////////////////////////////
        var err error
        ////////////////////////////////////////////////////////////////////////////
        o := &SdkFork{
                Virtualserver: virtualserver.New(),
                Loadbalancer:  loadbalancer.New(),
        }
        ////////////////////////////////////////////////////////////////////////////
        o.Target = conf.Target
        o.Log = conf.Log
        ////////////////////////////////////////////////////////////////////////////
        err = o.setConnection()
        if err != nil {
                o.Log.Fatal(err)
        }
        return o
}
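
Something like the sketch below is what I'd expect instead: New returns the error and lets the caller decide what to do. This is only illustrative against the snippet above, and Warnf assumes o.Log is a logrus-style logger.

// Sketch only: a variant of New that propagates the connection error
// instead of terminating the process with o.Log.Fatal.
func New(conf *SdkConf) (*SdkFork, error) {
        o := &SdkFork{
                Virtualserver: virtualserver.New(),
                Loadbalancer:  loadbalancer.New(),
        }
        o.Target = conf.Target
        o.Log = conf.Log
        // Surface the failure as a warning; the caller chooses whether to
        // retry, skip this target, or abort.
        if err := o.setConnection(); err != nil {
                o.Log.Warnf("unable to connect to %v: %v", conf.Target, err)
                return nil, err
        }
        return o, nil
}
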
mrnetops commented 4 years ago

Looks like we're peppered with a number of o.Log.Fatal(err) entries, which largely appear to be stability time bombs that should be handled more gracefully:

./certificate/avi.go:           o.Log.Fatal(err)
./poolgroup/avi.go:         o.Log.Fatal(err)
./loadbalancer/netscaler_etl.go:            o.Log.Fatal(err)
./loadbalancer/netscaler_etl.go:        o.Log.Fatal(err)
./loadbalancer/netscaler.go:            o.Log.Fatal(err)
./loadbalancer/avi.go:          o.Log.Fatal(err)
./pool/avi.go:          o.Log.Fatal(err)
./sdkfork/sdk_fork.go:      o.Log.Fatal(err)
./virtualserver/avi.go:         o.Log.Fatal(err)
./monitor/netscaler.go:         o.Log.Fatal(err)
./monitor/netscaler.go:         o.Log.Fatal(err)
./monitor/avi.go:           o.Log.Fatal(err)
./persistence/avi.go:           o.Log.Fatal(err)
CarlosOVillanueva commented 4 years ago

Log into the lbapi server and run docker logs -f <lbapi instance>.

More than likely, there is a netscaler instance that is no longer live, and the system is trying to connect to it. The logs will show which one. If that is the case, remove that netscaler and its HA members from lbapi.

mrnetops commented 4 years ago

That's going to be my short-term fix, but to be clear, it's a short-term fix to a long-term problem: lbapi is vulnerable to dying from a variety of potentially transitory issues that need to be handled more gracefully.

A connection timeout should log a warning and move on, not kill the entire lbapi, and I imagine the same goes for pretty much every o.Log.Fatal(err) entry outside of main.go.
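
Hypothetically, with New returning an error as sketched above, the fetchall path could warn and skip the unreachable unit. The target loop and Fetch call below are made up for illustration, not actual lbapi code.

// Illustrative only: iterate over the configured load balancer targets,
// logging and skipping any that fail to connect instead of exiting.
for _, target := range targets {
        sdk, err := sdkfork.New(&sdkfork.SdkConf{Target: target, Log: log})
        if err != nil {
                log.Warnf("skipping %v: %v", target, err)
                continue
        }
        // sdk.Fetch is a stand-in for whatever fetchall collects per target.
        results = append(results, sdk.Fetch()...)
}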

CarlosOVillanueva commented 4 years ago

The reasoning for leaving it as-is was to raise an alarm if lbapi was unable to talk to a load balancer. This would prevent a client from attempting to build a virtual service on that unit and force the support team to look into why a load balancer was timing out or unavailable, if that makes sense. But to your point, I completely agree that there are better ways to handle this, and it should not be a total panic.

I'll look into having Go recover automatically after the exception, or possibly setting the Docker restart policy to reload lbapi when it exits. In either case, though, there has to be some mechanism to alert the support team that lbapi cannot talk to the destination resource.
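
To make the recover idea concrete, below is a rough sketch of a handler wrapper that catches a panic, logs it so the support team still sees the failure, and returns a 500 instead of taking the process down. Names are illustrative and it assumes a logrus logger; note that Fatal calls os.Exit after logging, so the existing Fatal calls would still need to be downgraded to Error or Panic for recover to help.

package middleware

import (
        "net/http"

        "github.com/sirupsen/logrus"
)

// recoverable wraps an http.Handler: a panic inside the handler is
// caught, logged for follow-up, and converted into a 500 response
// instead of crashing lbapi.
func recoverable(next http.Handler, log *logrus.Logger) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                defer func() {
                        if rec := recover(); rec != nil {
                                log.Errorf("recovered from panic in %s: %v", r.URL.Path, rec)
                                http.Error(w, "internal error", http.StatusInternalServerError)
                        }
                }()
                next.ServeHTTP(w, r)
        })
}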

mrnetops commented 4 years ago

That's a fair point. Sounds a lot like needing a prometheus exporter + alertmanager ;)
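
For the record, a rough sketch of the exporter side with prometheus/client_golang is below; the metric name and helper are illustrative. An Alertmanager rule on this counter would then page support without lbapi having to die, and exposing it is just a matter of mounting promhttp.Handler() on /metrics.

package metrics

import (
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
)

// lbConnectFailures counts failed connection attempts per load balancer
// target, so alerting keys off the metric instead of a fatal exit.
var lbConnectFailures = promauto.NewCounterVec(
        prometheus.CounterOpts{
                Name: "lbapi_loadbalancer_connect_failures_total",
                Help: "Failed connection attempts to a load balancer target.",
        },
        []string{"target"},
)

// RecordConnectFailure is what a warn-and-continue path would call
// instead of o.Log.Fatal.
func RecordConnectFailure(target string) {
        lbConnectFailures.WithLabelValues(target).Inc()
}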