nokia / danm

TelCo grade network management in a Kubernetes cluster
BSD 3-Clause "New" or "Revised" License

Svcwatcher core after losing master/leader #227

Open TothFerenc opened 4 years ago

TothFerenc commented 4 years ago

Is this a BUG REPORT or FEATURE REQUEST?: bug

What happened: The Svcwatcher Pod lost its master (leader) lease for some reason, and the process dumped its stack while exiting:

E0704 17:44:12.997128       1 svcwatcher.go:93] Lost master
F0704 17:44:12.997152       1 svcwatcher.go:97] Lost lease
E0704 17:44:12.997232       1 event.go:269] Unable to write event: 'can't create an event with namespace 'default' in namespace 'kube-system'' (may retry after sleeping)
goroutine 1 [running]:
github.com/golang/glog.stacks(0xc000374400, 0xc00028e000, 0x3b, 0x9e)
        /go/src/github.com/nokia/danm/vendor/github.com/golang/glog/glog.go:769 +0xb8
github.com/golang/glog.(*loggingT).output(0x20605c0, 0xc000000003, 0xc00023c0e0, 0x1fce7c9, 0xd, 0x61, 0x0)
        /go/src/github.com/nokia/danm/vendor/github.com/golang/glog/glog.go:720 +0x372
github.com/golang/glog.(*loggingT).println(0x20605c0, 0xc000000003, 0xc00002feb0, 0x1, 0x1)
        /go/src/github.com/nokia/danm/vendor/github.com/golang/glog/glog.go:633 +0xe7
github.com/golang/glog.Fatalln(...)
        /go/src/github.com/nokia/danm/vendor/github.com/golang/glog/glog.go:1141
main.main()
        /go/src/github.com/nokia/danm/cmd/svcwatcher/svcwatcher.go:97 +0x9e4
E0704 17:44:20.320002       1 event.go:269] Unable to write event: 'can't create an event with namespace 'default' in namespace 'kube-system'' (may retry after sleeping)
glog: Flush took longer than 10s

What you expected to happen: No core dump before exit.

How to reproduce it: It happens frequently during deployment.

Anything else we need to know?:

Environment:

Levovar commented 4 years ago

So, this is line 97, where it cores: https://github.com/nokia/danm/blob/master/cmd/svcwatcher/svcwatcher.go#L97 It is literally a library call without references to any objects. I think I have already stated earlier that glog is shite :) Maybe the non-newline API wouldn't core, but I absolutely refuse to deep dive into its code. The solution is to remove the usage of the whole library.

The "Unable to write event" remark above is more interesting to me.
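On the call at line 97: as far as the glog documentation describes it, the Fatal family (including `Fatalln`) logs a stack trace of all running goroutines and then exits with status 255, while the Exit family logs without a stack trace and exits with status 1, so the dump may be intended behavior rather than a crash. Whether the vendored glog version exposes the Exit family has not been checked here. Below is a minimal, standard-library-only sketch of exiting on a lost lease without a stack dump. `errLostLease` and `exitPlan` are hypothetical names for illustration and are not part of DANM.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errLostLease is a hypothetical sentinel standing in for "leader lease lost".
var errLostLease = errors.New("lost lease")

// exitPlan decides what to log and which exit code to use for err.
// It only builds a one-line message; it never dumps goroutine stacks.
func exitPlan(err error) (msg string, code int) {
	if err == nil {
		return "", 0
	}
	return fmt.Sprintf("svcwatcher: %v, exiting", err), 1
}

func main() {
	// In a real component the error would come from the leader election loop.
	msg, code := exitPlan(errLostLease)
	fmt.Fprintln(os.Stderr, msg)

	// A real program would call os.Exit(code) here; the demo only prints it
	// so that running the sketch does not itself look like a failure.
	fmt.Println("would exit with code", code)
}
```

Keeping the decision in a small function that returns the message and the code makes the exit path testable and independent of whichever logging library the components eventually standardize on.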

Levovar commented 4 years ago

Regarding the eventing issue: the leader election library creates an event recorder without a namespace defined, so it defaults to "default". Our component runs in kube-system, though, so when we actually want to record an event, it fails. Something like: https://github.com/tsuru/remesher/pull/5

Which is funny, because as far as I can tell the Events are raised using the meta of the provided EndPointsLock: https://github.com/kubernetes/client-go/blob/00dbcca6ee44c678754d3f5fda1bd0e704b26fe2/tools/leaderelection/resourcelock/endpointslock.go#L100, and lo and behold, we do set the proper namespace in the lock: https://github.com/nokia/danm/blob/master/cmd/svcwatcher/svcwatcher.go#L74

soo...
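To make the suspected mismatch concrete, here is a simplified standard-library model of it. This is not client-go code: `objectMeta`, `eventSink`, `eventNamespace`, `record`, and the lock name `danm-svcwatcher` are all hypothetical stand-ins. The only things taken from the log above are the error text and the two namespaces. The sketch assumes that an object with an empty namespace is treated as living in "default" and that a namespace-scoped sink refuses events from any other namespace.

```go
package main

import "fmt"

// defaultNamespace models the convention that an empty namespace means "default".
const defaultNamespace = "default"

// objectMeta is a hypothetical stand-in for the metadata of the lock object.
type objectMeta struct {
	Name      string
	Namespace string
}

// eventSink is a hypothetical stand-in for a namespace-scoped events client.
type eventSink struct {
	Namespace string
}

// eventNamespace returns the namespace an event about obj would be filed under.
func eventNamespace(obj objectMeta) string {
	if obj.Namespace == "" {
		return defaultNamespace
	}
	return obj.Namespace
}

// record rejects events whose namespace differs from the sink's own namespace.
func (s eventSink) record(obj objectMeta) error {
	ns := eventNamespace(obj)
	if s.Namespace != "" && ns != s.Namespace {
		return fmt.Errorf("can't create an event with namespace '%s' in namespace '%s'", ns, s.Namespace)
	}
	return nil
}

func main() {
	sink := eventSink{Namespace: "kube-system"}

	// Metadata without a namespace: falls back to "default" and is rejected.
	fmt.Println(sink.record(objectMeta{Name: "danm-svcwatcher"}))

	// Metadata carrying the component's namespace: accepted.
	fmt.Println(sink.record(objectMeta{Name: "danm-svcwatcher", Namespace: "kube-system"}))
}
```

If this model matches what happens, the failure would mean the metadata handed to the recorder had an empty namespace at the moment the event was raised, even though the lock itself is configured with the right one.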

Levovar commented 4 years ago

I guess others also have issues with the library :) https://bugzilla.redhat.com/show_bug.cgi?id=1842002

Levovar commented 4 years ago

@TothFerenc any comments on the above? I'm kind of of the opinion that this is how stuff works, and we just need to live with it.

TothFerenc commented 4 years ago

Maybe we can create a new TODO issue about log module harmonization (use the same logging engine across all DANM components), and this issue can depend on it. Of course, I will close this issue if the client libraries are fixed in the meantime.