jetstack / navigator

Managed Database-as-a-Service (DBaaS) on Kubernetes
Apache License 2.0
271 stars 31 forks source link

Pilot starts subprocess before pilot leader election is complete #221

Open wallrj opened 6 years ago

wallrj commented 6 years ago

I consider it a bug, that all the unit and E2E tests pass, despite these errors in the logs:

I0124 15:43:11.246] I0124 15:42:19.729088       1 request.go:624] Error in request: resource name may not be empty
I0124 15:43:11.246] E0124 15:42:19.729325       1 leaderelection.go:224] error retrieving resource lock test-cassandra-1516808238-24756/: resource name may not be empty
I0124 15:43:11.246] I0124 15:42:19.729779       1 leaderelection.go:180] failed to acquire lease test-cassandra-1516808238-24756/
I0124 15:43:11.246] I0124 15:42:23.100184       1 request.go:624] Error in request: resource name may not be empty
I0124 15:43:11.246] E0124 15:42:23.100654       1 leaderelection.go:224] error retrieving resource lock test-cassandra-1516808238-24756/: resource name may not be empty
I0124 15:43:11.246] I0124 15:42:23.100956       1 leaderelection.go:180] failed to acquire lease test-cassandra-1516808238-24756/
I0124 15:43:11.247] I0124 15:42:27.175958       1 request.go:624] Error in request: resource name may not be empty
I0124 15:43:11.247] E0124 15:42:27.179353       1 leaderelection.go:224] error retrieving resource lock test-cassandra-1516808238-24756/: resource name may not be empty
I0124 15:43:11.247] I0124 15:42:27.179671       1 leaderelection.go:180] failed to acquire lease test-cassandra-1516808238-24756/

I think we should discuss whether the Pilot should exit unless leader election succeeds within a timeout. On the other hand, perhaps it's a good thing that the database keeps running regardless of bugs in the Pilot. Not sure.

There's an additional leaderelection hook that we could use to wait for successful leader election.

    // OnNewLeader is called when the client observes a leader that is
    // not the previously observed leader. This includes the first observed
    // leader when the client starts.
    OnNewLeader func(identity string)

/kind bug

munnerz commented 6 years ago

As you say, #219 should resolve the e2e test passing problem there.

I think blocking subprocess start on there being a Pilot is a separate enhancement/feature, and has its own nuances that need to be discussed. Whilst we don't have a requirement for this functionality though I think it should be kept on the backlog. If some other new feature does depend on this, or we start seeing failures as a result of it, then we should reconsider.

Namely, I think we need to add 'can be elected conditions' to Pilots (e.g. this Pilot is not a candidate because it isn't healthy). Right now any Pilot can become leader so long as it is running in some form. But that seems like a separate discussion.