ltagliamonte-dd opened 4 years ago
I believe what's happening is that although we're blocking on the "initial sync" here: https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/syncer.go#L133, `func (s *ConsulSyncer) Sync(rs []*api.CatalogRegistration)` is called by:

```go
// sync calls the Syncer.Sync function from the generated registrations.
//
// Precondition: lock must be held
func (t *ServiceResource) sync() {
	// NOTE(mitchellh): This isn't the most efficient way to do this and
	// the times that sync are called are also not the most efficient. All
	// of these are implementation details so lets improve this later when
	// it becomes a performance issue and just do the easy thing first.
	rs := make([]*consulapi.CatalogRegistration, 0, len(t.consulMap)*4)
	for _, set := range t.consulMap {
		rs = append(rs, set...)
	}

	// Sync, which should be non-blocking in real-world cases
	t.Syncer.Sync(rs)
}
```
which is called from a number of locations:

- `func (t *ServiceResource) Upsert(key string, raw interface{}) error`
- `func (t *ServiceResource) doDelete(key string)`
- `func (t *serviceEndpointsResource) Upsert(key string, raw interface{}) error`
- `func (t *serviceEndpointsResource) Delete(key string) error`
When the process first starts, `Upsert` is called for the first service, which then calls `sync` with only a single service in `t.consulMap`.
I'm curious if you've set the `-consul-write-interval` to anything custom?
@lkysow yes, the `-consul-write-interval` is set to 10s.
I've started looking into this issue. The code is tricky to understand/follow, to be honest, but as you can see from this debug log:
```
2020-09-24T13:23:14.655-0700 [DEBUG] to-consul/source: [generateRegistrations] generating registration: key=merchant-tax-service/merchant-tax-service-web
2020-09-24T13:23:14.655-0700 [DEBUG] to-consul/source: generated registration: key=merchant-tax-service/merchant-tax-service-web service=merchant-tax-service-web namespace= instances=1
...
2020-09-24T13:23:14.663-0700 [INFO] to-consul/sink: Service under test:: service=java-service-template-web
2020-09-24T13:23:14.663-0700 [INFO] to-consul/sink: Service Set:: EXTRA_VALUE_AT_END=Set{merchant-tax-service-web}
2020-09-24T13:23:14.663-0700 [INFO] to-consul/sink: invalid service found, scheduling for delete: service-name=java-service-template-web service-consul-namespace=
```
At bootstrap the Service Set is not yet fully populated, which makes the `watchReapableServices` function schedule essentially every service for deletion until the Service Set converges and contains all the services.
Tracing the code path:

- `watchReapableServices` gets started here: https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/syncer.go#L141
- `watchReapableServices` seems to wait for a lock: https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/syncer.go#L168
- the lock gets unlocked here: https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/syncer.go#L132
- the `rs []*api.CatalogRegistration` at bootstrap is empty: https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/syncer.go#L109
- the `Sync` method is executed here: https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/resource.go#L670
- `rs []*api.CatalogRegistration` gets populated from `t.consulMap` here: https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/resource.go#L665
- `t.consulMap` is populated by `generateRegistrations`: https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/resource.go#L315
- `generateRegistrations` is invoked by `Upsert`: https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/resource.go#L728
- `Upsert` is part of the controller, invoked by the `processSingle` method: https://github.com/hashicorp/consul-k8s/blob/master/helper/controller/controller.go#L175
- `processSingle` is part of the controller `Run` method: https://github.com/hashicorp/consul-k8s/blob/master/helper/controller/controller.go#L123
- the `Run` method is executed in the command.go file: https://github.com/hashicorp/consul-k8s/blob/master/subcommand/sync-catalog/command.go#L290
After following "the data", I believe the condition for running the controller is to have the cache in sync and all Consul registrations ready, but the two depend on each other: you can't complete the registrations if you don't run the controller.
@lkysow would love to hear from you about this analysis.
I've also noticed some things in the overall design that seem incorrect:

- The `Service` and `Endpoints` cache resync periods have been set to 0 (no re-sync); this will create incoherence in the data when events are missed. (See the sketch after this list.)
- In the to-consul sync, both the Consul and Kubernetes data sources are used as sources of truth. Should we use only the Kubernetes one, and in the control loop only try to align the current Consul state with the Kubernetes one?
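(To illustrate the first point: with client-go, a non-zero resync period can be set when the informers are built. This is only a sketch of the general mechanism, not a patch against consul-k8s, and the 5-minute value is an arbitrary example.)

```go
import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// buildInformers is a hypothetical helper showing the general client-go
// knob: a non-zero resync period makes the informers periodically replay
// their full cache, so a missed watch event is eventually corrected.
func buildInformers(client kubernetes.Interface) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactory(client, 5*time.Minute)
}
```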
@lkysow does HashiCorp have any plans of fixing this issue? All services get removed every time the container is deployed/restarted, etc.
Hi, yes we would like to fix it. It's just a matter of getting to it.
Looks like this is confirmation of what I uncovered here: https://github.com/hashicorp/consul-k8s/issues/280#issuecomment-646327906
Is that correct?
Hi guys. :slightly_smiling_face:
We have faced an outage because of this behavior. Basically, if you are unlucky enough, you can cause an outage of about 20-40 seconds when you restart the consul client (CC) daemonset and the consul sync (CS) deployment.
From what I've found, the newly created CS pod will reap all services, and if you're having a bad day the CC pod becomes unavailable at that moment. The result is that CS is not able to re-register the services, hence you are left with unregistered services. :slightly_frowning_face:
Just chiming in to say we've been hit by this too unfortunately - a huge number of our services became unavailable for 1-2 minutes causing a mini-outage.
It seems a quick fix could be to not use the host IP but rather a k8s Service that points to the Consul agent DaemonSet.
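(A rough sketch of that idea, assuming the sync process builds its client with Consul's Go `api` package; the Service address shown is a placeholder, not the chart's actual name:)

```go
import capi "github.com/hashicorp/consul/api"

// newConsulClient points the client at a Service in front of the Consul
// agents instead of the node's host IP, so a single agent pod restarting
// doesn't take the sync process's endpoint down with it.
// "consul-client.consul.svc:8500" is a made-up address for illustration.
func newConsulClient() (*capi.Client, error) {
	cfg := capi.DefaultConfig()
	cfg.Address = "consul-client.consul.svc:8500"
	return capi.NewClient(cfg)
}
```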
Would like to see a change here; the below shows our issues on a catalog sync restart/redeploy, etc.
We have protected the sync service from node scaling and given it more CPU and memory than necessary to avoid any OOM crashes, but it's still liable to do this on redeployment, which can have a major impact on services with a high hit rate.
I didn't find any way to fix the current code without major changes that would require multiple PRs and reviews. The biggest flaw, I think, is that the sync considers both k8s and Consul as sources of truth instead of using only the k8s data.
The project also lacks other important features for me, like:
So I did a complete rewrite of the catalog sync that fixes all these major flaws. I'm working with my legal team on open-sourcing it and will keep you posted when that happens.
Thanks @ltagliamonte-dd, appreciate your candid feedback. We are aware there need to be some large changes within catalog sync to improve this experience, larger than what can be addressed with a simple PR. We do want to improve this over time but have been more focused on Service Mesh features on the Consul K8s side as of recent. We would be interested in working with you to see what a solution could look like!
Hey @david-yu, I appreciate you have different priorities but I feel like getting this fixed should be higher. Service mesh support is awesome but having a bug that causes a mini outage whenever a pod is restarted (potentially a fairly frequent event in k8s!) is nasty - especially for those of us using the sync on production clusters. 🙏
I'm exploring the possibility of writing a custom K8s-to-Consul sync controller along the lines of what @ltagliamonte-dd described above, as this behavior is effectively a hard blocker for our current plans to start gradually introducing Kubernetes into our environments.
@ltagliamonte-dd I take it that the work you described above was never open-sourced? If that's the case, would you be open to sharing any other notable implementation details or pitfalls that you encountered, particularly for some of the more advanced features like multi-cluster support? I'm also curious about the overall level of development and maintenance effort that has gone into the project. (Also, thank you for all of the analysis and notes you've already provided in this ticket—they are immensely helpful!)
I'm still evaluating some different possibilities for our use case and can't make many promises, but if I end up taking this route I will plan to explore the possibility of releasing that work as open source.
Update 2022-02-10: It looks like we'll be able to satisfy our long term requirements with some other ongoing investments in our service discovery and load balancing infrastructure, and that the delay workaround @CodyPubNub suggests below will cover us in the short term. Thank you to all who responded!
@ahamlinman we really want to open source the work, I just didn't have time to complete the to-dos for opening the repo. It has been almost 2 years that we've run our own version in production and we haven't had any problem with it; I've implemented all the features I posted above.
The game changer has been the use of the TX API, which has reduced Consul server usage by almost 50% on big blue/green deployments (a lot of IPs changing within a few ms).
One thing I want to stress is that in my version we use a flat network among multiple k8s clusters, we sync the pod IPs into Consul, and we use client-side LB, so we skip a lot of the traditional in-kernel networking that happens when you use `cluster.local` endpoints. I don't support `cluster.local` or external LBs.
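(For reference, a minimal sketch of what batching catalog registrations through Consul's transaction API could look like with the Go client; the node and service values are placeholders, and this is not code from the rewrite being discussed:)

```go
import (
	"fmt"

	capi "github.com/hashicorp/consul/api"
)

// registerBatch submits several service registrations in one /v1/txn call
// instead of one catalog PUT per service, which is what cuts server load
// during large blue/green rollovers. Node and service data are illustrative.
func registerBatch(client *capi.Client, node string, svcs []capi.AgentService) error {
	ops := make(capi.TxnOps, 0, len(svcs))
	for _, svc := range svcs {
		ops = append(ops, &capi.TxnOp{
			Service: &capi.ServiceTxnOp{
				Verb:    capi.ServiceSet,
				Node:    node,
				Service: svc,
			},
		})
	}
	ok, resp, _, err := client.Txn().Txn(ops, nil)
	if err != nil {
		return err
	}
	if !ok {
		return fmt.Errorf("txn rolled back: %+v", resp.Errors)
	}
	return nil
}
```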
Due to the instability caused by services being dropped when the catalog-sync container pod restarts or is moved to a different node, our team forked this project and added a hack which delays the initial reaping, giving the endpoints API time to be scraped and placed into the appropriate data structures:
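(A minimal sketch of that kind of delay, not the actual fork; `initialReapDelay` and `startReaper` are hypothetical names:)

```go
import (
	"context"
	"time"
)

// delayedReapStart holds off the first reap pass so the Kubernetes
// informers have time to repopulate the expected-service set before
// anything already in Consul gets scheduled for deregistration.
func delayedReapStart(ctx context.Context, initialReapDelay time.Duration, startReaper func()) {
	select {
	case <-time.After(initialReapDelay):
		startReaper()
	case <-ctx.Done():
	}
}
```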
Hi there, Consul PM here. We haven't actively worked on catalog-sync in quite some time; however, we are open to reviewing PRs if you would like us to review changes (such as adding support for utilizing the Txn API) that address this issue! We are open to contributions.
@CodyPubNub yeah, that was something I found as well, but the project per se is not production-ready by our standards; we can't run such an important piece of infra with no metrics.
@ltagliamonte-dd I'm interested in your catalog sync rewrite as well, sounds great and the network constraints are suitable for my use case. Hope to see you release it!
Current catalog sync behavior at bootstrap is to de-register and then re-register services in Consul. This is tremendously dangerous because it makes endpoints unavailable for a short period of time. Here is part of the bootstrap logs: