Catalog sync de-register before register

ltagliamonte-dd commented 4 years ago

Current catalog sync behavior at bootstrap is to de-register and than register service in Consul. This is tremendous dangerous because will make endpoints unavailable for a short period of time. here part of the bootstrap logs:

2020-09-02T15:20:17.992Z [INFO]  K8s namespace syncing configuration: k8s namespaces allowed to be synced=Set{*} k8s namespaces denied from syncing="Set{kube-public, infra, infra-system, kube-system}"
Listening on ":8080"...
2020-09-02T15:20:17.992Z [INFO]  to-consul/source: starting runner for endpoints
2020-09-02T15:20:18.092Z [INFO]  to-consul/controller: initial cache sync complete
2020-09-02T15:20:18.092Z [INFO]  to-consul/source.controller/endpoints: initial cache sync complete
... 
2020-09-02T15:20:18.718Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=mobile-bff-web service-consul-namespace=
2020-09-02T15:20:18.719Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=service-web service-consul-namespace=
2020-09-02T15:20:18.725Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=dd-service-web service-consul-namespace=
2020-09-02T15:20:18.792Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=grpc-order-tracker service-consul-namespace=
2020-09-02T15:20:18.794Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=service-web service-consul-namespace=
2020-09-02T15:20:18.796Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=kafka-rest-proxy-web service-consul-namespace=
2020-09-02T15:20:18.888Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=data-service-consumer service-consul-namespace=
2020-09-02T15:20:18.892Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=data-service-kafka-consumer service-consul-namespace=
2020-09-02T15:20:18.893Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=mobile-bff-web service-consul-namespace=
2020-09-02T15:20:18.898Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=service-worker service-consul-namespace=
2020-09-02T15:20:18.900Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=search-server service-consul-namespace=
2020-09-02T15:20:18.908Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=exp-service-web service-consul-namespace=
2020-09-02T15:20:18.987Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=ntf-service-grpc-synchronous service-consul-namespace=
2020-09-02T15:20:18.990Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=sy-service-web-search service-consul-namespace=
2020-09-02T15:20:18.998Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=consumer-web service-consul-namespace=
2020-09-02T15:20:19.004Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=fee-service-web service-consul-namespace=
2020-09-02T15:20:19.008Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=lct-service-web service-consul-namespace=
2020-09-02T15:20:19.088Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=realtime-feature-service-job-manager service-consul-namespace=
2020-09-02T15:20:19.089Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=sy-service-web service-consul-namespace=
2020-09-02T15:20:19.092Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=cdc-manager service-consul-namespace=
2020-09-02T15:20:19.092Z [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=ntf-service-web service-consul-namespace=

lkysow commented 4 years ago

I believe what's happening is that although we're blocking on "initial sync" here: https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/syncer.go#L133, func (s *ConsulSyncer) Sync(rs []*api.CatalogRegistration) is called by

// sync calls the Syncer.Sync function from the generated registrations.
//
// Precondition: lock must be held
func (t *ServiceResource) sync() {
    // NOTE(mitchellh): This isn't the most efficient way to do this and
    // the times that sync are called are also not the most efficient. All
    // of these are implementation details so lets improve this later when
    // it becomes a performance issue and just do the easy thing first.
    rs := make([]*consulapi.CatalogRegistration, 0, len(t.consulMap)*4)
    for _, set := range t.consulMap {
        rs = append(rs, set...)
    }

    // Sync, which should be non-blocking in real-world cases
    t.Syncer.Sync(rs)
}

which is called from a number of locations:

func (t *ServiceResource) Upsert(key string, raw interface{}) error
func (t *ServiceResource) doDelete(key string) {
func (t *serviceEndpointsResource) Upsert(key string, raw interface{}) error {
func (t *serviceEndpointsResource) Delete(key string) error {

When the process first starts, upsert is called for the first service, which then calls sync with a single service in t.consulMap.

I'm curious if you've set the -consul-write-interval to anything custom?

ltagliamonte-dd commented 4 years ago

@lkysow yes the -consul-write-interval is set to 10s.

ltagliamonte-dd commented 4 years ago

i've started looking into this issue, the code is a tricky to understand/follow to be honest, but as you can see from this debug log:

2020-09-24T13:23:14.655-0700 [DEBUG] to-consul/source: [generateRegistrations] generating registration: key=merchant-tax-service/merchant-tax-service-web
2020-09-24T13:23:14.655-0700 [DEBUG] to-consul/source: generated registration: key=merchant-tax-service/merchant-tax-service-web service=merchant-tax-service-web namespace= instances=1
...
2020-09-24T13:23:14.663-0700 [INFO]  to-consul/sink: Service under test:: service=java-service-template-web
2020-09-24T13:23:14.663-0700 [INFO]  to-consul/sink: Service Set:: EXTRA_VALUE_AT_END=Set{merchant-tax-service-web}
2020-09-24T13:23:14.663-0700 [INFO]  to-consul/sink: invalid service found, scheduling for delete: service-name=java-service-template-web service-consul-namespace=`

At the bootstrap the Service Set is not fully updated and this makes watchReapableServices function to basically schedule all the service for deletion until Service Set converges and contains all the service.
watchReapableServices get's started here https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/syncer.go#L141
watchReapableServices seems to wait for a lock https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/syncer.go#L168
the lock get unlocked here https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/syncer.go#L132
the rs []*api.CatalogRegistration at the bootstrap is empty https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/syncer.go#L109
the Sync method is executed here https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/resource.go#L670
rs []*api.CatalogRegistration gets populated here from t.consulMap https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/resource.go#L665
t.consulMap is populated by generateRegistrations https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/resource.go#L315
generateRegistrations is invoked by Upsert https://github.com/hashicorp/consul-k8s/blob/master/catalog/to-consul/resource.go#L728
Upsert is part of the controller, invoked at by processSingle method https://github.com/hashicorp/consul-k8s/blob/master/helper/controller/controller.go#L175
processSingle is part of the Controller Run method https://github.com/hashicorp/consul-k8s/blob/master/helper/controller/controller.go#L123
the Run method is executed in the command.go file https://github.com/hashicorp/consul-k8s/blob/master/subcommand/sync-catalog/command.go#L290

After following "the data" i believe that the condition to run the controller is to have the cache in sync, and all Consul registrations ready, but they both depend on each other, can't run the registration if you don't run the controller.

@lkysow would love to hear from you about this analysis.

I've also noticed some things in the overall design that seems incorrect:

the re-sync period of the Service and Endpoints caches has been set to 0 (no re-sync) this will create incoherence in the data when events are missed.
when enabling only to-consul sync both Consul and k8s data sources are used are source of truth, should we use only the kubernetes one and in the control loop only trying to align the current Consul state with kubernetes one?

ltagliamonte-dd commented 4 years ago

@lkysow does hashicorp have any plans of fixing this issue? All services gets removed everytime the container is deployed/restarted etc..

lkysow commented 4 years ago

Hi, yes we would like to fix it. It's just a matter of getting to it.

praymann commented 4 years ago

Looks like this is confirmation of what I uncovered here: https://github.com/hashicorp/consul-k8s/issues/280#issuecomment-646327906

Is that correct?

TomasKohout commented 3 years ago

Hi guys. :slightly_smiling_face:

We have faced outage because of this behavior. Basically when you are lucky enough you can cause outage of about 20-40 seconds when you restart consul client (CC) daemon set and consul sync (CS) deployment.

From what I've found out it seems that newly created pod of CS will reape all services and if you have a bad day in that moment CC pod will became unavailable. The result is that CS is not able to register services hence you are left with unregistered services. :slightly_frowning_face:

mogopz commented 3 years ago

Just chiming in to say we've been hit by this too unfortunately - a huge number of our services became unavailable for 1-2 minutes causing a mini-outage.

TomasKohout commented 3 years ago

It seems that quick fix could be not to use HOST IP but rather k8s Service that point to Consul Agents DS.

ruaridhangus-fbr commented 3 years ago

Would like to see a change here the below shows our issues on a catalogue sync restart / redeploy ect.

We have protected the sync service now from node scaling and giving it more than necessary CPU and Memory to avoid any OOM crashes but its still liable to do this on re-deployment which can have a major impact on services with a high hit rate.

ltagliamonte-dd commented 3 years ago

I didn't find any way to fix the current code without major changes that would require multiple PRs and reviews. The biggest flaw I think is in the fact that the sync considers both k8s and consul as sources of truth instead of using only k8s data.

The project also lacks of other important features for me, like:

use tx api, instead of single transaction (this reduces the CPU load on Consul servers of more than 70% in my use case)
support multi cluster
expose metrics for monitoring
use endpointSlices

So I did a complete rewrite of the catalog sync, that fixes all this major flaws. I'm working with my legal on open source it, will keep you posted, when this happens.

david-yu commented 3 years ago

Thanks @ltagliamonte-dd appreciate your candid feedback. We are aware there needs to be some large changes within catalog sync to improve this experience, larger than what can be addressed with a simple PR. We do want to improve this over time but have been more focused on Service Mesh features on the Consul K8s side as of recent. We would be interesting in working with you to see what a solution could look like!

mogopz commented 3 years ago

Hey @david-yu, I appreciate you have different priorities but I feel like getting this fixed should be higher. Service mesh support is awesome but having a bug that causes a mini outage whenever a pod is restarted (potentially a fairly frequent event in k8s!) is nasty - especially for those of us using the sync on production clusters. 🙏

ahamlinman commented 2 years ago

I'm exploring the possibility of writing a custom K8s-to-Consul sync controller along the lines of what @ltagliamonte-dd described above, as this behavior is effectively a hard blocker for our current plans to start gradually introducing Kubernetes into our environments.

@ltagliamonte-dd I take it that the work you described above was never open-sourced? If that's the case, would you be open to sharing any other notable implementation details or pitfalls that you encountered, particularly for some of the more advanced features like multi-cluster support? I'm also curious about the overall level of development and maintenance effort that has gone into the project. (Also, thank you for all of the analysis and notes you've already provided in this ticket—they are immensely helpful!)

I'm still evaluating some different possibilities for our use case and can't make many promises, but if I end up taking this route I will plan to explore the possibility of releasing that work as open source.

Update 2022-02-10: It looks like we'll be able to satisfy our long term requirements with some other ongoing investments in our service discovery and load balancing infrastructure, and that the delay workaround @CodyPubNub suggests below will cover us in the short term. Thank you to all who responded!

ltagliamonte-dd commented 2 years ago

@ahamlinman we really want to open source the work, I just didn't have time to complete the to-dos for opening the repo. It has been almost 2 years we ran our own version in production and we didn’t have any problem with it, i've implemented all the features I posted above.

The game changer has been the use of the TX api, that have reduced of almost 50% the consul server usage on big blue green deployments (a lot of IP change in few ms)

One thing I want to stress is that in my version we use a flat network among multiple k8s clusters and we sync in consul the pod IPs and we use client-side LB, so we skip a lot of traditional in kernel networking that happens when you use cluster.local endpoints. I don't support cluster.local or external lbs

CodyPubNub commented 2 years ago

Due to the instability caused by services being dropped when the catalog-sync container pod restarts or is moved to a different node, our team forked this project and added a hack which delays the initial reaping, giving the endpoints API time to be scraped and placed into the appropriate data structures:

david-yu commented 2 years ago

Hi there, Consul PM here. We haven't actively worked on catalog-sync in quite some time, however we are open to reviewing PRs if you believe you would like us to review changes (as in add support for utilizing the Tx api) that address this issue! We are open to contributions.

ltagliamonte-dd commented 2 years ago

@CodyPubNub yeah that was something I found as well, but the project per se for our standards it is not production ready, we can't run a such important piece of infra with no metrics.

srstsavage commented 2 years ago

@ltagliamonte-dd I'm interested in your catalog sync rewrite as well, sounds great and the network constraints are suitable for my use case. Hope to see you release it!

hashicorp / consul-k8s

Catalog sync de-register before register #319