The old service instance (inactive) is not dropped from prometheus

pnowy commented 6 years ago

Hi,

I'm not sure whether this is a bug here or in Prometheus (or an issue with my configuration), but I decided to ask here. The problem described below occurs both locally and on a k8s cluster on GCP.

I've used the eureka consul adapter to expose services (the config below is from my local environment). The prometheus.yaml config is the following:

  - job_name: eureka
    metrics_path: '/actuator/prometheus'
    consul_sd_configs:
      - server: 'localhost:8761'

I register a single service on Eureka (let's call it my-service). The endpoint /v1/catalog/service/my-service returns a single instance and Prometheus scrapes it. So far so good.

In the next step I stop the service locally, which unregisters it from Eureka automatically. The endpoint /v1/catalog/service/my-service returns an empty array, but Prometheus still keeps this endpoint in its targets and displays it with the down status. When I start a new instance, which is registered automatically, Prometheus notices it and shows the new instance with green status.

The problem is that the old instances (removed from Eureka) are not removed as targets from Prometheus, which could cause alarms and similar problems. I will add that I've changed the request-timeout property on the eureka service, so that is not the problem (it is mentioned in the documentation as a replacement for long polling in Consul).
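To make the mismatch described above easier to check, here is a minimal Python sketch that compares the adapter's catalog response with the active targets reported by Prometheus. It assumes Prometheus listens on localhost:9090 (not stated in this thread), that the `instance` label is the default `host:port`, and that only one service feeds the `eureka` job. The helper names are made up for illustration.

```python
import json
from urllib.request import urlopen


def catalog_instances(entries):
    """Return the 'host:port' strings listed in a Consul-style catalog response."""
    instances = set()
    for entry in entries:
        # Fall back to the node Address when ServiceAddress is empty.
        host = entry.get("ServiceAddress") or entry.get("Address")
        instances.add(f"{host}:{entry['ServicePort']}")
    return instances


def stale_targets(active_targets, entries, job="eureka"):
    """List targets of `job` whose instance is no longer in the catalog."""
    registered = catalog_instances(entries)
    return sorted(
        t["labels"]["instance"]
        for t in active_targets
        if t["labels"].get("job") == job
        and t["labels"]["instance"] not in registered
    )


def fetch_json(url):
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)


def check(adapter="http://localhost:8761",
          prometheus="http://localhost:9090",  # assumed default port
          service="my-service"):
    """Fetch both views and return the targets Prometheus should have dropped."""
    entries = fetch_json(f"{adapter}/v1/catalog/service/{service}")
    targets = fetch_json(f"{prometheus}/api/v1/targets")["data"]["activeTargets"]
    return stale_targets(targets, entries)
```

Calling `check()` after stopping the service should return an empty list once Prometheus has dropped the target; a non-empty list means the stale target is still there.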

The same problem occurs on the k8s cluster, but with pods of course.

Any idea what the problem could be? Thanks

tine2k commented 6 years ago

I could not reproduce your problem. In my tests Prometheus either removes the target right away or shows it as 'DOWN' only for a little while during shutdown of the service. Either way the service is gone after at most 30s or so. Even if I SIGKILL the service, Eureka eventually notices that the service is gone and removes it from the service array on the endpoint, and Prometheus correctly removes it from the target list. Could this be a problem with your Prometheus version?

pnowy commented 6 years ago

Well - I've tested it with Prometheus 2.2.1 (we have this version on our clusters).

I will check it with the latest version today (I assume that you have tested it with the latest version available).

tine2k commented 6 years ago

Yes, I tested with 2.3.2.

barrycommins commented 6 years ago

I can reproduce it with this project: https://github.com/barrycommins/prometheus-eureka-spring-boot-demo

I have no idea why it happens though.

barrycommins commented 6 years ago

I've looked into this a bit further, because I've seen it happen a few times now.

Everything seems to work fine when a service cleanly deregisters: it is removed quickly. However, when I SIGKILL the target application, it never deregisters.

The last emitted change counter stays the same before and after Eureka times out and removes the instance.

2018-09-02 19:19:49.206 DEBUG 11503 --- [nio-8761-exec-1] a.t.e.a.c.service.ServiceChangeDetector  : Last emitted change counter of services 1535912169075
2018-09-02 19:19:55.753  INFO 11503 --- [a-EvictionTimer] c.n.e.registry.AbstractInstanceRegistry  : Running the evict task with compensationTime 0ms
2018-09-02 19:19:55.753  INFO 11503 --- [a-EvictionTimer] c.n.e.registry.AbstractInstanceRegistry  : Evicting 1 items (expired=1, evictionLimit=1)
2018-09-02 19:19:55.753  WARN 11503 --- [a-EvictionTimer] c.n.e.registry.AbstractInstanceRegistry  : DS: Registry: expired lease for PROMETHEUS-DEMO-APP/192.168.0.60:prometheus-demo-app
2018-09-02 19:19:55.753  INFO 11503 --- [a-EvictionTimer] c.n.e.registry.AbstractInstanceRegistry  : Cancelled instance PROMETHEUS-DEMO-APP/192.168.0.60:prometheus-demo-app (replication=false)
2018-09-02 19:20:17.566 DEBUG 11503 --- [tionScheduler-5] a.t.e.a.c.service.ServiceChangeDetector  : Last emitted change counter of service PROMETHEUS-DEMO-APP: 1535912169075

I've tried this with the latest version of Prometheus (and even building from source to see what index values the Consul service discovery mechanism was sending).
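To illustrate why an unchanged counter would matter, here is a toy Python model of an index-based watcher. It is not the adapter's or Prometheus's actual code: `ToyCatalog` and `ToyWatcher` are hypothetical names, and the sketch simply assumes that the watcher refreshes its targets only when the index it receives has moved, and that eviction does not bump the index (which is what the log above appears to show).

```python
class ToyCatalog:
    """Hypothetical stand-in for the adapter: instances plus a change index."""

    def __init__(self):
        self.instances = set()
        self.index = 1

    def register(self, instance):
        self.instances.add(instance)
        self.index += 1  # the change is announced

    def cancel(self, instance):
        self.instances.discard(instance)
        self.index += 1  # a clean deregistration is announced

    def evict(self, instance):
        # Assumed behavior under test: the lease expires, the index stays put.
        self.instances.discard(instance)


class ToyWatcher:
    """Hypothetical watcher that refreshes targets only when the index moves."""

    def __init__(self):
        self.last_index = 0
        self.targets = set()

    def poll(self, catalog):
        if catalog.index == self.last_index:
            return False  # treated as "nothing changed"
        self.last_index = catalog.index
        self.targets = set(catalog.instances)
        return True


catalog, watcher = ToyCatalog(), ToyWatcher()
catalog.register("192.168.0.60:8080")  # port is made up for the example
watcher.poll(catalog)
catalog.evict("192.168.0.60:8080")
watcher.poll(catalog)
print(watcher.targets)  # the evicted instance is still listed
```

In this model the stale target stays until some other announced change moves the index. With `cancel` instead of `evict` it disappears on the next poll, which would match the clean-deregistration case.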

pnowy commented 5 years ago

Well - I think my problem could be caused by ENTRYPOINT /go/bin/myapp

With this shell form, Docker runs the command as /bin/sh -c 'command'. The intermediate shell gets the SIGTERM but doesn't forward it to the running server app, so the app never shuts down cleanly. To avoid the intermediate layer it should be the exec form: ENTRYPOINT ["/go/bin/myapp"]
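The difference between the two ENTRYPOINT forms can be sketched in a few lines of Python. This is a simplified illustration for Linux containers, not Docker's own parser; `entrypoint_argv` is a hypothetical helper that returns the argument vector Docker would start as PID 1.

```python
import json


def entrypoint_argv(instruction):
    """Return the argv started as PID 1 for an ENTRYPOINT line (simplified)."""
    keyword, _, value = instruction.strip().partition(" ")
    if keyword.upper() != "ENTRYPOINT":
        raise ValueError("not an ENTRYPOINT instruction")
    value = value.strip()
    if value.startswith("["):
        # Exec form: a JSON array, executed directly, so the app is PID 1
        # and receives SIGTERM itself.
        return json.loads(value)
    # Shell form: wrapped in a shell, which becomes PID 1 instead of the app.
    return ["/bin/sh", "-c", value]


print(entrypoint_argv("ENTRYPOINT /go/bin/myapp"))
print(entrypoint_argv('ENTRYPOINT ["/go/bin/myapp"]'))
```

In the shell form the app is a child of /bin/sh, so a SIGTERM sent on container stop reaches the shell rather than the app.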

Details: https://stackoverflow.com/questions/37515686/stop-a-running-docker-container-by-sending-sigterm/37517806#37517806

twinformatics / eureka-consul-adapter
