Service heartbeats to prevent stale data

s4gh commented 8 years ago

I have read documentation at http://vertx.io/docs/vertx-service-discovery/java/ and it does not cover how to prevent service registry from keeping stale data (service that are down). You have concept of "service publisher" who can change service status or remove it from registry. So it seems like it is responsibility of this "service publisher" to perform service heartbeats and keep service registry records up to date. From implementation stand point you can't use service instance itself as a publisher so for every running service there is a need in some external monitoring and "publishing".

It will be really helpful to have built in functionality to to some sort of service heartbeats to prevent stale data in the registry automatically. Similarly how it is implemented in Netxlix Eureka.

Thanks

cescoffier commented 8 years ago

Yes in the current design it's the responsibility to the publisher to:

remove the service (the service is removed form the registry)
mark it as out of service if it makes sense (the service is still in the registry)

Health support has been on the initial design of the service discovery, but dropped as we were thinking about implementing this separately and let the service publisher use this "external" feature. However, so far, I'm not aware of any work on this area.

What would you your opinion about this ? Should it be part of the service discovery, or something separated ?

s4gh commented 8 years ago

Actually I was planning to implement this myself and propose pull request but because of work load and personal things have not done this yet.

Frankly speaking I think it is absolutely needed in the simplest form. And the simplest form for me is ability to specify timeout when publishing your service in addition to current way of publishing a service without any timeout. After timeout service record is removed from the registry.

eagel commented 8 years ago

I think this is a good feature. But I can't absolutely agree with @sveryovka . There is no necessary to remove the registry. Instead, we can add a timestamp to recode. The timestamp is the latest heartbeat time. Service consumers can aware whether is stale. If remove it, who will do this. The best choice is cluster backend. I known the zookeeper have this feature, But not sure Hazelcast have this feature.

s4gh commented 8 years ago

@eagel What you are suggesting means that clients need to know much more than than they really supposed know. Let's look at the trivial example when ServiceA consumes ServiceB. Are you say that if ServiceA gets a list of registered instances of ServiceB it needs to check which of them are still live by comparing their time stamps to select instance with latest timestamp? But this means all the request will be sent to the single instance. And the main problem that ServiceA needs to know how often ServiceB sends its heartbeats to know if specific registered instance it is still alive or already "dead". I am sorry but for me this sound like a horrible idea since ServiceA should now know anything about implementation details of ServiceB like schedule for its heartbeats.

eagel commented 8 years ago

@sveryovka We can implement this in client API layer, The client API user doesn't need do or know any thing for this. And the service side add two field to Record. The first one is "last timestamp", It use to record the last update from service side. This just like a heartbeats. And another this "expire after", the server side provide this value to tell client API the service will may be unavailable after some time. The client API should not return this Record for retrieving. This implementation don't need service consumer do any thing or know any thing, It will be handled by client API . The service provide need configure the "expire after" and "heartbeats interval" to update "last timestamp". And other benefits, The Record always be write by service provider. The concurrency issues will never happen. If removing the record. The problem is who remove it. The service provide can't do this, because of it may be already crashed. The service consumer do this, we must face on concurrency issues. For example, The service consumer found a issue Record, before service consumer remove it. The service provider have been restarted. The service consumer heartbeats make the Record be available again. But at this time, There service consumer remove the Recode. The service can't be found any more. If the cluster backend do this. It must require every backend to support this. This feature just use to check the health of service provider. It help client to know which instant is health, can be used. The client load-balancing doesn't base on this. The client can use round-robin, lowest latency first or random selection.

s4gh commented 7 years ago

The way you describe implementation on the client side can definitely work. benefit of such approach is that implementation of service-registry is simpler. The downside is that it requires solving same task (including all corner cases) in each of the client implementations.

The reason of suggesting to have this functionality on back-end (service registry itself or in some library) is so that there were no need to implement same exact functionality over and over again in each of the clients. Another thing is that with client-side implementation somebody will always forget to implement one of the rare corner case and it may result in some hard to trace problems. These are the reason why somebody will want to extract this logic to one single place.

And you are definitely right that to implement such logic inside vertx-service-discovery there has to be a way to make it working for all implementations. Unfortunately I am not familiar with all of currently supported registry back-ends to be able to say how many of them support such feature naively.

victormp2711 commented 2 months ago

This should be simple, in the service discovery code, you implement a heartbeat with these steps: 1 - have a time scheduler, say 500 millisecond.

2 - in the scheduler handler you call the getServices(records) routine of the discovey service.

3 - with the list of records iterates over this by calling (httpClient or WebClient) to each of the services, and depending on the response from http(code) the status of the service in Discovery service is updated. What do you think?

vert-x3 / vertx-service-discovery

Service heartbeats to prevent stale data #32