Open jizhuozhi opened 1 year ago
Hello, our calling relationship is A->B, and B has two instances (B1, B2). The TTL is set to 30s, and requests are continuously sent to A. If B2 goes offline, A's cache is not updated. As a result, an error occurs when A calls B2. In this case, traffic must be disabled for A's cache to be updated. I think this is a serious bug that needs to be fixed.
Hello, @xinlunanxinlunan.
First of all, this is not a bug; what you need is a health check before invoking a remote service instance.
Secondly, refreshing means that if there is a new list, we replace the old one; but it would be irresponsible to clear the existing list when a new one cannot be obtained.
In your scenario, if 50% of the instances are unavailable and we keep using the historical list when a new service list cannot be obtained, availability only drops to 50%. But if all cached instances are simply expired, then no instances can be obtained at all and availability is 0%.
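The keep-the-last-known-list behaviour described here can be sketched in plain Java (an illustration only; `InstanceListHolder` and string instance names are hypothetical stand-ins, not Spring Cloud classes):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Sketch: on refresh, replace the cached list only when the registry
// returns a new one; on failure, keep serving the last known list.
class InstanceListHolder {
    private final AtomicReference<List<String>> lastKnown =
            new AtomicReference<>(List.of());

    void refresh(Supplier<List<String>> fetchInstances) {
        try {
            // New list obtained from the registry: replace the old one.
            lastKnown.set(fetchInstances.get());
        } catch (RuntimeException e) {
            // Registry unreachable: it would be irresponsible to clear
            // the list, so the stale instances stay available.
        }
    }

    List<String> instances() {
        return lastKnown.get();
    }
}
```

With this behaviour, a registry outage freezes the instance list instead of emptying it, which is exactly the 50% vs. 0% availability difference described above.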
Finally, this issue is not for discussing when to update; you'd better create a new issue. I will not continue discussing this topic in this one.
@jizhuozhi @OlgaMaciaszek Hey. I have just stumbled upon this issue. The K8s API got overloaded and started returning 429. At the same time the cache expired and got cleared, so for some time we were left with no instances for the load balancer (even though there were healthy instances).
Is there any known workaround for this?
I was thinking about using the HealthCheck load balancer with:

spring.cloud.loadbalancer.health-check.refetch-instances: true
spring.cloud.loadbalancer.health-check.refetch-instances-interval: 30s
spring.cloud.loadbalancer.health-check.repeat-health-check: false
spring.cloud.loadbalancer.health-check.update-results-list: false

This seems like it should behave just like a 30s cache, but it would not clear the cache after the refetch interval. WDYT?
Hello, @sobelek.
What you need in this scenario is not a scheduled HealthCheck API, but the same thing as the core of this proposal: if the registry goes down, none of the client instances can submit a request to the service provider and the service is completely unavailable, whereas with a refreshing cache this has no effect.
And a scheduled HealthCheck API could be implemented by composing the Caching and HealthCheck suppliers.
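The composition idea can be illustrated with a plain-Java decorator (hypothetical names; in Spring Cloud LoadBalancer itself the analogous chain is built from delegating ServiceInstanceListSupplier implementations):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Sketch: a health-check decorator over an instance-list supplier.
// The delegate (e.g. a caching supplier) owns where the list comes
// from; this layer only filters out instances that fail the check.
class HealthCheckingSupplier implements Supplier<List<String>> {
    private final Supplier<List<String>> delegate;
    private final Predicate<String> isHealthy; // e.g. a ping-pong probe

    HealthCheckingSupplier(Supplier<List<String>> delegate,
                           Predicate<String> isHealthy) {
        this.delegate = delegate;
        this.isHealthy = isHealthy;
    }

    @Override
    public List<String> get() {
        return delegate.get().stream()
                .filter(isHealthy)
                .collect(Collectors.toList());
    }
}
```

Note the boundary: the health check filters a list it is given; responsibility for obtaining the list stays with the delegate.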
Hey. I know, I would prefer a refreshAfterWrite cache to be available, but as it is not, I am looking into other possible solutions to achieve the same result.
Hello, @sobelek . Thanks for your reply.
We may need to align on a boundary issue. Generally speaking, a health check checks the health status of an instance it is given from the instance list, but it has no responsibility to change the instance list itself. If the logic of refetching instances is incorporated into health checks, the boundary between service discovery and health checking is broken, and the responsibility of health checks is no longer single.
Of course, I tried to abstract your solution. In fact, what we need is still an event source that triggers an event regularly to notify the loadbalancer that the instance list needs to be refreshed, so our solution is the same.
Fully agree with a boundary between discovery and loadbalancing. Thanks for clarifying this.
Event-driven discovery or a refreshing cache would both be great and would resolve the original issue 👍
Hello, @sobelek.
As an aside, in my experience, usually only application-layer protocols that support multiplexing (such as Apache Dubbo) perform health checks on the client side, because only a simple ping-pong request is required. But for HTTP/1 (Spring MVC or WebFlux), because each request is a new TCP connection, when the number of instances is very large the health check may issue a large number of in-flight requests and leave lots of CLOSE_WAIT connections after the requests end. As a result, normal requests cannot be processed. The same is true on the server side.
@jizhuozhi
Yea, true that. We are actually using k8s so healthchecking is done by k8s out of the box.
Back to the original issue. Do you have any idea whether you guys are going to work on a refreshAfterWrite cache for the Spring load balancer?
Hello, @sobelek . There is no out-of-the-box solution for all CacheManager implementations, but Caffeine provides refreshAfterWrite out of the box. So you could create a custom CacheManager based on CaffeineBasedLoadBalancerCacheManager wired to your service discovery client.
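For context, Caffeine's refreshAfterWrite serves the current (possibly stale) value while reloading asynchronously in the background. Below is a stripped-down, single-entry illustration of that semantic (not Caffeine's implementation; the class name is invented):

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Single-entry refresh-after-write sketch: reads past the refresh
// interval return the current (possibly stale) value immediately and
// trigger an asynchronous reload, so callers never block and the entry
// is never evicted merely because it aged out.
class RefreshAfterWriteValue<T> {
    private final Supplier<T> loader;
    private final long refreshNanos;
    private final AtomicReference<T> value = new AtomicReference<>();
    private final AtomicLong writtenAt = new AtomicLong();

    RefreshAfterWriteValue(Supplier<T> loader, Duration refreshAfterWrite) {
        this.loader = loader;
        this.refreshNanos = refreshAfterWrite.toNanos();
        this.value.set(loader.get());
        this.writtenAt.set(System.nanoTime());
    }

    T get() {
        T current = value.get();
        if (System.nanoTime() - writtenAt.get() > refreshNanos) {
            // Stale: reload in the background, keep serving the old value.
            CompletableFuture.runAsync(() -> {
                value.set(loader.get());
                writtenAt.set(System.nanoTime());
            });
        }
        return current;
    }
}
```

A real CacheManager would hold one such entry per service ID, with the loader delegating to the discovery client.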
Thanks, I am trying to do this. Would you like me to also make a PR here when finished?
Of course, welcome. This Issue has been assigned to @OlgaMaciaszek. Do you have any good suggestions @OlgaMaciaszek ?
Thanks, @jizhuozhi, makes sense; however, I'd like to see a separate property for the refresh interval (can be the value you've proposed as default, but needs to be customisable).
@sobelek as indicated in the docs, the HealthCheckServiceInstanceListSupplier should not be used together with the CachingServiceInstanceListSupplier, as it has its own caching mechanism. Please feel free to provide a PR for this feature.
@sobelek Please confirm if you will be submitting this PR.
@OlgaMaciaszek I would love to. I don't have much time currently, but I will try to do it in the coming weeks. I am not sure yet about how to set a loader for the cache, but I will do a PoC so we can discuss it.
Ok, please do and ping me here.
Is your feature request related to a problem? Please describe. Spring Cloud LoadBalancer currently only supports expiring caches: after a period of time the cache entry becomes invalid, and when the same content is requested again, a blocking request is re-submitted to the registry to obtain the service instances. In an otherwise smooth service this causes latency spikes.
In addition to causing spikes, this also poses a challenge to high availability. When the cache expires, all cache entries are evicted, and at that moment there are no related service instances in memory. If the registry goes down, none of the client instances can submit a request to the service provider, and the service is completely unavailable. With a refreshing cache this would have no effect: if the registry goes down, we just don't get the latest data, and slightly stale data is better than no data (in fact, service instances change very infrequently, so the impact of old data is negligible).
Describe the solution you'd like Based on this, I think it is necessary to provide a refreshing cache instead of just an expiring cache.
A feasible solution is to support configuring a refresh interval and use DiscoveryClient to fetch service instances in the background, while also supporting a configurable expiration time so that truly expired data is not used for too long (keeping the problem visible).
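A configuration shape for this could look like the following sketch. The refresh-interval property is hypothetical (it does not exist in Spring Cloud LoadBalancer today), while spring.cloud.loadbalancer.cache.ttl is the existing expiry setting:

```yaml
spring:
  cloud:
    loadbalancer:
      cache:
        ttl: 5m                 # existing property: hard expiry as a safety net
        refresh-interval: 30s   # hypothetical: background refresh via DiscoveryClient
```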