spring-cloud / spring-cloud-commons

Common classes used in different Spring Cloud implementations
Apache License 2.0
704 stars 700 forks source link

LoadBalancerCacheManager supports refresh cache, not just expire cache #1215

Open jizhuozhi opened 1 year ago

jizhuozhi commented 1 year ago

Is your feature request related to a problem? Please describe. Spring Cloud LoadBalancer now only supports expired caching, that is, after a period of time, the cache item will become invalid, and when the same content is requested again, it will re-submit a request to the registry blockly to obtain the service instance. In smooth service there will be spikes.

In addition to causing spikes, this also poses a challenge to high availability. When the cache expires, all cache items are evicted. At this time, there are no related service instances in memory. If the registry goes down, the entire None of the client instances can submit a request to the service provider, and the service is completely unavailable. In a refresh cache, this will have no effect. If the registry goes down, we just don't get the latest data, and slightly older data is better than no data (in fact, the service instance changes very frequently low, the impact of old data is negligible).

Describe the solution you'd like Based on this, I think it is necessary to provide refresh cache instead of just expire cache

A feasible solution is to support configuring the refresh interval and use DiscoveryClient to grab service instances, and at the same time support configuring the expiration time to avoid using expired data for a long time (exposing the problem)

setCaffeine(Caffeine.newBuilder()
    .initialCapacity(properties.getCapacity())
    .refreshAfterWrite(properties.getTtl().dividedBy(2))
    .expireAfterWrite(properties.getTtl())
    .softValues());
setCacheLoader((key) -> discoveryClient.getInstances((String) key));
xinlunanxinlunan commented 1 year ago

Hello, our calling relationship A->B, B has two instances (B1,B2). Set the ttl to 30s, and continue to send requests to A. If B2 is offline, A's cache is not updated. As A result, an error occurs when A calls B2. In this case, traffic must be disabled for the cache of A to be updated. I think this is a very big bug that needs to be updated or not.

jizhuozhi commented 1 year ago

Hello, @xinlunanxinlunan.

First of all, this is not a bug, what you need is a health check before invoking remote service instance.

Secondly, refreshing means that if there is a new list, we will replace it, but it is irresponsible to clear it if the new list cannot be obtained.

In your scenario, if 50% of the instances are unavailable, then use the history list when the new service list cannot be obtained, and the availability rate will only drop to 50%, but if all the instances are directly expired, then the service cannot be obtained, the availability rate is 0%.

Finally, this Issue is not for discussing when to update, you'd better create a new issue, I will not continue to discuss this topic in this.

sobelek commented 11 months ago

@jizhuozhi @OlgaMaciaszek Hey. I have just stumbled upon this issue. K8s API got overloaded and started returning 429. In the same time cache expired and got cleared so for some time we were left with no instances for loadbalancer (Even tho there were healthy instances).

Is there any known walk-around for this?

I was thinking about using HealthCheck loadbalancer with:

This seems like that should behave just like 30s cache but it would not clear cache after refetch-interval. WDYT?

jizhuozhi commented 11 months ago

Hello, @sobelek.

What your need in this scene is not scheduling HealthCheck api, but as same as the core of this Proposal

If the registry goes down, the entire None of the client instances can submit a request to the service provider, and the service is completely unavailable. In a refresh cache, this will have no effect.

And scheduling HealthCheck api could be implemented via composing Caching and HealthCheck suppliers.

sobelek commented 11 months ago

Hey. I know I would prefer refreshWrite cache to be available but as it is not I am looking into other possible solution to achieve the same result

jizhuozhi commented 11 months ago

Hello, @sobelek . Thanks for your reply.

We may need to align a boundary issue. Generally speaking, health checks check the health status of an instance when given one from instance list, but it has no responsibility to change the instance list. If the logic of refetching instances is incorporated into health checks, the boundary between service discovery and health checks will be broken, and the responsibilities of health checks will no longer be single.

Of course, I tried to abstract your solution. In fact, what we need is still an event source that triggers an event regularly to notify the loadbalancer that the instance list needs to be refreshed, so our solution is the same.

sobelek commented 11 months ago

Fully agree with a boundary between discovery and loadbalancing. Thanks for clarifying this.

Event driven discovery or refreshing cache would both be great and would resolve to original issue👍

jizhuozhi commented 11 months ago

Hello, @sobelek.

As an aside, and in my experience, usually only application layer protocols that support multiplexing (such as Apache Dubbo) will perform health checks on the client side, because only a simple ping-pong request is required. But for HTTP/1 (Spring MVC or Webflux), because each request is a new TCP connection, when the number of instances is very large, the health check may issue a large number of on-the-fly requests and leave lots of CLOSE_WAIT connections after the request ends. As a result, normal requests cannot be processed. The same is true for the server side.

sobelek commented 11 months ago

@jizhuozhi

Yea, true that. We are actually using k8s so healthchecking is done by k8s out of the box.

Back to original issue. Do you have any idea if you guys are going to work on refreshAfter cache for spring loadbalancer?

jizhuozhi commented 11 months ago

Hello, @sobelek . There is no out box solution for all CacheManager, but Caffeine provided refreshAfterWrite out of box. So you could custom a new CacheManager based on CaffeineBasedLoadBalancerCacheManager with your service discovery client.

sobelek commented 11 months ago

Thanks, I am trying to do this. Would you like me to also make a PR here when finished?

jizhuozhi commented 11 months ago

Thanks, I am trying to do this. Would you like me to also make a PR here when finished?

Of course, welcome. This Issue has been assigned to @OlgaMaciaszek. Do you have any good suggestions @OlgaMaciaszek ?

OlgaMaciaszek commented 10 months ago

Thanks, @jizhuozhi, makes sense; however, I'd like to see a separate property for the refresh interval (can be the value you've proposed as default, but needs to be customisable). @sobelek as indicated in the docs, the HealthCheckServiceInstanceListSupplier should not be used together with the CachingServiceInstanceListSupplier as it has it's own caching mechanism. Please feel free to provide a PR for this feature.

OlgaMaciaszek commented 10 months ago

@sobelek Please confirm if you will be submitting this PR.

sobelek commented 10 months ago

@OlgaMaciaszek I would love to. Don't have much time currently but will try to do it in coming weeks. I am not sure yet about how to set a loader for cache but will do PoC so we can discuss it

OlgaMaciaszek commented 10 months ago

Ok, please do and ping me here.