smallrye / smallrye-stork

SmallRye Stork is a service discovery and client side-load balancing framework.
http://smallrye.io/smallrye-stork/
Apache License 2.0
77 stars 23 forks source link

If the discovery service is down calls will fail even with caching enabled #231

Closed vsevel closed 2 years ago

vsevel commented 2 years ago

Create a custom discovery service that extends CachingServiceDiscovery, integrating with an external service (e.g. consul). Define a refresh period of 10 seconds or so. Start the client application and server application and get the client to make a few calls on the server application (it will start by fetching the server url from the external discovery service). Now shut this external discovery service down, and try to get the client to continue calling the server app. After the refresh period, an attempt will be made to contact the discovery service, which will fail. As a result all business calls will fail, until the discovery service is restored. The access to the discovery service should be more resilient (although the discovery service should be very resilient in itself, since it is a central and critical service), especially when services have been successfully fetched at least once. The proposed enhancement would be to return the previously fetched instances (if any) if fetching new instances fail. There should be also a configurable read timeout and connect timeout on the discovery. Since getting a valid instance is probably blocking for the client, we do not want to make him wait indefinitely, or even on a large connect or read timeout (e.g. 30 secs), if some instances have been fetched in the past. Refreshing the instances from the discovery service should be seen as "best effort", when instances have been previously fetched in the past. On the very first attempt, when there are not cached instances, we may use a bigger read and connect timeouts, because if we fail fetching instances, the business call will fail anyway. So it is worth waiting a little bit longer. But when instances exist and have been cached, we should wait for a very short period of time only (e.g. 100ms depending on the 90% response time of the external discovery service), and fall back quickly on the cached instances if we do not get an answer in time.

michalszynkiewicz commented 2 years ago

@vsevel do you have a reproducer for this behavior? WIth CachingServiceDiscovery it should not happen but maybe we have some bug.

Does your service discovery return a Uni with a failure or throw an exception?

vsevel commented 2 years ago

hello @michalszynkiewicz I have created a branch for that specific issue: https://github.com/vsevel/stork-issue/tree/issue_231 Instructions to reproduce:

mvn clean install

# first window (app)
java -Dquarkus.http.port=8080 -jar target/quarkus-app/quarkus-run.jar 

# second window (discovery service)
java -Dquarkus.http.port=8060 -jar target/quarkus-app/quarkus-run.jar 

# third window
 curl http://localhost:8080/hello/stork => SUCCESS
 curl http://localhost:8080/hello/stork => SUCCESS
...

now kill discovery service and wait refresh period

 curl http://localhost:8080/hello/stork => KO
 curl http://localhost:8080/hello/stork => KO
...

restart discovery service and wait refresh period or so:

 curl http://localhost:8080/hello/stork => SUCCESS
 curl http://localhost:8080/hello/stork => SUCCESS
...

Does your service discovery return a Uni with a failure or throw an exception?

it is returning a Uni:

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public Uni<String> discovery(@QueryParam("app") String app) {
        String address = addresses.get(app);
        UniCreate uni = Uni.createFrom();
        if (address != null) {
            log.info("discovery returning " + address + " for app " + app);
            return uni.item(address);
        } else {
            return uni.failure(new RuntimeException("could not find address for app " + app));
        }
    }

did I make any mistake?

michalszynkiewicz commented 2 years ago

how do you invoke it from the ServiceDiscovery implementation?

vsevel commented 2 years ago
    @Override
    public Uni<List<ServiceInstance>> fetchNewServiceInstances(List<ServiceInstance> previousInstances) {
        DiscoveryClient discoveryClient = Arc.container().instance(DiscoveryClientProvider.class).get().getClient();
        long timeout = previousInstances.isEmpty() ? 10000L : 100L;
        String addresses = discoveryClient.discovery(config.getApp()).await().atMost(Duration.ofMillis(timeout));
        UniCreate uni = Uni.createFrom();
        try {
            return uni.item(map(addresses));
        } catch (TimeoutException e) {
            return uni.failure(e);
        }
    }

I need to rebase on your code that you provided in the PR, to see if that changes anything.

vsevel commented 2 years ago

note also that I did not see that behavior in 1.0.0 with <artifactId>stork-load-balancer-response-time</artifactId> If I did not mixed up my tests, it seems a (negative) change of behavior in 1.0.1-SN

EDIT : not sure if this was on this issue. please disregard this comment.

vsevel commented 2 years ago

hmm. not seeing the issue anymore after the rebase on my project (including your code). still surprising to me because even if was doing some blocking code, I was catching the mutiny timeout, and resending a uni failure. still, I do not see a reason to keep that issue opened, unless you need me to make additional tests in that context.

I am seeing a log error in the CachingServiceDiscovery with a long stacktrace when the discovery service is not reachable:

2022-02-10 12:41:38,318 ERROR [io.sma.sto.imp.CachingServiceDiscovery] (vert.x-eventloop-thread-1) Failed to fetch service instances: javax.ws.rs.ProcessingException: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8060
    at org.jboss.resteasy.reactive.client.handlers.ClientSendRequestHandler$2.accept(ClientSendRequestHandler.java:230)
    at org.jboss.resteasy.reactive.client.handlers.ClientSendRequestHandler$2.accept(ClientSendRequestHandler.java:226)
    at io.smallrye.context.impl.wrappers.SlowContextualConsumer.accept(SlowContextualConsumer.java:21)
    at io.smallrye.mutiny.helpers.UniCallbackSubscriber.onFailure(UniCallbackSubscriber.java:65)
    at io.smallrye.mutiny.operators.uni.UniOperatorProcessor.onFailure(UniOperatorProcessor.java:54)

Since the business call works regardless using the cache, an option would be to log a warn only (with no stack), or in debug with a stack to cut on verbosity.

vsevel commented 2 years ago

for sake of completion I have seen this warning while I was running my test (on the app side):

2022-02-10 12:45:56,569 WARN  [io.ver.cor.imp.VertxImpl] (executor-thread-0) You're already on a Vert.x context, are you sure you want to create a new Vertx instance?
2022-02-10 12:46:40,715 WARN  [io.ver.cor.imp.VertxImpl] (executor-thread-0) You're already on a Vert.x context, are you sure you want to create a new Vertx instance?

It happened only twice (but the discovering service was called multiple times).

michalszynkiewicz commented 2 years ago

oh, it's a bug in Quarkus, I'll try to fix it. Created an issue for it https://github.com/quarkusio/quarkus/issues/23578

michalszynkiewicz commented 2 years ago

as the last thing is not related to the report, I'll close this issue. If I missed something please reopen it