spring-projects / spring-session

Spring Session
https://spring.io/projects/spring-session
Apache License 2.0
1.84k stars 1.1k forks source link

Travis CI fails occasionally due to GemFire sample integration test failure #672

Closed vpavic closed 7 years ago

vpavic commented 7 years ago

Build output:

:samples:httpsession-gemfire-boot:runGemFireServer
STARTING GEMFIRE SERVER...
:samples:httpsession-gemfire-boot:integrationTest

sample.AttributeTests > first visit no attributes FAILED
    java.lang.IllegalStateException
        Caused by: org.springframework.context.ApplicationContextException
            Caused by: org.springframework.boot.context.embedded.EmbeddedServletContainerException
                Caused by: org.springframework.beans.factory.UnsatisfiedDependencyException
                    Caused by: org.springframework.beans.factory.UnsatisfiedDependencyException
                        Caused by: org.springframework.beans.factory.BeanCreationException
                            Caused by: com.gemstone.gemfire.cache.NoSubscriptionServersAvailableException
                                Caused by: com.gemstone.gemfire.cache.NoSubscriptionServersAvailableException
    java.lang.NullPointerException

2 tests completed, 1 failed, 1 skipped
:samples:httpsession-gemfire-boot:integrationTest FAILED

FAILURE: Build failed with an exception.

Here are the samples of such jobs:

/cc @jxblum

jxblum commented 7 years ago

It's possible the GemFire server in the Travis CI environment is taking too long to startup. My default timeout is 20 seconds, which should be more than enough time for the server to start (the server often times takes far less than 20 seconds).

All SSDG tests use no more than a 20 second timeout to wait for the server to start and for the client to connect and the run without incident in other CI environments, like build.spring.io, for example. I could find no instances of build failures in build.spring.io related to Subscription Servers available.

It is also possible that prior build failures left orphaned processes on the Travis CI server that are possibly conflicting with subsequent build runs that need to be cleaned up, though this less likely than the former.

rwinch commented 7 years ago

@jxblum I noticed these failures are happening on our Jenkins build too. See

What should we do to fix this?

jxblum commented 7 years ago

@rwinch

As discussed in chat, I am disabling the JMX-based GemFire Management Service in the Spring Boot-based GemFire Server. This should help improve the server's startup time and reduce overhead in starting extraneous services during automated test runs.

Additionally, I am increasing the timeout used by the GemFire cache client to connect to the server from 20 seconds up to 60 seconds. I have also increased the read timeout from 2 seconds up to 15 seconds in waiting for a server's response to a client request (message).

Unfortunately, there is no other/better means than what is already in-place to determine when a GemFire Server has successfully started and, and is "listening" for and "accepting" client connections, and thus be able to avoid the timeout. I know; I have tried.

I have used this pattern, reliably in the SDG test suite for quite some time and all the tests pass both locally and in the Bamboo CI environment without incident between the client and server. I also never have a problem running the Spring Session (Data GemFire) test suite locally, unless there is a legitimate issue. I have not encountered this test failure yet and I have not observed a Spring Session Bamboo CI build environment run with this failure.

Anyway, I hope this addresses any environmental issues. If not, we can reopen this ticket and I will revisit this.

rwinch commented 7 years ago

This is still failing. See https://build.spring.io/browse/SESSION-B13X-159

jxblum commented 7 years ago

@rwinch -

Well, this could just have been a server anomaly (e.g. congestion/slowdown) that threw the timing off. I see that SESSION-B13X-160 is successful once again.

It would seem Redis tests are equally unreliable and prone to fail from time to time. See here and here; same failure.

Also, the spring-session/samples/FindByUsername integration test is a Redis test that is prone to failing. More builds related to Redis (integration) tests failures... #107, #105, #77.

Anyway, I had to go all the way back to SESSION-B13X-DEFAULT-23 on September 7th to find the same GemFire (Boot Sample) test failure, but for a slightly different reason...

Caused by: java.io.IOException: Cannot run program "java": error=2, No such file or directory

Versus...

Caused by: com.gemstone.gemfire.cache.NoSubscriptionServersAvailableException: Primary discovery failed.

Where the later means the GemFire Server did not start, or could not start in a "timely" fashion.

Additionally, I had to go back to Travic CI build #282 (22 days ago, which was well before my comment stating I made changes to try and address this issue) to find the same failure in the history of builds.

Similarly, I had to go back to Jenkins build #131 to find the same GemFire failure, which was 21 days ago on November 16th. Since that time, the Jenkins Spring Session builds seems stable.

As I mention above. I up'd the client/server connection timeout to 60 seconds. This should be more than adequate to allow the GemFire Server to startup. Most GemFire integration tests in Spring Session, and well Spring Data GemFire for that matter (which uses nearly the exact same mechanics to coordinate client/server tests), use a 20 second timeout. Of course, it is a little different starting GemFire in "standalone mode" vs a "Web container" (i.e. Tomcat) as is the case in the httpsession-gemfire-boot sample integration tests.

Ironically, none of the other client/server GemFire (samples) integration tests (i.e. httpsession-gemfire-clientserver-xml or httpsession-gemfire-clientserver) seem to have problems, suggesting that Spring Boot adds significant overhead to affect the timing between the GemFire client and server.

Of course, the other GemFire samples integration tests (i.e. httpsession-gemfire-p2p-xml and httpsession-gemfire-p2p) seem to be completely void of this problem.

Anyway, I have exhausted all the available options that GemFire provides in coordinating between GemFire client and servers, and it is really in the driver where this coordination needs to happen. Any other mechanics (e.g. this) is no more reliable.

So, I am open to suggestions here, but up'ing the timeout (again) is not going to make a difference, especially when "network" resources are unavoidably involved.

rwinch commented 7 years ago

Thanks for the analysis. I agree the Redis failures are an issue too. I created #689 to address this.

Anyway, I have exhausted all the available options that GemFire provides in coordinating between GemFire client and servers, and it is really in the driver where this coordination needs to happen. Any other mechanics (e.g. this) is no more reliable.

Are there any enhancements that we can request to GemFire to make this possible?

I'm going to leave this issue open until we can make these builds consistent.

jxblum commented 7 years ago

I just explored additional GemFire Pool settings (again). I was thinking that perhaps the PoolFactory.setFreeConnectionTimeout(:int) would provide some relief. However, it seems this setting has no effect if the maxConnections is also not set, and I do not set a limit on the number of connections that the client is allowed to create (additionally, I think it is something arbitrary, like 800 connections created by the Pool by default, OOTB, well more than enough).

Regarding...

Are there any enhancements that we can request to GemFire to make this possible?

In fact, I go some of the ideas on how to coordinate (e.g. ClientMemberListener registration) that I employ now from one of the GemFire EAs, over and beyond my old method, which I still use (mostly for fallback).

Anyway, there are literally hundreds of client/server tests in SDG (of course, no Spring Boot) and that build has been 100% reliable.

I guess, I still wonder, what is different about using Spring Boot (in httpsession-gemfire-boot) vs not (i.e. httpsession-gemfire-clientserver[-xml]).

As for...

I'm going to leave this issue open until we can make these builds consistent.

Fair enough.

jxblum commented 7 years ago

@rwinch - I figured this one out too by completing #677. Will add more details later and close this issue.

jxblum commented 7 years ago

I will not close this Issue until the necessary changes in #677 are back ported to 1.4.x per @rwinch's comment in #677.

rwinch commented 7 years ago

@jgrandja I'm still seeing this issue in master running on Travis

jxblum commented 7 years ago

@rwinch I am not exactly sure what would be causing this. My understanding of Travis CI is that each build is run in a separate Docker container. As such, there should not be any lingering (GemFire Server) processes between build runs, which could be causing a conflict (though should not be likely since ports are randomized).

After speaking with the GemFire engineering team the other day, GemFire should continuously attempt to fill the Pool with the minimal number of (set) connections. If the minConnections property is not set, then GemFire will default to a minimum size of 1. Therefore, GemFire will keep attempting to connect to the server until the minimum number of connections requirement is fulfilled, and therefore the application should not see this error...

Caused by: org.apache.geode.cache.NoSubscriptionServersAvailableException

In fact, I prevent the (Spring Boot Web) application from performing any cache client Region operations until I receive notification that the GemFire Server is running.

So, it should instead throw an IllegalStateException with the given message.

My logic is sound. Because we are not seeing this Exception, it could mean we have exposed a GemFire bug/race condition somewhere in GemFire's code.

1 of the things I learned yesterday is, when a client registers interests (and I do, because we need to know about expired Session entries on the server and throw an appropriate Spring Session event on the client) is that GemFire internally opens additional connections to receive subscription queue events (for interests). The client does not use "Pooled connections" to receive those events. The Pool is only ever used by a client to perform cache Region ops. Additionally, a server creates a subscription queue when a client connects and registers interests. But perhaps the later is 2 different things.

Meaning, when a cache client connects and the "subscription" property is set (to true, defaults to false) on a Pool, then when GemFire receives and accepts a connection attempt from a client where subscription has been enabled, I gather the server will go on to create a subscription message queue for the client and then setup a socket to send the client subscription queue events. Internally, I think this "subscription queue" is implemented as another Region. I suspect what might be happening is that the Spring Boot, Spring Session application client (i.e. httpsession-gemfire-boot sample) is too quick to register interests. I.e. I think the subscription queue might need to be fully initialized before client Region operations can begin. If not, NoSubscriptionServersAvailableException is thrown. Just a theory; one I need to validate.

I bet this is what is happening. Generally, I don't think users would (ever? possibly, but probably not very likely) see this in practice since their clusters would already be running. However, in an automated test, things are clearly happening much faster, possible before the server can respond, or rather, is "ready" to respond.

Technically, the other client/server samples... httpsession-gemfire-clientserver[-xml] are (should be) equally susceptible to this failure, but don't seem to be affected, probably because the "timing" just works out and we don't hit the possible GemFire's race condition.

IMO, GemFire logic to connect a client to it's subscription queue should function similarly to the Pool, i.e. keep trying the connection up to a specified timeout. Therefore, I think this is a BUG in GemFire.

Of course, I am trying to think of how to work around this in automated tests.

I will have to think on this more. In the meantime, I am going to run my thoughts above by the GemFire team.

jxblum commented 7 years ago

@rwinch - referred me to a (recent) build failure. All comments that follow pertain to this build failure involving the :samples:httpsession-gemfire-boot Spring Session sample integration test involving a Spring Boot web application using Spring Session backed by GemFire.

First, I have observed...

build   13-Jan-2017 18:37:15    :samples:httpsession-gemfire-boot:runGemFireServer
build   13-Jan-2017 18:37:15    STARTING GEMFIRE SERVER...

And then...

build   13-Jan-2017 18:37:29    sample.AttributeTests > createAttribute FAILED
build   13-Jan-2017 18:37:29        java.lang.IllegalStateException
build   13-Jan-2017 18:37:29            Caused by: org.springframework.beans.factory.UnsatisfiedDependencyException
build   13-Jan-2017 18:37:29                Caused by: org.springframework.beans.factory.BeanCreationException
build   13-Jan-2017 18:37:29                    Caused by: com.gemstone.gemfire.cache.NoSubscriptionServersAvailableException
build   13-Jan-2017 18:37:29                        Caused by: com.gemstone.gemfire.cache.NoSubscriptionServersAvailableException
error   13-Jan-2017 18:37:34    
error   13-Jan-2017 18:37:34    3 tests completed, 1 failed
build   13-Jan-2017 18:37:34    :samples:httpsession-gemfire-boot:integrationTest FAILED

I.e. the cache client (Application) test resulted in a failure ~14 seconds after starting the GemFire Server, which is well before the timeout, currently set to 60 seconds, here. This should not happen.

There are 2 possible reasons why this occurred...

  1. The application encountered a race condition (bug) in GemFire when constructing, configuring and initializing the ClientCache.

Specifically, GemFire's "DEFAULT" Pool is also configured as part of the ClientCache bean creation.

Technically, GemFire is supposed to keep trying to connect to the server in order to fill the minimum number of connections configured for the Pool. If minConnections is not explicitly set, then a Pool defaults to 1. If the GemFire Server is not yet available, the client (Pool logic) should continue to try and connect to fill the minimum number of connections.

  1. This logic is not being tripped.

The first reason seems more likely than the second since we do not see this error message, "GemFire Cache Server failed to start on host [] and port []".

It is unlikely that Spring is not registering and calling the BeanPostProcessor for each bean defined/declared in the Spring ApplicationContext. It is, however, possible that my logic is wrong, although very unlikely since this does not reproduce locally (for reasons that are yet unknown).

Additionally, similar logic was used in the other Spring Session HttpSession GemFire client/server samples, e.g. :samples:httpsession-gemfire-clientserver and :samples:httpsession-gemfire-clientserve-xml.

Also, I recall that the ClientCache bean creation has failed with NoSubscriptionServersAvailableException, but I no longer have the stack trace and details of the error, which I need to further investigate this problem.

jxblum commented 7 years ago

I artificially recreated the problem and I am posting the Stack Trace here for reference...

    2017-01-16 19:50:09.763 ERROR 8452 --- [    Test worker] o.s.test.context.TestContextManager      : Caught exception while allowing TestExecutionListener [org.springframework.test.context.web.ServletTestExecutionListener@619d0fee] to prepare test instance [sample.AttributeTests@42ea1773]

    java.lang.IllegalStateException: Failed to load ApplicationContext
        at org.springframework.test.context.cache.DefaultCacheAwareContextLoaderDelegate.loadContext(DefaultCacheAwareContextLoaderDelegate.java:124) ~[spring-test-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.test.context.support.DefaultTestContext.getApplicationContext(DefaultTestContext.java:102) ~[spring-test-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.test.context.web.ServletTestExecutionListener.setUpRequestContextIfNecessary(ServletTestExecutionListener.java:190) ~[spring-test-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.test.context.web.ServletTestExecutionListener.prepareTestInstance(ServletTestExecutionListener.java:132) ~[spring-test-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.test.context.TestContextManager.prepareTestInstance(TestContextManager.java:243) ~[spring-test-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.createTest(SpringJUnit4ClassRunner.java:226) [spring-test-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.test.context.junit4.SpringJUnit4ClassRunner$1.runReflectiveCall(SpringJUnit4ClassRunner.java:288) [spring-test-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) [junit-4.12.jar:4.12]
        at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.methodBlock(SpringJUnit4ClassRunner.java:290) [spring-test-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:245) [spring-test-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:97) [spring-test-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) [junit-4.12.jar:4.12]
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) [junit-4.12.jar:4.12]
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) [junit-4.12.jar:4.12]
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) [junit-4.12.jar:4.12]
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) [junit-4.12.jar:4.12]
        at org.springframework.test.context.junit4.statements.RunBeforeTestClassCallbacks.evaluate(RunBeforeTestClassCallbacks.java:61) [spring-test-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.test.context.junit4.statements.RunAfterTestClassCallbacks.evaluate(RunAfterTestClassCallbacks.java:70) [spring-test-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.junit.runners.ParentRunner.run(ParentRunner.java:363) [junit-4.12.jar:4.12]
        at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.run(SpringJUnit4ClassRunner.java:189) [spring-test-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.runTestClass(JUnitTestClassExecuter.java:114) [gradle-testing-jvm-3.3.jar:3.3]
        at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.execute(JUnitTestClassExecuter.java:57) [gradle-testing-jvm-3.3.jar:3.3]
        at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassProcessor.processTestClass(JUnitTestClassProcessor.java:66) [gradle-testing-jvm-3.3.jar:3.3]
        at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51) [gradle-testing-base-3.3.jar:3.3]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_72]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_72]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_72]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_72]
        at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) [gradle-messaging-3.3.jar:3.3]
        at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) [gradle-messaging-3.3.jar:3.3]
        at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32) [gradle-messaging-3.3.jar:3.3]
        at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93) [gradle-messaging-3.3.jar:3.3]
        at com.sun.proxy.$Proxy3.processTestClass(Unknown Source) [na:na]
        at org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:109) [gradle-testing-base-3.3.jar:3.3]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_72]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_72]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_72]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_72]
        at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) [gradle-messaging-3.3.jar:3.3]
        at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) [gradle-messaging-3.3.jar:3.3]
        at org.gradle.internal.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:377) [gradle-messaging-3.3.jar:3.3]
        at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:54) [gradle-base-services-3.3.jar:3.3]
        at org.gradle.internal.concurrent.StoppableExecutorImpl$1.run(StoppableExecutorImpl.java:40) [gradle-base-services-3.3.jar:3.3]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_72]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_72]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]
    Caused by: org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'sessionRepository' defined in org.springframework.session.data.gemfire.config.annotation.web.http.GemFireHttpSessionConfiguration: Unsatisfied dependency expressed through method 'sessionRepository' parameter 0; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'ClusteredSpringSessions' defined in org.springframework.session.data.gemfire.config.annotation.web.http.GemFireHttpSessionConfiguration: Invocation of init method failed; nested exception is org.apache.geode.cache.NoSubscriptionServersAvailableException: org.apache.geode.cache.NoSubscriptionServersAvailableException: Could not initialize a primary queue on startup. No queue servers available.
        at org.springframework.beans.factory.support.ConstructorResolver.createArgumentArray(ConstructorResolver.java:748) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:466) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.instantiateUsingFactoryMethod(AbstractAutowireCapableBeanFactory.java:1185) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBeanInstance(AbstractAutowireCapableBeanFactory.java:1079) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:527) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:484) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:302) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:230) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:298) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:193) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:751) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:865) ~[spring-context-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:541) ~[spring-context-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:736) ~[spring-boot-2.0.0.BUILD-SNAPSHOT.jar:2.0.0.BUILD-SNAPSHOT]
        at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:369) ~[spring-boot-2.0.0.BUILD-SNAPSHOT.jar:2.0.0.BUILD-SNAPSHOT]
        at org.springframework.boot.SpringApplication.run(SpringApplication.java:313) ~[spring-boot-2.0.0.BUILD-SNAPSHOT.jar:2.0.0.BUILD-SNAPSHOT]
        at org.springframework.boot.test.context.SpringBootContextLoader.loadContext(SpringBootContextLoader.java:118) ~[spring-boot-test-2.0.0.BUILD-SNAPSHOT.jar:2.0.0.BUILD-SNAPSHOT]
        at org.springframework.test.context.cache.DefaultCacheAwareContextLoaderDelegate.loadContextInternal(DefaultCacheAwareContextLoaderDelegate.java:98) ~[spring-test-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.test.context.cache.DefaultCacheAwareContextLoaderDelegate.loadContext(DefaultCacheAwareContextLoaderDelegate.java:116) ~[spring-test-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        ... 45 common frames omitted
    Caused by: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'ClusteredSpringSessions' defined in org.springframework.session.data.gemfire.config.annotation.web.http.GemFireHttpSessionConfiguration: Invocation of init method failed; nested exception is org.apache.geode.cache.NoSubscriptionServersAvailableException: org.apache.geode.cache.NoSubscriptionServersAvailableException: Could not initialize a primary queue on startup. No queue servers available.
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1640) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:569) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:484) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:302) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:230) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:298) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:193) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:292) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:198) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.config.DependencyDescriptor.resolveCandidate(DependencyDescriptor.java:249) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.doResolveDependency(DefaultListableBeanFactory.java:1128) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.resolveDependency(DefaultListableBeanFactory.java:1056) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.ConstructorResolver.resolveAutowiredArgument(ConstructorResolver.java:833) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.ConstructorResolver.createArgumentArray(ConstructorResolver.java:740) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        ... 63 common frames omitted
    Caused by: org.apache.geode.cache.NoSubscriptionServersAvailableException: org.apache.geode.cache.NoSubscriptionServersAvailableException: Could not initialize a primary queue on startup. No queue servers available.
        at org.apache.geode.cache.client.internal.QueueManagerImpl.getAllConnections(QueueManagerImpl.java:190) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnQueuesAndReturnPrimaryResult(OpExecutorImpl.java:540) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.PoolImpl.executeOnQueuesAndReturnPrimaryResult(PoolImpl.java:805) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.RegisterInterestOp.execute(RegisterInterestOp.java:58) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.ServerRegionProxy.registerInterest(ServerRegionProxy.java:362) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.LocalRegion.processSingleInterest(LocalRegion.java:3893) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.LocalRegion.registerInterest(LocalRegion.java:3972) ~[geode-core-9.0.1.jar:na]
        at org.springframework.data.gemfire.client.ClientRegionFactoryBean.registerInterests(ClientRegionFactoryBean.java:385) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.data.gemfire.client.ClientRegionFactoryBean.postProcess(ClientRegionFactoryBean.java:362) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.data.gemfire.client.ClientRegionFactoryBean.afterPropertiesSet(ClientRegionFactoryBean.java:111) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.session.data.gemfire.config.annotation.web.http.support.GemFireCacheTypeAwareRegionFactoryBean.newClientRegion(GemFireCacheTypeAwareRegionFactoryBean.java:163) ~[spring-session-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.session.data.gemfire.config.annotation.web.http.support.GemFireCacheTypeAwareRegionFactoryBean.afterPropertiesSet(GemFireCacheTypeAwareRegionFactoryBean.java:98) ~[spring-session-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeInitMethods(AbstractAutowireCapableBeanFactory.java:1699) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1636) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        ... 76 common frames omitted
    Caused by: org.apache.geode.cache.NoSubscriptionServersAvailableException: Could not initialize a primary queue on startup. No queue servers available.
        at org.apache.geode.cache.client.internal.QueueManagerImpl.initializeConnections(QueueManagerImpl.java:590) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.QueueManagerImpl.start(QueueManagerImpl.java:303) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.PoolImpl.start(PoolImpl.java:322) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.PoolImpl.finishCreate(PoolImpl.java:151) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.PoolImpl.create(PoolImpl.java:137) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.PoolFactoryImpl.create(PoolFactoryImpl.java:320) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.GemFireCacheImpl.determineDefaultPool(GemFireCacheImpl.java:3101) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.GemFireCacheImpl.initializeDeclarativeCache(GemFireCacheImpl.java:1424) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1245) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreate(GemFireCacheImpl.java:794) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.GemFireCacheImpl.createClient(GemFireCacheImpl.java:767) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.ClientCacheFactory.basicCreate(ClientCacheFactory.java:257) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.ClientCacheFactory.create(ClientCacheFactory.java:207) ~[geode-core-9.0.1.jar:na]
        at org.springframework.data.gemfire.client.ClientCacheFactoryBean.createCache(ClientCacheFactoryBean.java:333) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.data.gemfire.CacheFactoryBean.resolveCache(CacheFactoryBean.java:250) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.data.gemfire.CacheFactoryBean.init(CacheFactoryBean.java:191) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.data.gemfire.CacheFactoryBean.getObject(CacheFactoryBean.java:178) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.data.gemfire.CacheFactoryBean.getObject(CacheFactoryBean.java:88) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.beans.factory.support.FactoryBeanRegistrySupport.doGetObjectFromFactoryBean(FactoryBeanRegistrySupport.java:168) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.FactoryBeanRegistrySupport.getObjectFromFactoryBean(FactoryBeanRegistrySupport.java:103) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractBeanFactory.getObjectForBeanInstance(AbstractBeanFactory.java:1630) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:313) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:198) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.config.DependencyDescriptor.resolveCandidate(DependencyDescriptor.java:249) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.doResolveDependency(DefaultListableBeanFactory.java:1128) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.resolveDependency(DefaultListableBeanFactory.java:1056) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.ConstructorResolver.resolveAutowiredArgument(ConstructorResolver.java:833) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.ConstructorResolver.createArgumentArray(ConstructorResolver.java:740) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:466) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.instantiateUsingFactoryMethod(AbstractAutowireCapableBeanFactory.java:1185) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBeanInstance(AbstractAutowireCapableBeanFactory.java:1079) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:527) ~[spring-beans-5.0.0.BUILD-SNAPSHOT.jar:5.0.0.BUILD-SNAPSHOT]
        ... 75 common frames omitted

So in 1 case, the NoSubscriptionServersAvailableException is definitely caused by attempting to register Interests, which uses a Pool connection and is considered to be the same thing as performing a Region operation, which can be seen here...

    Caused by: org.apache.geode.cache.NoSubscriptionServersAvailableException: org.apache.geode.cache.NoSubscriptionServersAvailableException: Could not initialize a primary queue on startup. No queue servers available.
        at org.apache.geode.cache.client.internal.QueueManagerImpl.getAllConnections(QueueManagerImpl.java:190) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnQueuesAndReturnPrimaryResult(OpExecutorImpl.java:540) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.PoolImpl.executeOnQueuesAndReturnPrimaryResult(PoolImpl.java:805) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.RegisterInterestOp.execute(RegisterInterestOp.java:58) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.ServerRegionProxy.registerInterest(ServerRegionProxy.java:362) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.LocalRegion.processSingleInterest(LocalRegion.java:3893) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.LocalRegion.registerInterest(LocalRegion.java:3972) ~[geode-core-9.0.1.jar:na]
        at org.springframework.data.gemfire.client.ClientRegionFactoryBean.registerInterests(ClientRegionFactoryBean.java:385) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.data.gemfire.client.ClientRegionFactoryBean.postProcess(ClientRegionFactoryBean.java:362) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.data.gemfire.client.ClientRegionFactoryBean.afterPropertiesSet(ClientRegionFactoryBean.java:111) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]

In the other case, which appears to be happening after the ClientCache (bean) is created, GemFire is having trouble initializing the connections between the client and server for the server to push subscription queue events/notifications based on interests down to the client, which is apparent from here...

   Caused by: org.apache.geode.cache.NoSubscriptionServersAvailableException: Could not initialize a primary queue on startup. No queue servers available.
        at org.apache.geode.cache.client.internal.QueueManagerImpl.initializeConnections(QueueManagerImpl.java:590) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.QueueManagerImpl.start(QueueManagerImpl.java:303) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.PoolImpl.start(PoolImpl.java:322) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.PoolImpl.finishCreate(PoolImpl.java:151) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.internal.PoolImpl.create(PoolImpl.java:137) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.PoolFactoryImpl.create(PoolFactoryImpl.java:320) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.GemFireCacheImpl.determineDefaultPool(GemFireCacheImpl.java:3101) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.GemFireCacheImpl.initializeDeclarativeCache(GemFireCacheImpl.java:1424) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1245) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreate(GemFireCacheImpl.java:794) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.internal.cache.GemFireCacheImpl.createClient(GemFireCacheImpl.java:767) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.ClientCacheFactory.basicCreate(ClientCacheFactory.java:257) ~[geode-core-9.0.1.jar:na]
        at org.apache.geode.cache.client.ClientCacheFactory.create(ClientCacheFactory.java:207) ~[geode-core-9.0.1.jar:na]
        at org.springframework.data.gemfire.client.ClientCacheFactoryBean.createCache(ClientCacheFactoryBean.java:333) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.data.gemfire.CacheFactoryBean.resolveCache(CacheFactoryBean.java:250) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.data.gemfire.CacheFactoryBean.init(CacheFactoryBean.java:191) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.data.gemfire.CacheFactoryBean.getObject(CacheFactoryBean.java:178) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]
        at org.springframework.data.gemfire.CacheFactoryBean.getObject(CacheFactoryBean.java:88) ~[spring-data-gemfire-2.0.0.BUILD-SNAPSHOT.jar:na]

It is the later one that I think is causing trouble since my "LATCH logic" prevents the Region bean being created by the Spring container until the GemFire Server is available.

jxblum commented 7 years ago

This is a race condition, plain and simple. The QueueManager in the client must be fully initialized before the cache client can "register interests" on the server, regardless of whether the server is available or not. And since the operation to "initialize" the QueueManager (internal) on the cache client (and all it's "necessary connections") is "asynchronous" from the Pool initialization, there are simply no guarantees that the "client" (not the "server") as I initially thought is "ready" to "register interests". This is particularly susceptible to happen in an automated client/server test where the client and server startup and initialization must be carefully coordinated.

jxblum commented 7 years ago

Alright, I believe I finally have a fix for this issue once and for all.

However, while it is hopefully unlikely, it is still possible that the GemFire client/server interaction coordination in the automated, integration tests (no-less) is disrupted due to "timing", possibly because of load on the CI server, for instance.

I.e. the Spring Session GemFire client/server integration tests (:spring-session-samples-boot-gemfire, :spring-session-samples-javaconfig-gemfire-clientserver and :spring-session-samples-xml-gemfire-clientserver) still rely on the successful and timely start of a GemFire Server process that is forked by the Gradle build (for example) during the Spring Session GemFire client/server integrationTest run.

The integration tests serve the role of GemFire cache client attempting to connect to the forked GemFire Server process. The client must coordinate using a time-based, wait mechanism and asynchronous event notification, which essentially boils down to establishing certain connections to the server in order to carry out the cache client functions. There is simply no other way around this in a client/server integration test and I have exhausted all of the facilities that GemFire provides.

So, in the end, it really all boils down to a 60 second timeout. However, I have take additional steps this time to ensure that all the necessary client connections have been established before use (which was unfortunately not previously the case as I initially thought). Similar mechanisms are applied in the SDG test suite itself, which runs quite reliably I might add. Still, if something were to fail, I have additionally spruced up the error messages to get a better idea of what went wrong in that particular instance.

jxblum commented 7 years ago

I should also note that while I am going to close this ticket again, if this should become a problem once more, we can reopen this issue or file a new one, which ever.

rwinch commented 7 years ago

Thanks @jxblum! I am glad to put this issue to rest :)

jxblum commented 7 years ago

Thanks @rwinch! I hope it sticks. I was reviewing the builds (Bamboo, Jenkins, Travis CI) today... so far so good, but we have a few more builds to go before I'd reasonably say this reliably resolved the problem. I am pretty confident in the changes I made based on my understanding and collaboration with the GemFire engineering team. So, we'll see I guess.

Still getting that pesky Assert.notNull(..) NoSuchMethodError in certain tests, though I think the core Spring Framework team may have fixed this glitch as I see a few builds now passing again.

shou1dwe commented 7 years ago

I am not sure whether you came across this post when you was doing the investigation https://communities.vmware.com/thread/411046?tstart=0

Seems to be the same problem, but the solution there is to tweak server GC strategy.

I am hitting the same problem right now with my application, though mine is on a much older Gemfire version. little bit depressing to see even with Geode, the same problem still exists.

Apart from seeing the same Stacktrace as above, there are these log lines in the Gemfire log

[config 2017/02/23 15:33:21.011 GMT [35.643.1] APPLICATION_NAME (NAME) - Instance 1 <main> tid=0x1] Updating membership port.  Port changed from 0 to 34,926.

[info 2017/02/23 15:33:21.104 GMT [35.643.1] APPLICATION_NAME (NAME) - Instance 1 <Cache Client Updater Thread  on hostname_here(gemfire_server_name:6058)<v13>:37184 port {port_here}> tid=0x2d] Cac
he Client Updater Thread  on hostname_here(gemfire_server_name:6058)<v13>:37184 port {port_here} (ip.ip.ip.ip:{port_here}) : ready to process messages.

[config 2017/02/23 15:33:21.117 GMT [35.643.1] APPLICATION_NAME (NAME) - Instance 1 <main> tid=0x1] Pool ManagerPool started with multiuser-authentication=false

[error 2017/02/23 15:33:21.138 GMT [35.643.1] APPLICATION_NAME (NAME) - Instance 1 <main> tid=0x1] Could not find server to create primary client queue. Number of exclude server is 1 and excepti
on is no exception

[error 2017/02/23 15:33:21.142 GMT [35.643.1] APPLICATION_NAME (NAME) - Instance 1 <main> tid=0x1] Could not initialize a primary queue on startup. No queue servers available.

[config 2017/02/23 15:33:21.147 GMT [35.643.1] APPLICATION_NAME (NAME) - Instance 1 <main> tid=0x1] Pool NonLegacyPool started with multiuser-authentication=false

There was once showing something similar to "InstantiatorRecoveryTask failed".

At the same time, the gemfire log on the server will print out something like "Connection closed by EOF".

The thread seems to be a good inspiration .. i will try @jxblum 's methods to guarantee Pool connection before registerInterest.

shou1dwe commented 7 years ago

@jxblum I wonder how is the fix working so far? I see there are two pieces of logic inside your code to guarantee some sort pre-conditions. Below is my understanding so far, from your code and from Gemfire doc on how certain APIs should work. Could you please kindly review and let me know whether that's consistent with what you were describing.

  1. the LATCH logic and MembershipListener memberJoined callback

memberJoined in this listener will be called when member connected to cache server successfully. This event also signify that the connection used for returning Interest Event from server to client is created.

  1. loop and check PoolImpl.getAll().forEach p -> p.isPrimaryUpdaterAlive till success (or timout)

this call will only return true when CacheClientUpdater is alive. or in other word, there is one minimum connection available in the pool. These are the connections/channels used for region ops, including registerInterest call from client to server. PoolImpl.getAll() can be used to access all pools if there are more in addition to the default pool.

jxblum commented 7 years ago

Hi @shou1dwe -

My apologies for the delayed response.

Regarding 1 - LATCH logic and ClientMembershipListener.memberJoined(..)...

The logic here and here (or rather here) ONLY signifies that the "cache client" successfully established a "Pool connection" to the server.

It DOES NOT signify "the connection" between the cache client and the server's subscription queue (manager) for the client, from which events based on the cache clients (registered) "Interests" will be delivered, has been established.

I.e. when a client connects to a server (via the Pool) and the initial handshake is complete, the server will establish a subscription queue from which to send events of "interests" to the client. Well, when the client first connects, this queue will obviously not exist yet and so takes sometime to initialize. Only after this queue has been established by the server will the client be able to register (express) interests and receive events, and the client receives those events on this "separate, non-Pool" connection. If this connection has not been established yet, then any registerInterests call initiated by the client will fail until then.

Regarding 2...

Your following logic is confused...

PoolImpl.getAll().forEach p -> p.isPrimaryUpdaterAlive

It does not do this for each connection in the Pool, as my logic here clearly shows. I.e. I am not iterating over any Pool connections.

Basically, I am simply asking "the Pool" whether the subscription queue has been established (on the server) and whether the client has opened and connected to the subscription queue manager in order to be able to receive events.

In fact, that connection, between the client and the subscription queue on the server from which events of interests are delivered, is NOT even a Pool connection; it is separate, which also means this logic DOES NOT determine "there is one minimum connection available in the pool."

Anyway, in summary...

  1. A client connects to a server via a Pool of connections.

  2. A client will send data access operations (e.g. Region.get(key) / Region.puts(key, value)) performed on the client to the server via Pool connections.

  3. A client will registerInterests using a Pool connection. However, a client is not permitted to register interests until the corresponding subscription queue has been established on the server and the client has separately opened a connection from which it will receive queue events.

The reason for all of this logic in my Spring Session Data GemFire (and Boot) samples is that this code is ran through automated integration tests and so "timing" this correctly is tricky.

You typically do not want nor even need to do this in your own application logic and you certainly do no want your application to depend on GemFire internal classes to figure this sort of thing out. This was only necessary for the automated nature of the tests.

However, it would be nice if GemFire provided additional hooks (or callbacks) for this sort of automated integration testing, which may not be that uncommon, especially in a client/server topology where a developer may be testing different UCs and interests.

Hope this helps to clear the muddy water a bit more.

Cheers! -John

shou1dwe commented 7 years ago

Thank you very much John @jxblum ! My initial understanding is off as you have pointed out, but the code I put together seems correct to me still.

To summarize it, there are two important assertions I need to make after Cache create and before proceed to registerInterest:

  1. memberJoined (-leave and -crash) has correct number of callbacks: pool connection to server established
  2. isPrimaryUpdater returns true for all pools (because we have multiple of them in our setup, and any of them can give EOF probelm. Thats why i employed PoolImpl.getAll() for my implementation): subscription queue from servers to client initialized

An interesting observation after I put in the fix is that, in the case where the isPrimaryUpdater returns false, it will only become true after ~10 seconds, which is the default pool ping interval. I believe they have some correlation. I will do some test once I have a spare hand and share the finding if there is any.

jxblum commented 7 years ago

Hi Zhang-

Following up to your questions...

  1. Yes. The memberJoined callback will be called once the client connects to the server.

However, the memberJoined callback will be called for each server in which the client connects. As you are probably aware, a Pool can be configured to connect to either a Locator, or Locators, or directly to 1 or more Servers.

When configuring the Pool to connect to 1 or more Locators, the Locator(s) send back a list of Servers in the cluster. This list is sorted based on IP or oldest to newest members (something like that). The client Pool will then begin to establish connections to the servers based on the min and max connections properties. For instance, if the min-connections is set to 2 and there are 10 servers in the cluster, then the client Pool only connects to 2 of the 10 servers.

Also keep in mind that the "groups" setting can play a factor. If the Pool is configured to only connect to Servers in a particular "group", then the Locator(s) will only return a list of servers that are actually members of the "group" for which the Pool is configured.

At any rate, the memberJoined callback is called for each connection to a different server. By way of example. if there are 10 servers in the cluster and the Pool has a min-connections of 20, and the client connects to all 10 servers, then the memberJoined callback is triggered 10 times (not 20). Again, it is based on how many different servers the client connects to, not connections.

  1. The PoolImpl.isPrimaryUpdaterAlive() needs to be called on each Pool where subscription-enabled is true. A subscription queue on the server is established for each client Pool where subscription has been enabled. A server-side subscription queue does not exist otherwise.

However, be very careful about relying on GemFire internal classes (e.g. PoolImpl). These internal classes are not part of the GemFire public API and are subject to change.

I was not aware of the correlation between the ping interval and the PoolImpl.isPrimaryUpdaterAlive() becoming true. That is interesting. It is possible to adjust the ping interval too; see here. In Spring Session Data GemFire support, I allow a developer to specify a separately configured, named GemFire Pool.

-j