spring-projects / spring-data-geode

Spring Data support for Apache Geode
Apache License 2.0

Standalone Locators do not sync #631

Closed. sjoshid closed this issue 1 year ago.

sjoshid commented 1 year ago

Hi, I have two standalone locators (let's call them L1 and L2) running on two different machines. I have two instances of my app that connect to L1 on startup. But if I bring down L1, I expect them to show up under L2. That doesn't seem to happen.

Is that how it's supposed to work, or am I misunderstanding this?

Env Details

How to reproduce?

Expectation

None of those expectations are met.

One interesting note: if I start another peer, it shows up under L2 (because L1 is down), but the regions are NOT replicated. So at this point, the last peer (and the peers started after it) has its own replicated region. I end up with two disjoint replicated regions, which is problematic.

jxblum commented 1 year ago

There are 2 different bits to this equation.

SERVER-SIDE

First, you need to consider what is expected on the server-side, between the peers in the cluster. Each server-side, Spring-configured Locator application (@LocatorApplication) should be configured with the Apache Geode locators property pointing to the other Locator.

For example, on hostOne, running LocatorOne, the Apache Geode locators property (i.e. spring.data.gemfire.locators), or the corresponding @LocatorApplication annotation's locators attribute, should be set as follows:

@LocatorApplication(name = "LocatorOne", port = 11235, locators="hostTwo[12480]")
class MySpringConfiguredApacheGeodeLocatorApplication { 
    // ...
}

Alternatively, in Spring Boot application.properties for LocatorOne running on hostOne you can set:

# Spring Boot application.properties for LocatorOne running on hostOne
spring.data.gemfire.locators=hostTwo[12480]

Subsequently, on hostTwo, LocatorTwo should be configured as follows:

@LocatorApplication(name = "LocatorTwo", port = 12480, locators="hostOne[11235]")
class MySpringConfiguredApacheGeodeLocatorApplication {
    // ...
}

Or, in Spring Boot application.properties for LocatorTwo running on hostTwo, using:

# Spring Boot application.properties for LocatorTwo running on hostTwo
spring.data.gemfire.locators=hostOne[11235]

NOTE: You do not need to configure the Locator port on either the Locator or the client-side if you are using the default port, 10334.

NOTE: Also, the hostnames running the Locator applications must be either the DNS names of your machines or their IP addresses.

CLIENT-SIDE

Now, for the client-side, you must configure your Apache Geode client Pools with the complete list of Locators in the server-side cluster. For example (using the "DEFAULT" Pool):

@ClientCacheApplication(name = "MyClient", locators = {
    @ClientCacheApplication.Locator(host = "hostOne", port=11235),
    @ClientCacheApplication.Locator(host = "hostTwo", port=12480)
})
class MySpringConfiguredApacheGeodeClientCacheApplication {
    // ...
}

Alternatively, and as recommended, you can always set the appropriate property:

# Spring Boot application.properties for the Spring-configured, Apache Geode ClientCache application

spring.data.gemfire.pool.locators=hostOne[11235],hostTwo[12480]

Of course, you can configure any specifically "named" Pool in either SDG's annotation-based configuration or in Spring Boot application.properties; see here.

Also, you have an option to configure client-side Pool properties (attributes) using SDG's PoolConfigurer (see Javadoc; also see Ref Guide) if you need finer-grained, programmatic control (e.g. conditional) over the Pool configuration, such as hosts and ports.
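As a rough illustration of the PoolConfigurer approach, a minimal sketch might look like the following (the class and bean names here are illustrative, not from this issue, and assume SDG's PoolConfigurer and ConnectionEndpoint types):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.gemfire.config.annotation.PoolConfigurer;
import org.springframework.data.gemfire.support.ConnectionEndpoint;

@Configuration
class ClientPoolConfiguration {

    // Programmatically add Locator endpoints to the client Pool;
    // conditional logic (e.g. environment checks) could go here.
    @Bean
    PoolConfigurer locatorsConfigurer() {
        return (beanName, poolFactoryBean) ->
            poolFactoryBean.addLocators(
                new ConnectionEndpoint("hostOne", 11235),
                new ConnectionEndpoint("hostTwo", 12480));
    }
}
```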

The same sort of programmatic, Configurer-based configuration approach is also applicable on the server-side, e.g. with a LocatorConfigurer.

Anyway, once the configuration is set up properly, what Apache Geode (or GemFire) "does" and "shows" is outside the control of Spring (Data Geode).

For example, if the list clients Gfsh command (here) is not working correctly, then this might be an issue in Apache Geode, and specifically with Gfsh.

Also, the Gfsh list clients command specifically states:

"Display a list of connected clients and the servers to which they connect."

In this case, the Gfsh list clients command might only list "servers" that are specifically CacheServer nodes. I don't remember off the top of my head, and you might need to test this!

Technically, cache clients only use the Locators to find servers in the cluster. Once the topology is known, the clients perform data access operations (in)directly (depending on the "single-hop" configuration and Region type, e.g. PARTITION Regions) against the data (server) node with the data of interest.

Anyway, hope this helps for now.

jxblum commented 1 year ago

You might also want to have a look at the Spring Boot for Apache Geode (SBDG) project documentation.

See here in particular, which talks about configuring and bootstrapping Apache Geode Locator applications using Spring Boot with SBDG.

NOTE: SBDG is a set of Spring Boot extensions geared specifically toward Apache Geode. It builds on Spring Data for Apache Geode (SDG) as well as Spring Session for Apache Geode (SSDG) and Spring Test for Apache Geode (STDG).

NOTE: SBDG is a better SDG, ;-)

jxblum commented 1 year ago

I was just informed that this Apache Geode issue might be related to the problem: https://issues.apache.org/jira/browse/GEODE-9822.

Also, if you could please share a bit more detail, that might be helpful and/or shed light on your specific problem, such as, but not limited to:

Even if you could provide a small reproducer, including the steps you took to reproduce the issue (such as, "I used Gfsh at step X to verify that ???"), this would go a long way in helping as well.

Thanks!

sjoshid commented 1 year ago

@jxblum First of all, sorry for such a crappy description. I was pretty confident it was something silly on my side. I have updated the description with more details.

Your first comment seems to assume I'm using the client-server topology. But you'd be glad to know that I'm using the much simpler peer-to-peer topology. As I said in the description, and I'll say it again, my app is not a Spring Boot app. It is a vanilla Spring app.

jxblum commented 1 year ago

Hi @sjoshid - Thank you for updating your Issue description with more details.

However, my comment that followed stands whether you are using the client/server topology or the p2p topology (and no clients). I specifically addressed the SERVER-SIDE in that comment.

Nevertheless, let's walk through a simple example by following your steps to reproduce the issue. I will be running everything on my localhost, using Java 8 with SDG 2.7.5 (the latest patch release in 2.7), which is based on Apache Geode 1.14.4.

NOTE: Whether I run on a single host (e.g. localhost) or multiple hosts really should not matter. The behavior will be the same unless you have other network issues/restrictions/limitations going on.

Java:

$ java -version
java version "1.8.0_351"
Java(TM) SE Runtime Environment (build 1.8.0_351-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.351-b10, mixed mode)

Apache Geode installation & Gfsh:

$ echo $GEODE_HOME
/Users/jblum/pivdev/apache-geode-1.14.4

$ gfsh
    _________________________     __
   / _____/ ______/ ______/ /____/ /
  / /  __/ /___  /_____  / _____  / 
 / /__/ / ____/  _____/ / /    / /  
/______/_/      /______/_/    /_/    1.14.4

Monitor and Manage Apache Geode
gfsh>

NOTE: You can find the source code for the example I am about to demonstrate in GitHub, here.

Next, I scripted the execution of the 2 Apache Geode Locators on ports 11235 and 12480 using Gfsh.

You can run this script and launch the (2) Locators with the following Gfsh run command (you will need to adjust your file system paths):

gfsh>run --file=/Users/jblum/pivdev/spring-data-examples-workspace/spring-data-geode-p2p-example/src/bin/gfsh/twoLocatorCluster.gfsh

You should see output similar to the following after the Locators startup successfully:

1. Executing - start locator --name=LocatorOne --port=11235

Starting a Geode Locator in /Users/jblum/pivdev/lab/LocatorOne...
........
Locator in /Users/jblum/pivdev/lab/LocatorOne on 10.99.199.19[11235] as LocatorOne is currently online.
Process ID: 76101
Uptime: 10 seconds
Geode Version: 1.14.4
Java Version: 1.8.0_351
Log File: /Users/jblum/pivdev/lab/LocatorOne/LocatorOne.log
JVM Arguments: -Dgemfire.enable-cluster-configuration=true -Dgemfire.load-cluster-configuration-from-dir=false -Dgemfire.launcher.registerSignalHandlers=true -Djava.awt.headless=true -Dsun.rmi.dgc.server.gcInterval=9223372036854775806
Class-Path: /Users/jblum/pivdev/apache-geode-1.14.4/lib/geode-core-1.14.4.jar:/Users/jblum/pivdev/apache-geode-1.14.4/lib/geode-dependencies.jar

Successfully connected to: JMX Manager [host=10.99.199.19, port=1099]

Cluster configuration service is up and running.

2. Executing - start locator --name=LocatorTwo --port=12480

Starting a Geode Locator in /Users/jblum/pivdev/lab/LocatorTwo...
.....
Locator in /Users/jblum/pivdev/lab/LocatorTwo on 10.99.199.19[12480] as LocatorTwo is currently online.
Process ID: 76179
Uptime: 5 seconds
Geode Version: 1.14.4
Java Version: 1.8.0_351
Log File: /Users/jblum/pivdev/lab/LocatorTwo/LocatorTwo.log
JVM Arguments: -Dgemfire.default.locators=10.99.199.19[11235] -Dgemfire.enable-cluster-configuration=true -Dgemfire.load-cluster-configuration-from-dir=false -Dgemfire.launcher.registerSignalHandlers=true -Djava.awt.headless=true -Dsun.rmi.dgc.server.gcInterval=9223372036854775806
Class-Path: /Users/jblum/pivdev/apache-geode-1.14.4/lib/geode-core-1.14.4.jar:/Users/jblum/pivdev/apache-geode-1.14.4/lib/geode-dependencies.jar

3. Executing - list members

Member Count : 2

   Name    | Id
---------- | ------------------------------------------------------------------
LocatorOne | 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 [Coordinator]
LocatorTwo | 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001

************************* Execution Summary ***********************
Script file: /Users/jblum/pivdev/spring-data-examples-workspace/spring-data-geode-p2p-example/src/bin/gfsh/twoLocatorCluster.gfsh

Command-1 : start locator --name=LocatorOne --port=11235
Status    : PASSED

Command-2 : start locator --name=LocatorTwo --port=12480
Status    : PASSED

Command-3 : list members
Status    : PASSED

The essential output at this point is:

gfsh>list members
Member Count : 2

   Name    | Id
---------- | ------------------------------------------------------------------
LocatorOne | 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 [Coordinator]
LocatorTwo | 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001

Next, we can start our 2 Spring (SDG) configured and bootstrapped Apache Geode servers.

The source code for the Spring configured Apache Geode server is available here. I use the same class to launch both servers.

Additionally, you can see that I declared a Region bean definition in Spring Java config.

You do not need to use Spring [Data Geode] XML configuration to configure your Apache Geode servers. SDG's Annotation-based configuration model is equally powerful and perhaps even more robust and flexible.

I simply need to declare the type of application, as a peer Cache instance using SDG's @PeerCacheApplication annotation.

I then use configuration properties to vary the peer server node "name" (here) as well as set the Locators (here) used to join the cluster.

In my case, I simply created 2 run profiles in my IDE where the servers would run with:

$ java -classpath ... \
  -Dspring.data.gemfire.name=ServerOne \
  -Dspring.data.gemfire.locators=localhost[11235],localhost[12480] \
  org.example.spring.geode.server.SpringApacheGeodeServerApplication

The only difference in server 2's run configuration would be the name, set to "ServerTwo". Peer nodes in a cluster must be uniquely named; this is an Apache Geode requirement.

Additionally, the server process would terminate immediately if not for the block(..) method (see here, then here).

An Apache Geode server terminates immediately because there is no non-daemon Thread running in an Apache Geode server that blocks and prevents the JVM process from exiting. This is even true when using Apache Geode's API (specifically, the CacheFactory to configure and bootstrap a peer Cache) directly, without Spring.

Starting non-CacheServer members in Gfsh does block, because Gfsh does something similar (and here) to what I had to do in code (I know because I wrote it ;-P).

NOTE: This is not necessary for CacheServer instances because they already block. They block because they start non-daemon Thread(s) listening for Socket connections from clients (ClientCache instances), and therefore do not require any custom logic to prevent the server from terminating prematurely (see here). Anyway, onward...
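As a plain-Java illustration of that point (this is not Geode code): the JVM exits once the last non-daemon Thread finishes, so daemon Threads alone will not hold a server process open.

```java
public class JvmLivenessDemo {

    // Describes which of two example Threads would keep a JVM alive.
    static String describe() {
        Thread daemon = new Thread(() -> {});
        daemon.setDaemon(true);               // daemon: does NOT block JVM exit
        Thread worker = new Thread(() -> {}); // non-daemon: JVM waits for it
        return "daemon=" + daemon.isDaemon() + ",worker=" + worker.isDaemon();
    }

    public static void main(String[] args) {
        // A server whose only lingering Threads are daemons terminates as soon
        // as main(..) returns, which is why block(..)-style logic (e.g. waiting
        // on System.in) is needed in a non-CacheServer peer.
        System.out.println(describe());
    }
}
```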

Output will appear as follows when starting up the Spring configured/bootstrapped Apache Geode servers:

2022-11-14 16:28:49,482  INFO e.logging.internal.LoggingProviderLoader:  75 - Using org.apache.geode.logging.internal.SimpleLoggingProvider for service org.apache.geode.logging.internal.spi.LoggingProvider
2022-11-14 16:28:49,664  INFO he.geode.logging.internal.LoggingSession: 100 - 
---------------------------------------------------------------------------

  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with this
  work for additional information regarding copyright ownership.

  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with the
  License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
  WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the
  License for the specific language governing permissions and limitations
  under the License.

---------------------------------------------------------------------------
Build-Id: dickc 0
Build-Java-Vendor: BellSoft
Build-Java-Version: 1.8.0_272
Build-Platform: Linux 4.15.0-122-generic amd64
Product-Name: Apache Geode
Product-Version: 1.14.4
Source-Date: 2022-03-02 14:27:08 -0800
Source-Repository: support/1.14
Source-Revision: a45cbe90920c5ea9988a1674e729499abb150db1
Running on: /10.99.199.19, 12 cpu(s), x86_64 Mac OS X 12.6 
Communications version: 125
Process ID: 76212
User: jblum
Current dir: /Users/jblum/pivdev/spring-data-examples-workspace/spring-data-geode-p2p-example
Home dir: /Users/jblum
Command Line Parameters:
  -Dspring.data.gemfire.name=ServerTwo
  -Dspring.data.gemfire.locators=localhost[11235],localhost[12480]
  -javaagent:/Applications/IntelliJ IDEA 2021.3.3 CE.app/Contents/lib/idea_rt.jar=55498:/Applications/IntelliJ IDEA 2021.3.3 CE.app/Contents/bin
  -Dfile.encoding=UTF-8
Class Path:
...
..
.
2022-11-14 16:28:49,671  INFO he.geode.logging.internal.LoggingSession: 104 - Startup Configuration: 
### GemFire Properties defined with api ###
disable-auto-reconnect=true
locators=localhost[11235],localhost[12480]
log-level=config
mcast-port=0
name=ServerTwo
use-cluster-configuration=false
### GemFire Properties using default values ###
ack-severe-alert-threshold=0
ack-wait-threshold=15
archive-disk-space-limit=0
...
..
.
2022-11-14 16:28:49,676  INFO he.geode.internal.InternalDataSerializer: 416 - initializing InternalDataSerializer with 8 services
2022-11-14 16:28:49,696  INFO buted.internal.ClusterOperationExecutors: 188 - Serial Queue info : THROTTLE_PERCENT: 0.75 SERIAL_QUEUE_BYTE_LIMIT :41943040 SERIAL_QUEUE_THROTTLE :31457280 TOTAL_SERIAL_QUEUE_BYTE_LIMIT :83886080 TOTAL_SERIAL_QUEUE_THROTTLE :31457280 SERIAL_QUEUE_SIZE_LIMIT :20000 SERIAL_QUEUE_SIZE_THROTTLE :15000
2022-11-14 16:28:49,827  INFO ributed.internal.membership.gms.Services: 201 - Starting membership services
Nov 14, 2022 4:28:49 PM org.jgroups.protocols.UNICAST3 init
INFO: both the regular and OOB thread pools are disabled; UNICAST3 could be removed (JGRP-2069)
2022-11-14 16:28:49,886  INFO ributed.internal.membership.gms.Services: 575 - Established local address 10.99.199.19(ServerTwo:76212):41003
2022-11-14 16:28:49,887  INFO ributed.internal.membership.gms.Services: 406 - JGroups channel created (took 60ms)
2022-11-14 16:28:49,909  INFO istributed.internal.direct.DirectChannel: 148 - GemFire P2P Listener started on /10.99.199.19:56980
2022-11-14 16:28:49,911  INFO ributed.internal.membership.gms.Services: 714 - Started failure detection server thread on /10.99.199.19:59614.
2022-11-14 16:28:49,924  INFO ributed.internal.membership.gms.Services:1194 - received FindCoordinatorResponse(coordinator=10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000, fromView=true, viewId=2, registrants=[10.99.199.19(ServerTwo:76212):41003], senderId=10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000, network partition detection enabled=true, locators preferred as coordinators=true, view=View[10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000|2] members: [10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000, 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001, 10.99.199.19(ServerOne:76210)<v2>:41002{lead}]) from locator HostAndPort[localhost:11235]
2022-11-14 16:28:49,924  INFO ributed.internal.membership.gms.Services:1197 - Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join
...
..
.
2022-11-14 16:28:50,246  INFO ributed.internal.membership.gms.Services:1767 - Finished joining (took 334ms).
2022-11-14 16:28:50,247  INFO uted.internal.ClusterDistributionManager: 473 - Starting DistributionManager 10.99.199.19(ServerTwo:76212)<v3>:41003.  (took 540 ms)
2022-11-14 16:28:50,247  INFO uted.internal.ClusterDistributionManager: 686 - Initial (distribution manager) view, View[10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000|3] members: [10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000, 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001, 10.99.199.19(ServerOne:76210)<v2>:41002{lead}, 10.99.199.19(ServerTwo:76212)<v3>:41003]
2022-11-14 16:28:50,274  INFO uted.internal.ClusterDistributionManager: 611 - Member 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 is equivalent or in the same redundancy zone.
2022-11-14 16:28:50,274  INFO uted.internal.ClusterDistributionManager: 611 - Member 10.99.199.19(ServerOne:76210)<v2>:41002 is equivalent or in the same redundancy zone.
2022-11-14 16:28:50,274  INFO uted.internal.ClusterDistributionManager: 611 - Member 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001 is equivalent or in the same redundancy zone.
2022-11-14 16:28:50,276  INFO uted.internal.ClusterDistributionManager: 389 - DistributionManager 10.99.199.19(ServerTwo:76212)<v3>:41003 started on localhost[12480],localhost[11235]. There were 3 other DMs. others: [10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001, 10.99.199.19(ServerOne:76210)<v2>:41002, 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000]  (took 593 ms) 
...
..
.
2022-11-14 16:28:51,212  INFO e.geode.internal.cache.DistributedRegion:1637 - Initialization of region PdxTypes completed
2022-11-14 16:28:51,222  INFO gframework.data.gemfire.CacheFactoryBean: 281 - Created new Apache Geode version [1.14.4] Cache [ServerTwo]
2022-11-14 16:28:51,223  INFO gframework.data.gemfire.CacheFactoryBean: 281 - Connected to Distributed System [ServerTwo] as Member [10.99.199.19(ServerTwo:76212)<v3>:41003] in Group(s) [[]] with Role(s) [[]] on Host [10.99.199.19] having PID [76212]
2022-11-14 16:28:51,236  INFO data.gemfire.ReplicatedRegionFactoryBean: 281 - Creating Region [ExampleReplicateRegion] in Cache [ServerTwo]
2022-11-14 16:28:51,242  INFO data.gemfire.ReplicatedRegionFactoryBean: 281 - Created Region [ExampleReplicateRegion]
2022-11-14 16:28:51,245  WARN e.geode.internal.cache.DistributedRegion: 217 - Region ExampleReplicateRegion is being created with scope DISTRIBUTED_NO_ACK but enable-network-partition-detection is enabled in the distributed system.  This can lead to cache inconsistencies if there is a network failure.
2022-11-14 16:28:51,245  INFO e.geode.internal.cache.DistributedRegion:1135 - Initializing region ExampleReplicateRegion
2022-11-14 16:28:51,248  INFO ode.internal.cache.InitialImageOperation: 514 - Region ExampleReplicateRegion requesting initial image from 10.99.199.19(ServerOne:76210)<v2>:41002
2022-11-14 16:28:51,249  INFO ode.internal.cache.InitialImageOperation: 586 - ExampleReplicateRegion is done getting image from 10.99.199.19(ServerOne:76210)<v2>:41002. isDeltaGII is false
2022-11-14 16:28:51,249  INFO e.geode.internal.cache.DistributedRegion:1637 - Initialization of region ExampleReplicateRegion completed
Peer Cache instance [ServerTwo] running and connected to cluster [localhost[11235],localhost[12480]]
Press <enter> to stop...

NOTE: Output is from the first server I started; output from the second server should be similar.

After the servers successfully start, then we can list members in Gfsh to see the completed Apache Geode cluster:

gfsh>list members
Member Count : 4

   Name    | Id
---------- | ------------------------------------------------------------------
LocatorOne | 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 [Coordinator]
LocatorTwo | 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001
ServerOne  | 10.99.199.19(ServerOne:76210)<v2>:41002
ServerTwo  | 10.99.199.19(ServerTwo:76212)<v3>:41003

Additionally, we can list the regions in our cluster:

gfsh>list regions
List of regions
----------------------
ExampleReplicateRegion

The ExampleReplicateRegion was created with Spring configuration using a Region bean definition when bootstrapping the server(s).

We can describe region as follows:

gfsh>describe region --name=/ExampleReplicateRegion
Name            : ExampleReplicateRegion
Data Policy     : replicate
Hosting Members : ServerTwo
                  ServerOne

Non-Default Attributes Shared By Hosting Members  

 Type  |    Name     | Value
------ | ----------- | ---------
Region | size        | 0
       | data-policy | REPLICATE

We can clearly see that the "ExampleReplicateRegion" Region is a REPLICATE and that the Region is hosted on "ServerOne" and "ServerTwo".

So far so good!

Now, I am going to kill the "LocatorOne" process in a separate terminal to simulate a crash.

In the output when we started the Locators using Gfsh, we know "LocatorOne's" process ID (76101):

Locator in /Users/jblum/pivdev/lab/LocatorOne on 10.99.199.19[11235] as LocatorOne is currently online.
Process ID: 76101
Uptime: 10 seconds
...

So, I can run:

$ kill -9 76101

You should see output similar to the following (from either Spring peer Cache server logs):

...
2022-11-14 16:40:51,639  INFO ributed.internal.membership.gms.Services:1273 - Performing availability check for suspect member 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 reason=member unexpectedly shut down shared, unordered connection
2022-11-14 16:40:51,642  INFO rnal.tcpserver.AdvancedSocketCreatorImpl: 104 - Failed to connect to /10.99.199.19:50302
2022-11-14 16:40:51,643  INFO ributed.internal.membership.gms.Services:1314 - Availability check failed for member 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000
2022-11-14 16:40:51,646  INFO ributed.internal.membership.gms.Services:1196 - received suspect message from myself for 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000: member unexpectedly shut down shared, unordered connection
2022-11-14 16:40:51,646  INFO ributed.internal.membership.gms.Services:1196 - received suspect message from 10.99.199.19(ServerOne:76210)<v2>:41002 for 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000: member unexpectedly shut down shared, unordered connection
2022-11-14 16:40:51,646  INFO ributed.internal.membership.gms.Services:1196 - received suspect message from myself for 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000: failed availability check
2022-11-14 16:40:51,647  INFO ributed.internal.membership.gms.Services:1196 - received suspect message from 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001 for 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000: member unexpectedly shut down shared, unordered connection
2022-11-14 16:40:55,274  WARN org.apache.geode.internal.tcp.TCPConduit: 771 - Attempting TCP/IP reconnect to  10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000
2022-11-14 16:40:55,275  INFO org.apache.geode.internal.tcp.Connection:1051 - Connection: shared=false ordered=true failed to connect to peer 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 because: java.net.ConnectException: Connection refused
2022-11-14 16:40:56,707  INFO ributed.internal.membership.gms.Services:1500 - received new view: View[10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001|10] members: [10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001, 10.99.199.19(ServerOne:76210)<v2>:41002{lead}, 10.99.199.19(ServerTwo:76212)<v3>:41003]  crashed: [10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000]
old view is: View[10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000|3] members: [10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000, 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001, 10.99.199.19(ServerOne:76210)<v2>:41002{lead}, 10.99.199.19(ServerTwo:76212)<v3>:41003]
2022-11-14 16:40:56,710  INFO uted.internal.ClusterDistributionManager:1907 - Member at 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 unexpectedly left the distributed cache: departed membership view
2022-11-14 16:40:57,280  INFO org.apache.geode.internal.tcp.TCPConduit: 852 - Ending reconnect attempt because 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 has disappeared.
...

Because I started "LocatorOne" first, it became the "Coordinator" in the cluster and also was the node that ran the Manager that Gfsh was connected to, so naturally, Gfsh got disconnected as well.

gfsh>
No longer connected to 10.99.199.19[1099].
gfsh>
No longer connected to 10.99.199.19[1099].
gfsh>

However, we can reconnect to "LocatorTwo", which I suspect became the new "Coordinator" of the remaining cluster.

gfsh>connect --locator=localhost[12480]
Connecting to Locator at [host=localhost, port=12480] ..
Connecting to Manager at [host=10.99.199.19, port=1099] ..
Successfully connected to: [host=10.99.199.19, port=1099]

You are connected to a cluster of version: 1.14.4

And, as I suspected, all the other members ("LocatorTwo", started with Gfsh, along with the 2 servers that I configured and bootstrapped with Spring) remain:

gfsh>list members
Member Count : 3

   Name    | Id
---------- | ------------------------------------------------------------------
LocatorTwo | 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001 [Coordinator]
ServerOne  | 10.99.199.19(ServerOne:76210)<v2>:41002
ServerTwo  | 10.99.199.19(ServerTwo:76212)<v3>:41003

As does the Region and the servers hosting it:

gfsh>list regions
List of regions
----------------------
ExampleReplicateRegion

gfsh>describe region --name=/ExampleReplicateRegion
Name            : ExampleReplicateRegion
Data Policy     : replicate
Hosting Members : ServerTwo
                  ServerOne

Non-Default Attributes Shared By Hosting Members  

 Type  |    Name     | Value
------ | ----------- | ---------
Region | size        | 0
       | data-policy | REPLICATE

So, all works as I would expect.

If there were a problem, then I would suspect the problem would exist with or without Spring. In other words, if the peer Cache application servers were simply configured and bootstrapped (coded directly) with Apache Geode's API, the issue would still occur.

sjoshid commented 1 year ago

Solid explanation as always and it all makes sense.

I will be running everything on my localhost

I think that is the problem. No issues show up when you run both the Locators on the same machine.

Maybe you can try with different machines?

jxblum commented 1 year ago

Currently, I do not have the resources or access to run this example on multiple machines. But I would also expect it to work just the same. I can look more into this tomorrow.

If it does not work then:

1. There may be a network issue.
2. It might possibly be a bug in Apache Geode.
3. Etc...

Spring (Data Geode) is not doing anything special in this case. It isn't opening Sockets or connecting to Apache Geode servers using custom code in SDG. It is simply passing configuration to the Apache Geode server instance during construction, using Apache Geode's API, and Apache Geode is doing all the work.

SDG, specifically with respect to configuration, is simply a facade around Apache Geode's API to apply Spring's conventions and robust programming model to configure Apache Geode in a Spring [Boot] context with ease.

For example, the following class is equivalent to what Spring Data Geode does, to an extent, but simply using Apache Geode's API, not Spring. So, you should see the same result in either case.
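(The class itself was linked rather than inlined; a hedged sketch of what such an ApacheGeodeServerApplication class might look like, assuming Apache Geode on the classpath, follows.)

```java
// Sketch only: a peer server configured and bootstrapped with Apache Geode's
// API alone (no Spring), driven by the gemfire.name and gemfire.locators
// JVM System properties mentioned in the NOTE below.
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.RegionShortcut;

public class ApacheGeodeServerApplication {

    public static void main(String[] args) throws Exception {

        Cache cache = new CacheFactory()
            .set("name", System.getProperty("gemfire.name", "Server"))
            .set("locators",
                System.getProperty("gemfire.locators", "localhost[10334]"))
            .create();

        cache.createRegionFactory(RegionShortcut.REPLICATE)
            .create("ExampleReplicateRegion");

        System.out.println("Press <enter> to stop...");
        System.in.read(); // block the non-daemon main Thread so the JVM stays up
    }
}
```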

NOTE: In this case, using just the ApacheGeodeServerApplication class, I am using Apache Geode properties to configure the nodes on startup, namely: gemfire.name and gemfire.locators JVM System properties.

sjoshid commented 1 year ago

Got it. It looks like you are hinting at creating a Geode ticket, and if so, I don't have any issues with that.

Let me try to reproduce this outside of SDG.

Thank you.

jxblum commented 1 year ago

I will see if someone on the GemFire team is available to run this example in a multi-host capacity tomorrow. In the meantime, I have dug up an old work laptop of mine to see if I can run this example across both machines.

Will report back when I know more, either way.

jxblum commented 1 year ago

Ok, I have got my 2nd machine up and running.

Replaying the scenario...

I start "LocatorOne" on host 1 (10.99.199.19):

gfsh>start locator --name=LocatorOne --port=11235
Starting a Geode Locator in /Users/jblum/pivdev/lab/LocatorOne...
........
Locator in /Users/jblum/pivdev/lab/LocatorOne on 10.99.199.19[11235] as LocatorOne is currently online.
Process ID: 77993
Uptime: 10 seconds
Geode Version: 1.14.4
Java Version: 1.8.0_351
Log File: /Users/jblum/pivdev/lab/LocatorOne/LocatorOne.log
JVM Arguments: -Dgemfire.enable-cluster-configuration=true -Dgemfire.load-cluster-configuration-from-dir=false -Dgemfire.launcher.registerSignalHandlers=true -Djava.awt.headless=true -Dsun.rmi.dgc.server.gcInterval=9223372036854775806
Class-Path: /Users/jblum/pivdev/apache-geode-1.14.4/lib/geode-core-1.14.4.jar:/Users/jblum/pivdev/apache-geode-1.14.4/lib/geode-dependencies.jar

Successfully connected to: JMX Manager [host=10.99.199.19, port=1099]

Cluster configuration service is up and running.

I then start my second Locator on host 2 (10.99.199.20) with the following Gfsh start locator command (output is similar):

gfsh> start locator --name=LocatorTwo --port=12480 --locators=10.99.199.19[11235]
...

I then list members (from both hosts, output looks similar):

gfsh>list members
Member Count : 2

   Name    | Id
---------- | ------------------------------------------------------------------
LocatorOne | 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000 [Coordinator]
LocatorTwo | 10.99.199.20(LocatorTwo:1893:locator)<ec><v1>:41000

NOTE: Notice the IP addresses.

I then restart my Spring configured/bootstrapped servers.

Once started successfully, I again list members (from both hosts, output is similar):

gfsh>list members
Member Count : 4

   Name    | Id
---------- | ------------------------------------------------------------------
LocatorOne | 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000 [Coordinator]
LocatorTwo | 10.99.199.20(LocatorTwo:1893:locator)<ec><v1>:41000
ServerOne  | 10.99.199.19(ServerOne:78107)<v2>:41001
ServerTwo  | 10.99.199.19(ServerTwo:78119)<v3>:41002

NOTE: I simply run both Spring configured/bootstrapped server instances on the same host, with "LocatorOne". Again, notice the IP addresses.

I then list regions and describe region (from both hosts, output looks similar):

gfsh>list regions
List of regions
----------------------
ExampleReplicateRegion

gfsh>describe region --name=/ExampleReplicateRegion
Name            : ExampleReplicateRegion
Data Policy     : replicate
Hosting Members : ServerTwo
                  ServerOne

Non-Default Attributes Shared By Hosting Members  

 Type  |    Name     | Value
------ | ----------- | ---------
Region | size        | 0
       | data-policy | REPLICATE

Now, I kill "LocatorOne" on host 1 (10.99.199.19):

$ kill -9 77993

Once again, the logs from the servers look the same:

Press <enter> to stop...
2022-11-14 18:32:31,039  INFO ributed.internal.membership.gms.Services:1273 - Performing availability check for suspect member 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000 reason=member unexpectedly shut down shared, unordered connection
2022-11-14 18:32:31,042  INFO rnal.tcpserver.AdvancedSocketCreatorImpl: 104 - Failed to connect to /10.99.199.19:47132
2022-11-14 18:32:31,042  INFO ributed.internal.membership.gms.Services:1314 - Availability check failed for member 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000
2022-11-14 18:32:31,045  INFO ributed.internal.membership.gms.Services:1196 - received suspect message from myself for 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000: member unexpectedly shut down shared, unordered connection
2022-11-14 18:32:31,045  INFO ributed.internal.membership.gms.Services:1196 - received suspect message from 10.99.199.19(ServerOne:78107)<v2>:41001 for 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000: member unexpectedly shut down shared, unordered connection
2022-11-14 18:32:31,045  INFO ributed.internal.membership.gms.Services:1196 - received suspect message from myself for 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000: failed availability check
2022-11-14 18:32:31,329  INFO ributed.internal.membership.gms.Services:1196 - received suspect message from 10.99.199.20(LocatorTwo:1893:locator)<ec><v1>:41000 for 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000: member unexpectedly shut down shared, unordered connection
2022-11-14 18:32:34,156  WARN org.apache.geode.internal.tcp.TCPConduit: 771 - Attempting TCP/IP reconnect to  10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000
2022-11-14 18:32:34,158  INFO org.apache.geode.internal.tcp.Connection:1051 - Connection: shared=false ordered=true failed to connect to peer 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000 because: java.net.ConnectException: Connection refused
2022-11-14 18:32:36,160  WARN org.apache.geode.internal.tcp.Connection:1028 - Connection: Attempting reconnect to peer 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000
2022-11-14 18:32:36,345  INFO ributed.internal.membership.gms.Services:1500 - received new view: View[10.99.199.20(LocatorTwo:1893:locator)<ec><v1>:41000|10] members: [10.99.199.20(LocatorTwo:1893:locator)<ec><v1>:41000, 10.99.199.19(ServerOne:78107)<v2>:41001{lead}, 10.99.199.19(ServerTwo:78119)<v3>:41002]  crashed: [10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000]
old view is: View[10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000|3] members: [10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000, 10.99.199.20(LocatorTwo:1893:locator)<ec><v1>:41000, 10.99.199.19(ServerOne:78107)<v2>:41001{lead}, 10.99.199.19(ServerTwo:78119)<v3>:41002]
2022-11-14 18:32:36,349  INFO uted.internal.ClusterDistributionManager:1907 - Member at 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000 unexpectedly left the distributed cache: departed membership view
2022-11-14 18:32:38,165  INFO org.apache.geode.internal.tcp.TCPConduit: 852 - Ending reconnect attempt because 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000 has disappeared.

Gfsh was disconnected on host 1 (10.99.199.19):

gfsh>
No longer connected to 10.99.199.19[1099].
gfsh>
No longer connected to 10.99.199.19[1099].
gfsh>

However, I can reconnect to the cluster on host 2 (10.99.199.20) from host 1:

gfsh>connect --locator=10.99.199.20[12480]
Connecting to Locator at [host=10.99.199.20, port=12480] ..
Connecting to Manager at [host=10.99.199.20, port=1099] ..
Successfully connected to: [host=10.99.199.20, port=1099]

You are connected to a cluster of version: 1.14.4

And, everything seems intact:

gfsh>list members
Member Count : 3

   Name    | Id
---------- | -----------------------------------------------------------------
LocatorTwo | 10.99.199.20(LocatorTwo:1893:locator)<ec><v1>:41000 [Coordinator]
ServerOne  | 10.99.199.19(ServerOne:78107)<v2>:41001
ServerTwo  | 10.99.199.19(ServerTwo:78119)<v3>:41002

gfsh>
gfsh>
gfsh>list regions
List of regions
----------------------
ExampleReplicateRegion

gfsh>
gfsh>
gfsh>describe region --name=/ExampleReplicateRegion
Name            : ExampleReplicateRegion
Data Policy     : replicate
Hosting Members : ServerTwo
                  ServerOne

Non-Default Attributes Shared By Hosting Members  

 Type  |    Name     | Value
------ | ----------- | ---------
Region | size        | 0
       | data-policy | REPLICATE

This output is the same on host 2 (10.99.199.20) running "LocatorTwo".

So, it would appear to work on my end.

For complete coverage, I also tested this scenario with servers configured and bootstrapped using the pure Apache Geode API.

It all worked as expected.
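For reference, a pure Apache Geode API server for this scenario can be sketched roughly as shown below. The member name, cache server port, and class name are illustrative assumptions; the key point, as with the Spring configuration, is the `locators` property listing both Locators:

```java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.RegionShortcut;
import org.apache.geode.cache.server.CacheServer;

public class ExampleGeodeServerApplication {

    public static void main(String[] args) throws Exception {

        // Join the cluster via BOTH Locators so membership survives the loss
        // of either one.
        Cache cache = new CacheFactory()
            .set("name", "ServerOne")
            .set("locators", "10.99.199.19[11235],10.99.199.20[12480]")
            .create();

        // Peer REPLICATE Region, as seen in the Gfsh 'describe region' output.
        cache.createRegionFactory(RegionShortcut.REPLICATE)
            .create("ExampleReplicateRegion");

        // Accept client connections (optional for a purely peer-to-peer test).
        CacheServer server = cache.addCacheServer();
        server.setPort(40404);
        server.start();
    }
}
```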

sjoshid commented 1 year ago

You are creating both locators from the same machine. Maybe that is what is missing in my case. I'm creating them separately. I'm going to give this a shot.

sjoshid commented 1 year ago

> You are creating both locators from the same machine. Maybe that is what is missing in my case. I'm creating them separately. I'm going to give this a shot.

Wrong. I see you started the second locator on a different machine and gave it details about the first.

And it worked this time!! My understanding was wrong.

Thanks.