Closed by sjoshid 1 year ago
There are 2 different bits to this equation.
SERVER-SIDE
First, you need to consider what is expected on the server-side, between the peers in the cluster. Each server-side, Spring-configured Locator application (@LocatorApplication) should be configured with the Apache Geode locators property pointing to the other Locator.
For example, on hostOne, running LocatorOne, the Apache Geode locators property (i.e. spring.data.gemfire.locators), or the corresponding @LocatorApplication annotation's locators attribute, should be set as follows:
@LocatorApplication(name = "LocatorOne", port = 11235, locators="hostTwo[12480]")
class MySpringConfiguredApacheGeodeLocatorApplication {
// ...
}
Alternatively, in Spring Boot application.properties for LocatorOne running on hostOne, you can set:
# Spring Boot application.properties for LocatorOne running on hostOne
spring.data.gemfire.locators=hostTwo[12480]
Subsequently, on hostTwo, LocatorTwo should be configured as follows:
@LocatorApplication(name = "LocatorTwo", port = 12480, locators="hostOne[11235]")
class MySpringConfiguredApacheGeodeLocatorApplication {
// ...
}
Or, in Spring Boot application.properties for LocatorTwo running on hostTwo, using:
# Spring Boot application.properties for LocatorTwo running on hostTwo
spring.data.gemfire.locators=hostOne[11235]
NOTE: You do not need to configure the Locator port on either the Locator or the client-side if you are using the default port, 10334.
NOTE: Also, the hostnames of the machines running the Locator applications must be either the DNS names of your machines or their IP addresses.
CLIENT-SIDE
Now, for the client-side, you must configure your Apache Geode client Pools with the complete list of Locators in the server-side cluster. For example (using the "DEFAULT" Pool):
@ClientCacheApplication(name = "MyClient", locators = {
@ClientCacheApplication.Locator(host = "hostOne", port=11235),
@ClientCacheApplication.Locator(host = "hostTwo", port=12480)
})
class MySpringConfiguredApacheGeodeClientCacheApplication {
// ...
}
Alternatively (and recommended), you can of course always set the appropriate property:
# Spring Boot application.properties for the Spring-configured, Apache Geode ClientCache application
spring.data.gemfire.pool.locators=hostOne[11235],hostTwo[12480]
Of course, you can configure any specifically "named" Pool in either SDG's annotation-based configuration or in Spring Boot application.properties; see here.
Also, you have the option to configure client-side Pool properties (attributes) using SDG's PoolConfigurer (see Javadoc; also see Ref Guide) if you need finer-grained, programmatic (e.g. conditional) control over the Pool configuration, such as hosts and ports.
The same sort of programmatic, Configurer-based configuration approach is also applicable on the server-side, e.g. with LocatorConfigurer.
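As a hedged sketch only (assuming SDG on the classpath; the bean name, hosts, and ports here are illustrative, not from the original discussion), a PoolConfigurer bean that programmatically adds Locator endpoints might look like:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.data.gemfire.client.PoolFactoryBean;
import org.springframework.data.gemfire.config.annotation.PoolConfigurer;
import org.springframework.data.gemfire.support.ConnectionEndpoint;

// Sketch: a PoolConfigurer bean applied to client Pools, allowing the Locator
// endpoints to be decided programmatically (e.g. conditionally at runtime).
class ClientPoolConfiguration {

    @Bean
    PoolConfigurer locatorPoolConfigurer() {
        return (String beanName, PoolFactoryBean poolFactoryBean) ->
            poolFactoryBean.addLocators(
                new ConnectionEndpoint("hostOne", 11235),
                new ConnectionEndpoint("hostTwo", 12480));
    }
}
```

Because the Configurer runs before the Pool is created, this is a natural hook for conditional logic (environment checks, service discovery lookups, etc.) that plain properties cannot express.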
Anyway, once the configuration is set up properly, what Apache Geode (or GemFire) "does" and "shows" is outside of the control of Spring (Data Geode).
For example, if the list clients Gfsh command (here) is not working correctly, then this might be an issue in Apache Geode, and specifically with Gfsh.
Also, the Gfsh list clients command specifically states:
"Display a list of connected clients and the servers to which they connect."
In this case, the Gfsh list clients command might only list "servers", as in specifically CacheServer nodes. I don't remember specifically off the top of my head and you might need to test this!
Technically, cache clients only use the Locators to find servers in the cluster. Once the topology is known, the clients perform data access operations (in)directly (depending on the "single-hop" configuration and the Region type, e.g. PARTITION Regions) against the data (server) node with the data of interest.
Anyway, hope this helps for now.
You might also want to have a look at the Spring Boot for Apache Geode (SBDG) project documentation.
See here in particular, which talks about configuring and bootstrapping Apache Geode Locator applications using Spring Boot with SBDG.
NOTE: SBDG is a set of Spring Boot extensions geared directly toward Apache Geode. It builds on Spring Data for Apache Geode (SDG) as well as Spring Session for Apache Geode (SSDG) and Spring Test for Apache Geode (STDG).
NOTE: SBDG is a better SDG, ;-)
I was just informed that this Apache Geode issue might be related to the problem: https://issues.apache.org/jira/browse/GEODE-9822.
Also, if you could please share a bit more detail, that might be helpful and/or shed light on your specific problem, such as, but not limited to:
Even if you could provide a small reproducer, including the steps you took to reproduce the issue (such as, "I used Gfsh at step X to verify that ???"), that would go a long way in helping as well.
Thanks!
@jxblum First of all, sorry for such a crappy description. I was pretty confident it was something silly on my side. I have updated the desc with more details.
Your first comment seems to assume I'm using a client-server topology. But you'd be glad to know that I'm using a much simpler peer-to-peer topology. I said it in the desc and I'll say it again: my app is not a Spring Boot app. It is a vanilla Spring app.
Hi @sjoshid - Thank you for updating your Issue description with more details.
However, my comment that followed stands whether you are using the client/server topology or the p2p topology (and no clients). I specifically addressed the SERVER-SIDE in that comment.
Nevertheless, let's walk through a simple example by following your steps to reproduce the issue. I will be running everything on my localhost, using Java 8 with SDG 2.7.5 (the latest patch release in 2.7), which is based on Apache Geode 1.14.4.
NOTE: Whether I run on a single host (e.g. localhost) or multiple hosts really should not matter. The behavior will be the same unless you have other network issues/restrictions/limitations going on.
Java:
$ java -version
java version "1.8.0_351"
Java(TM) SE Runtime Environment (build 1.8.0_351-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.351-b10, mixed mode)
Apache Geode installation & Gfsh:
$ echo $GEODE_HOME
/Users/jblum/pivdev/apache-geode-1.14.4
$ gfsh
_________________________ __
/ _____/ ______/ ______/ /____/ /
/ / __/ /___ /_____ / _____ /
/ /__/ / ____/ _____/ / / / /
/______/_/ /______/_/ /_/ 1.14.4
Monitor and Manage Apache Geode
gfsh>
NOTE: You can find the source code for the example I am about to demonstrate in GitHub, here.
Next, I scripted the execution of the 2 Apache Geode Locators on ports 11235 and 12480 using Gfsh.
You can run this script and launch the (2) Locators with the following Gfsh run command (you will need to adjust your file system paths):
gfsh>run --file=/Users/jblum/pivdev/spring-data-examples-workspace/spring-data-geode-p2p-example/src/bin/gfsh/twoLocatorCluster.gfsh
You should see output similar to the following after the Locators startup successfully:
1. Executing - start locator --name=LocatorOne --port=11235
Starting a Geode Locator in /Users/jblum/pivdev/lab/LocatorOne...
........
Locator in /Users/jblum/pivdev/lab/LocatorOne on 10.99.199.19[11235] as LocatorOne is currently online.
Process ID: 76101
Uptime: 10 seconds
Geode Version: 1.14.4
Java Version: 1.8.0_351
Log File: /Users/jblum/pivdev/lab/LocatorOne/LocatorOne.log
JVM Arguments: -Dgemfire.enable-cluster-configuration=true -Dgemfire.load-cluster-configuration-from-dir=false -Dgemfire.launcher.registerSignalHandlers=true -Djava.awt.headless=true -Dsun.rmi.dgc.server.gcInterval=9223372036854775806
Class-Path: /Users/jblum/pivdev/apache-geode-1.14.4/lib/geode-core-1.14.4.jar:/Users/jblum/pivdev/apache-geode-1.14.4/lib/geode-dependencies.jar
Successfully connected to: JMX Manager [host=10.99.199.19, port=1099]
Cluster configuration service is up and running.
2. Executing - start locator --name=LocatorTwo --port=12480
Starting a Geode Locator in /Users/jblum/pivdev/lab/LocatorTwo...
.....
Locator in /Users/jblum/pivdev/lab/LocatorTwo on 10.99.199.19[12480] as LocatorTwo is currently online.
Process ID: 76179
Uptime: 5 seconds
Geode Version: 1.14.4
Java Version: 1.8.0_351
Log File: /Users/jblum/pivdev/lab/LocatorTwo/LocatorTwo.log
JVM Arguments: -Dgemfire.default.locators=10.99.199.19[11235] -Dgemfire.enable-cluster-configuration=true -Dgemfire.load-cluster-configuration-from-dir=false -Dgemfire.launcher.registerSignalHandlers=true -Djava.awt.headless=true -Dsun.rmi.dgc.server.gcInterval=9223372036854775806
Class-Path: /Users/jblum/pivdev/apache-geode-1.14.4/lib/geode-core-1.14.4.jar:/Users/jblum/pivdev/apache-geode-1.14.4/lib/geode-dependencies.jar
3. Executing - list members
Member Count : 2
Name | Id
---------- | ------------------------------------------------------------------
LocatorOne | 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 [Coordinator]
LocatorTwo | 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001
************************* Execution Summary ***********************
Script file: /Users/jblum/pivdev/spring-data-examples-workspace/spring-data-geode-p2p-example/src/bin/gfsh/twoLocatorCluster.gfsh
Command-1 : start locator --name=LocatorOne --port=11235
Status : PASSED
Command-2 : start locator --name=LocatorTwo --port=12480
Status : PASSED
Command-3 : list members
Status : PASSED
The essential output at this point is:
gfsh>list members
Member Count : 2
Name | Id
---------- | ------------------------------------------------------------------
LocatorOne | 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 [Coordinator]
LocatorTwo | 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001
Next, we can start our 2 Spring (SDG) configured and bootstrapped Apache Geode servers.
The source code for the Spring configured Apache Geode server is available here. I use the same class to launch both servers.
Additionally, you can see that I declared a Region bean definition in Spring Java config.
You do not need to use Spring [Data Geode] XML configuration to configure your Apache Geode servers. SDG's Annotation-based configuration model is equally powerful and perhaps even more robust and flexible.
I simply need to declare the type of application, as a peer Cache instance, using SDG's @PeerCacheApplication annotation.
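An illustrative sketch of such a server class (the author's actual source is linked above; the class name, Region key/value types, and literal values here are assumptions):

```java
import org.apache.geode.cache.GemFireCache;
import org.springframework.context.annotation.Bean;
import org.springframework.data.gemfire.ReplicatedRegionFactoryBean;
import org.springframework.data.gemfire.config.annotation.PeerCacheApplication;

// Sketch of a Spring (SDG) configured peer cache server with a REPLICATE
// Region bean; the annotation's name/locators can be overridden at runtime
// with the spring.data.gemfire.name / spring.data.gemfire.locators properties.
@PeerCacheApplication(name = "ServerOne", locators = "localhost[11235],localhost[12480]")
class SpringApacheGeodeServerApplication {

    @Bean("ExampleReplicateRegion")
    ReplicatedRegionFactoryBean<Long, Object> exampleReplicateRegion(GemFireCache cache) {
        ReplicatedRegionFactoryBean<Long, Object> region = new ReplicatedRegionFactoryBean<>();
        region.setCache(cache);
        return region;
    }
}
```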
I then use configuration properties to vary the peer server node "name" (here) as well as set the Locators (here) used to join the cluster.
In my case, I simply created 2 run profiles in my IDE where the servers would run with:
$ java --classpath ...
-Dspring.data.gemfire.name=ServerOne
-Dspring.data.gemfire.locators=localhost[11235],localhost[12480]
org.example.spring.geode.server.SpringApacheGeodeServerApplication
The only difference in server 2's run configuration would be the name, set to "ServerTwo". With Apache Geode, peer nodes in a cluster must be uniquely named. This is an Apache Geode requirement.
Additionally, the server process would terminate immediately if not for the block(..) method (see here, then here).
An Apache Geode server terminates immediately because there is no non-daemon Thread running in an Apache Geode server that blocks and prevents the JVM process from exiting. This is even true when using Apache Geode's API directly (specifically, the CacheFactory to configure and bootstrap a peer Cache), without Spring.
Starting non-CacheServer members in Gfsh does block, because Gfsh does something similar (and here) to what I had to do in code (I know because I wrote it, ;-P).
NOTE: This is not true for CacheServer instances, because they do block. They block because they start non-daemon Thread(s) listening for Socket connections from clients (ClientCache instances), and therefore do not require any custom logic to prevent the server from terminating prematurely (see here). Anyway, onward...
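The block-until-stopped pattern described above can be sketched in plain Java (no Geode dependency; the class and method names here are illustrative, not the author's actual code):

```java
import java.util.concurrent.CountDownLatch;

// Sketch: a non-daemon thread (main) parked on a latch is what keeps the JVM
// process alive; without it, main() returns and the JVM exits immediately.
public class BlockUntilStopped {

    private static final CountDownLatch shutdownLatch = new CountDownLatch(1);

    // Parks the calling (non-daemon) thread so the JVM stays up.
    static void block() {
        try {
            shutdownLatch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Releases the latch, allowing block() to return and the JVM to exit.
    static void stop() {
        shutdownLatch.countDown();
    }

    public static void main(String[] args) {
        // Simulates a console reader or signal handler calling stop() later.
        // Note: a daemon thread alone would NOT keep the JVM alive.
        Thread stopper = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException ignored) {
            }
            stop();
        });
        stopper.setDaemon(true);
        stopper.start();

        block(); // main (non-daemon) thread blocks here until stop() is called
        System.out.println("stopped");
    }
}
```

This mirrors the "Press &lt;enter&gt; to stop..." prompt seen in the server output below: the prompt's console read is the blocking call on a non-daemon thread.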
Output will appear as follows when starting up the Spring configured/bootstrapped Apache Geode servers:
2022-11-14 16:28:49,482 INFO e.logging.internal.LoggingProviderLoader: 75 - Using org.apache.geode.logging.internal.SimpleLoggingProvider for service org.apache.geode.logging.internal.spi.LoggingProvider
2022-11-14 16:28:49,664 INFO he.geode.logging.internal.LoggingSession: 100 -
---------------------------------------------------------------------------
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with this
work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations
under the License.
---------------------------------------------------------------------------
Build-Id: dickc 0
Build-Java-Vendor: BellSoft
Build-Java-Version: 1.8.0_272
Build-Platform: Linux 4.15.0-122-generic amd64
Product-Name: Apache Geode
Product-Version: 1.14.4
Source-Date: 2022-03-02 14:27:08 -0800
Source-Repository: support/1.14
Source-Revision: a45cbe90920c5ea9988a1674e729499abb150db1
Running on: /10.99.199.19, 12 cpu(s), x86_64 Mac OS X 12.6
Communications version: 125
Process ID: 76212
User: jblum
Current dir: /Users/jblum/pivdev/spring-data-examples-workspace/spring-data-geode-p2p-example
Home dir: /Users/jblum
Command Line Parameters:
-Dspring.data.gemfire.name=ServerTwo
-Dspring.data.gemfire.locators=localhost[11235],localhost[12480]
-javaagent:/Applications/IntelliJ IDEA 2021.3.3 CE.app/Contents/lib/idea_rt.jar=55498:/Applications/IntelliJ IDEA 2021.3.3 CE.app/Contents/bin
-Dfile.encoding=UTF-8
Class Path:
...
..
.
2022-11-14 16:28:49,671 INFO he.geode.logging.internal.LoggingSession: 104 - Startup Configuration:
### GemFire Properties defined with api ###
disable-auto-reconnect=true
locators=localhost[11235],localhost[12480]
log-level=config
mcast-port=0
name=ServerTwo
use-cluster-configuration=false
### GemFire Properties using default values ###
ack-severe-alert-threshold=0
ack-wait-threshold=15
archive-disk-space-limit=0
...
..
.
2022-11-14 16:28:49,676 INFO he.geode.internal.InternalDataSerializer: 416 - initializing InternalDataSerializer with 8 services
2022-11-14 16:28:49,696 INFO buted.internal.ClusterOperationExecutors: 188 - Serial Queue info : THROTTLE_PERCENT: 0.75 SERIAL_QUEUE_BYTE_LIMIT :41943040 SERIAL_QUEUE_THROTTLE :31457280 TOTAL_SERIAL_QUEUE_BYTE_LIMIT :83886080 TOTAL_SERIAL_QUEUE_THROTTLE :31457280 SERIAL_QUEUE_SIZE_LIMIT :20000 SERIAL_QUEUE_SIZE_THROTTLE :15000
2022-11-14 16:28:49,827 INFO ributed.internal.membership.gms.Services: 201 - Starting membership services
Nov 14, 2022 4:28:49 PM org.jgroups.protocols.UNICAST3 init
INFO: both the regular and OOB thread pools are disabled; UNICAST3 could be removed (JGRP-2069)
2022-11-14 16:28:49,886 INFO ributed.internal.membership.gms.Services: 575 - Established local address 10.99.199.19(ServerTwo:76212):41003
2022-11-14 16:28:49,887 INFO ributed.internal.membership.gms.Services: 406 - JGroups channel created (took 60ms)
2022-11-14 16:28:49,909 INFO istributed.internal.direct.DirectChannel: 148 - GemFire P2P Listener started on /10.99.199.19:56980
2022-11-14 16:28:49,911 INFO ributed.internal.membership.gms.Services: 714 - Started failure detection server thread on /10.99.199.19:59614.
2022-11-14 16:28:49,924 INFO ributed.internal.membership.gms.Services:1194 - received FindCoordinatorResponse(coordinator=10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000, fromView=true, viewId=2, registrants=[10.99.199.19(ServerTwo:76212):41003], senderId=10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000, network partition detection enabled=true, locators preferred as coordinators=true, view=View[10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000|2] members: [10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000, 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001, 10.99.199.19(ServerOne:76210)<v2>:41002{lead}]) from locator HostAndPort[localhost:11235]
2022-11-14 16:28:49,924 INFO ributed.internal.membership.gms.Services:1197 - Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join
...
..
.
2022-11-14 16:28:50,246 INFO ributed.internal.membership.gms.Services:1767 - Finished joining (took 334ms).
2022-11-14 16:28:50,247 INFO uted.internal.ClusterDistributionManager: 473 - Starting DistributionManager 10.99.199.19(ServerTwo:76212)<v3>:41003. (took 540 ms)
2022-11-14 16:28:50,247 INFO uted.internal.ClusterDistributionManager: 686 - Initial (distribution manager) view, View[10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000|3] members: [10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000, 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001, 10.99.199.19(ServerOne:76210)<v2>:41002{lead}, 10.99.199.19(ServerTwo:76212)<v3>:41003]
2022-11-14 16:28:50,274 INFO uted.internal.ClusterDistributionManager: 611 - Member 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 is equivalent or in the same redundancy zone.
2022-11-14 16:28:50,274 INFO uted.internal.ClusterDistributionManager: 611 - Member 10.99.199.19(ServerOne:76210)<v2>:41002 is equivalent or in the same redundancy zone.
2022-11-14 16:28:50,274 INFO uted.internal.ClusterDistributionManager: 611 - Member 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001 is equivalent or in the same redundancy zone.
2022-11-14 16:28:50,276 INFO uted.internal.ClusterDistributionManager: 389 - DistributionManager 10.99.199.19(ServerTwo:76212)<v3>:41003 started on localhost[12480],localhost[11235]. There were 3 other DMs. others: [10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001, 10.99.199.19(ServerOne:76210)<v2>:41002, 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000] (took 593 ms)
...
..
.
2022-11-14 16:28:51,212 INFO e.geode.internal.cache.DistributedRegion:1637 - Initialization of region PdxTypes completed
2022-11-14 16:28:51,222 INFO gframework.data.gemfire.CacheFactoryBean: 281 - Created new Apache Geode version [1.14.4] Cache [ServerTwo]
2022-11-14 16:28:51,223 INFO gframework.data.gemfire.CacheFactoryBean: 281 - Connected to Distributed System [ServerTwo] as Member [10.99.199.19(ServerTwo:76212)<v3>:41003] in Group(s) [[]] with Role(s) [[]] on Host [10.99.199.19] having PID [76212]
2022-11-14 16:28:51,236 INFO data.gemfire.ReplicatedRegionFactoryBean: 281 - Creating Region [ExampleReplicateRegion] in Cache [ServerTwo]
2022-11-14 16:28:51,242 INFO data.gemfire.ReplicatedRegionFactoryBean: 281 - Created Region [ExampleReplicateRegion]
2022-11-14 16:28:51,245 WARN e.geode.internal.cache.DistributedRegion: 217 - Region ExampleReplicateRegion is being created with scope DISTRIBUTED_NO_ACK but enable-network-partition-detection is enabled in the distributed system. This can lead to cache inconsistencies if there is a network failure.
2022-11-14 16:28:51,245 INFO e.geode.internal.cache.DistributedRegion:1135 - Initializing region ExampleReplicateRegion
2022-11-14 16:28:51,248 INFO ode.internal.cache.InitialImageOperation: 514 - Region ExampleReplicateRegion requesting initial image from 10.99.199.19(ServerOne:76210)<v2>:41002
2022-11-14 16:28:51,249 INFO ode.internal.cache.InitialImageOperation: 586 - ExampleReplicateRegion is done getting image from 10.99.199.19(ServerOne:76210)<v2>:41002. isDeltaGII is false
2022-11-14 16:28:51,249 INFO e.geode.internal.cache.DistributedRegion:1637 - Initialization of region ExampleReplicateRegion completed
Peer Cache instance [ServerTwo] running and connected to cluster [localhost[11235],localhost[12480]]
Press <enter> to stop...
NOTE: Output is from the first server I started; output from the second server should be similar.
After the servers successfully start, we can list members in Gfsh to see the completed Apache Geode cluster:
gfsh>list members
Member Count : 4
Name | Id
---------- | ------------------------------------------------------------------
LocatorOne | 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 [Coordinator]
LocatorTwo | 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001
ServerOne | 10.99.199.19(ServerOne:76210)<v2>:41002
ServerTwo | 10.99.199.19(ServerTwo:76212)<v3>:41003
Additionally, we can list regions in our cluster:
gfsh>list regions
List of regions
----------------------
ExampleReplicateRegion
The ExampleReplicateRegion was created with Spring configuration, using a Region bean definition, when bootstrapping the server(s).
We can describe region as follows:
gfsh>describe region --name=/ExampleReplicateRegion
Name : ExampleReplicateRegion
Data Policy : replicate
Hosting Members : ServerTwo
ServerOne
Non-Default Attributes Shared By Hosting Members
Type | Name | Value
------ | ----------- | ---------
Region | size | 0
| data-policy | REPLICATE
We can clearly see that the "ExampleReplicateRegion" Region is a REPLICATE Region and that it is hosted on "ServerOne" and "ServerTwo".
So far so good!
Now, I am going to kill the "LocatorOne" process in a separate terminal to simulate a crash.
From the output when we started the Locators using Gfsh, we know "LocatorOne's" process ID (76101):
Locator in /Users/jblum/pivdev/lab/LocatorOne on 10.99.199.19[11235] as LocatorOne is currently online.
Process ID: 76101
Uptime: 10 seconds
...
So, I can run:
$ kill -9 76101
You should see output similar to the following (in either Spring peer Cache server's logs):
...
2022-11-14 16:40:51,639 INFO ributed.internal.membership.gms.Services:1273 - Performing availability check for suspect member 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 reason=member unexpectedly shut down shared, unordered connection
2022-11-14 16:40:51,642 INFO rnal.tcpserver.AdvancedSocketCreatorImpl: 104 - Failed to connect to /10.99.199.19:50302
2022-11-14 16:40:51,643 INFO ributed.internal.membership.gms.Services:1314 - Availability check failed for member 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000
2022-11-14 16:40:51,646 INFO ributed.internal.membership.gms.Services:1196 - received suspect message from myself for 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000: member unexpectedly shut down shared, unordered connection
2022-11-14 16:40:51,646 INFO ributed.internal.membership.gms.Services:1196 - received suspect message from 10.99.199.19(ServerOne:76210)<v2>:41002 for 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000: member unexpectedly shut down shared, unordered connection
2022-11-14 16:40:51,646 INFO ributed.internal.membership.gms.Services:1196 - received suspect message from myself for 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000: failed availability check
2022-11-14 16:40:51,647 INFO ributed.internal.membership.gms.Services:1196 - received suspect message from 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001 for 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000: member unexpectedly shut down shared, unordered connection
2022-11-14 16:40:55,274 WARN org.apache.geode.internal.tcp.TCPConduit: 771 - Attempting TCP/IP reconnect to 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000
2022-11-14 16:40:55,275 INFO org.apache.geode.internal.tcp.Connection:1051 - Connection: shared=false ordered=true failed to connect to peer 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 because: java.net.ConnectException: Connection refused
2022-11-14 16:40:56,707 INFO ributed.internal.membership.gms.Services:1500 - received new view: View[10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001|10] members: [10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001, 10.99.199.19(ServerOne:76210)<v2>:41002{lead}, 10.99.199.19(ServerTwo:76212)<v3>:41003] crashed: [10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000]
old view is: View[10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000|3] members: [10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000, 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001, 10.99.199.19(ServerOne:76210)<v2>:41002{lead}, 10.99.199.19(ServerTwo:76212)<v3>:41003]
2022-11-14 16:40:56,710 INFO uted.internal.ClusterDistributionManager:1907 - Member at 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 unexpectedly left the distributed cache: departed membership view
2022-11-14 16:40:57,280 INFO org.apache.geode.internal.tcp.TCPConduit: 852 - Ending reconnect attempt because 10.99.199.19(LocatorOne:76101:locator)<ec><v0>:41000 has disappeared.
...
Because I started "LocatorOne" first, it became the "Coordinator" in the cluster and also was the node that ran the Manager that Gfsh was connected to, so naturally, Gfsh got disconnected as well.
gfsh>
No longer connected to 10.99.199.19[1099].
gfsh>
No longer connected to 10.99.199.19[1099].
gfsh>
However, we can reconnect to "LocatorTwo", which I suspect became the new "Coordinator" for the remaining cluster.
gfsh>connect --locator=localhost[12480]
Connecting to Locator at [host=localhost, port=12480] ..
Connecting to Manager at [host=10.99.199.19, port=1099] ..
Successfully connected to: [host=10.99.199.19, port=1099]
You are connected to a cluster of version: 1.14.4
And, as I suspected, all the members ("LocatorTwo", started with Gfsh, along with the 2 servers that I configured and bootstrapped with Spring) remain:
gfsh>list members
Member Count : 3
Name | Id
---------- | ------------------------------------------------------------------
LocatorTwo | 10.99.199.19(LocatorTwo:76179:locator)<ec><v1>:41001 [Coordinator]
ServerOne | 10.99.199.19(ServerOne:76210)<v2>:41002
ServerTwo | 10.99.199.19(ServerTwo:76212)<v3>:41003
As does the Region and the servers hosting it:
gfsh>list regions
List of regions
----------------------
ExampleReplicateRegion
gfsh>describe region --name=/ExampleReplicateRegion
Name : ExampleReplicateRegion
Data Policy : replicate
Hosting Members : ServerTwo
ServerOne
Non-Default Attributes Shared By Hosting Members
Type | Name | Value
------ | ----------- | ---------
Region | size | 0
| data-policy | REPLICATE
So, all works as I would expect.
If there were a problem, then I would also suspect that this problem would exist with or without Spring; in other words, if the peer Cache application servers were simply configured and bootstrapped (coded directly) with Apache Geode's API, the issue would still occur.
Solid explanation as always and it all makes sense.
"I will be running everything on my localhost"
I think that is the problem. There were no issues when you ran both the Locators on the same machine.
Maybe you can try with different machines?
Currently, I do not have the resources and access to run this example on multiple machines. But, I would also expect it to work just the same. I can look more into this tomorrow.
If it does not work, then:
1) There may be a network issue
2) It might possibly be a bug in Apache Geode
3) Etc...
Spring (Data Geode) is not doing anything special in this case. It isn't opening Sockets or connecting to Apache Geode servers using custom code in SDG. It is simply passing configuration to the Apache Geode server instance during construction, using Apache Geode's API, and Apache Geode is doing all the work.
SDG, specifically with respect to configuration, is simply a facade around Apache Geode's API to apply Spring's conventions and robust programming model to configure Apache Geode in a Spring [Boot] context with ease.
For example, the following class is equivalent to what Spring Data Geode does, to an extent, but simply using Apache Geode's API, not Spring. So, you should see the same result in either case.
NOTE: In this case, using just the ApacheGeodeServerApplication class, I am using Apache Geode properties to configure the nodes on startup, namely the gemfire.name and gemfire.locators JVM System properties.
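The linked class is not reproduced here; a minimal sketch of the plain-API equivalent (assuming the Apache Geode jars on the classpath; the default values shown are illustrative) might be:

```java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;

// Sketch: bootstrapping a peer Cache with Apache Geode's API alone, no Spring.
// CacheFactory also picks up the gemfire.name / gemfire.locators System
// properties on its own; they are read explicitly here only to show defaults.
public class ApacheGeodeServerApplication {

    public static void main(String[] args) {
        Cache cache = new CacheFactory()
            .set("name", System.getProperty("gemfire.name", "ServerOne"))
            .set("locators", System.getProperty("gemfire.locators", "localhost[11235],localhost[12480]"))
            .create();

        // ... block until told to stop (see the block(..) discussion above), then:
        // cache.close();
    }
}
```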
Got it. Looks like you are hinting at creating a Geode ticket. And if so, I don't have issues.
Let me try to reproduce this outside of SDG.
Thank you.
I will see if someone on the GemFire team is available to run this example in a multi-host capacity tomorrow. In the meantime, I have dug up an old work laptop of mine to see if I can run this example across both machines.
Will report back when I know more, either way.
Ok, I have got my 2nd machine up and running.
Replaying the scenario...
I start "LocatorOne" on host 1 (10.99.199.19):
gfsh>start locator --name=LocatorOne --port=11235
Starting a Geode Locator in /Users/jblum/pivdev/lab/LocatorOne...
........
Locator in /Users/jblum/pivdev/lab/LocatorOne on 10.99.199.19[11235] as LocatorOne is currently online.
Process ID: 77993
Uptime: 10 seconds
Geode Version: 1.14.4
Java Version: 1.8.0_351
Log File: /Users/jblum/pivdev/lab/LocatorOne/LocatorOne.log
JVM Arguments: -Dgemfire.enable-cluster-configuration=true -Dgemfire.load-cluster-configuration-from-dir=false -Dgemfire.launcher.registerSignalHandlers=true -Djava.awt.headless=true -Dsun.rmi.dgc.server.gcInterval=9223372036854775806
Class-Path: /Users/jblum/pivdev/apache-geode-1.14.4/lib/geode-core-1.14.4.jar:/Users/jblum/pivdev/apache-geode-1.14.4/lib/geode-dependencies.jar
Successfully connected to: JMX Manager [host=10.99.199.19, port=1099]
Cluster configuration service is up and running.
I then start my second Locator on host 2 (10.99.199.20) with the following Gfsh start locator command (output is similar):
gfsh> start locator --name=LocatorTwo --port=12480 --locators=10.99.199.19[11235]
...
I then list members (from both hosts; output looks similar):
gfsh>list members
Member Count : 2
Name | Id
---------- | ------------------------------------------------------------------
LocatorOne | 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000 [Coordinator]
LocatorTwo | 10.99.199.20(LocatorTwo:1893:locator)<ec><v1>:41000
NOTE: Notice the IP addresses.
I then restart my Spring configured/bootstrapped servers.
Once started successfully, I again list members (from both hosts; output is similar):
gfsh>list members
Member Count : 4
Name | Id
---------- | ------------------------------------------------------------------
LocatorOne | 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000 [Coordinator]
LocatorTwo | 10.99.199.20(LocatorTwo:1893:locator)<ec><v1>:41000
ServerOne | 10.99.199.19(ServerOne:78107)<v2>:41001
ServerTwo | 10.99.199.19(ServerTwo:78119)<v3>:41002
NOTE: I simply run both Spring configured/bootstrapped server instances on the same host, with "LocatorOne". Again, notice the IP addresses.
I then list regions and describe region (from both hosts; output looks similar):
gfsh>list regions
List of regions
----------------------
ExampleReplicateRegion
gfsh>describe region --name=/ExampleReplicateRegion
Name : ExampleReplicateRegion
Data Policy : replicate
Hosting Members : ServerTwo
ServerOne
Non-Default Attributes Shared By Hosting Members
Type | Name | Value
------ | ----------- | ---------
Region | size | 0
| data-policy | REPLICATE
Now, I kill "LocatorOne" on host 1 (10.99.199.19):
$ kill -9 77993
Once again, the logs from the servers look the same:
Press <enter> to stop...
2022-11-14 18:32:31,039 INFO ributed.internal.membership.gms.Services:1273 - Performing availability check for suspect member 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000 reason=member unexpectedly shut down shared, unordered connection
2022-11-14 18:32:31,042 INFO rnal.tcpserver.AdvancedSocketCreatorImpl: 104 - Failed to connect to /10.99.199.19:47132
2022-11-14 18:32:31,042 INFO ributed.internal.membership.gms.Services:1314 - Availability check failed for member 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000
2022-11-14 18:32:31,045 INFO ributed.internal.membership.gms.Services:1196 - received suspect message from myself for 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000: member unexpectedly shut down shared, unordered connection
2022-11-14 18:32:31,045 INFO ributed.internal.membership.gms.Services:1196 - received suspect message from 10.99.199.19(ServerOne:78107)<v2>:41001 for 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000: member unexpectedly shut down shared, unordered connection
2022-11-14 18:32:31,045 INFO ributed.internal.membership.gms.Services:1196 - received suspect message from myself for 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000: failed availability check
2022-11-14 18:32:31,329 INFO ributed.internal.membership.gms.Services:1196 - received suspect message from 10.99.199.20(LocatorTwo:1893:locator)<ec><v1>:41000 for 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000: member unexpectedly shut down shared, unordered connection
2022-11-14 18:32:34,156 WARN org.apache.geode.internal.tcp.TCPConduit: 771 - Attempting TCP/IP reconnect to 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000
2022-11-14 18:32:34,158 INFO org.apache.geode.internal.tcp.Connection:1051 - Connection: shared=false ordered=true failed to connect to peer 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000 because: java.net.ConnectException: Connection refused
2022-11-14 18:32:36,160 WARN org.apache.geode.internal.tcp.Connection:1028 - Connection: Attempting reconnect to peer 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000
2022-11-14 18:32:36,345 INFO ributed.internal.membership.gms.Services:1500 - received new view: View[10.99.199.20(LocatorTwo:1893:locator)<ec><v1>:41000|10] members: [10.99.199.20(LocatorTwo:1893:locator)<ec><v1>:41000, 10.99.199.19(ServerOne:78107)<v2>:41001{lead}, 10.99.199.19(ServerTwo:78119)<v3>:41002] crashed: [10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000]
old view is: View[10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000|3] members: [10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000, 10.99.199.20(LocatorTwo:1893:locator)<ec><v1>:41000, 10.99.199.19(ServerOne:78107)<v2>:41001{lead}, 10.99.199.19(ServerTwo:78119)<v3>:41002]
2022-11-14 18:32:36,349 INFO uted.internal.ClusterDistributionManager:1907 - Member at 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000 unexpectedly left the distributed cache: departed membership view
2022-11-14 18:32:38,165 INFO org.apache.geode.internal.tcp.TCPConduit: 852 - Ending reconnect attempt because 10.99.199.19(LocatorOne:77993:locator)<ec><v0>:41000 has disappeared.
Gfsh was disconnected on host 1 (10.99.199.19):
gfsh>
No longer connected to 10.99.199.19[1099].
gfsh>
No longer connected to 10.99.199.19[1099].
gfsh>
However, I can reconnect to the cluster on host 2 (10.99.199.20) from host 1:
gfsh>connect --locator=10.99.199.20[12480]
Connecting to Locator at [host=10.99.199.20, port=12480] ..
Connecting to Manager at [host=10.99.199.20, port=1099] ..
Successfully connected to: [host=10.99.199.20, port=1099]
You are connected to a cluster of version: 1.14.4
And, everything seems intact:
gfsh>list members
Member Count : 3
Name | Id
---------- | -----------------------------------------------------------------
LocatorTwo | 10.99.199.20(LocatorTwo:1893:locator)<ec><v1>:41000 [Coordinator]
ServerOne | 10.99.199.19(ServerOne:78107)<v2>:41001
ServerTwo | 10.99.199.19(ServerTwo:78119)<v3>:41002
gfsh>
gfsh>
gfsh>list regions
List of regions
----------------------
ExampleReplicateRegion
gfsh>
gfsh>
gfsh>describe region --name=/ExampleReplicateRegion
Name : ExampleReplicateRegion
Data Policy : replicate
Hosting Members : ServerTwo
ServerOne
Non-Default Attributes Shared By Hosting Members
Type | Name | Value
------ | ----------- | ---------
Region | size | 0
| data-policy | REPLICATE
This output is the same on host 2 (10.99.199.20) running "LocatorTwo".
So, it would appear to work on my end.
For complete coverage, I also tested this scenario with pure Apache Geode API configured/bootstrapped servers.
It all worked as expected.
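For reference, the "pure Apache Geode API" bootstrapping mentioned above can be sketched with Geode's `LocatorLauncher`. This is a minimal sketch, not the exact code used in the test; the member name, hostnames, and ports are assumptions mirroring the Spring examples earlier in this thread:

```java
import org.apache.geode.distributed.LocatorLauncher;

public class PlainApacheGeodeLocatorApplication {

    public static void main(String[] args) {

        // LocatorOne on hostOne, pointing at LocatorTwo on hostTwo.
        // Hostnames and ports are assumptions matching the earlier examples.
        LocatorLauncher locatorLauncher = new LocatorLauncher.Builder()
            .setMemberName("LocatorOne")
            .setPort(11235)
            .set("locators", "hostTwo[12480]")
            .build();

        locatorLauncher.start();
    }
}
```

The key point is the same as in the Spring configuration: each Locator's `locators` property must reference the other Locator so they form a single cluster.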
You are creating both locators from the same machine. Maybe that is what is missing in my case. I'm creating them separately. I'm going to give this a shot.
Wrong. I see you started the second locator on a different machine and gave it details about the first.
And it worked this time!! My understanding was wrong.
Thanks.
Hi, I have two standalone locators (let's call them L1 and L2) running on two different machines. I have two instances of my app that connect to L1 on startup. But if I bring down L1, I expect them to show up under L2. That doesn't seem to happen.
Is that how it's supposed to work, or am I misunderstanding this?
Env Details
- Two standalone locators running on two different machines, each started with the `locators` config.

How to reproduce?
1. Start two locators using `gfsh`. Let's call them L1 and L2 respectively.
2. Configure each locator with the other in its `locators` config.

Expectation
- If L1 goes down, members that connected through L1 should show up under L2.

None of those expectations are met.
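For concreteness, the two-locator setup described here can be started with `gfsh` roughly as follows (the hostnames and ports are assumptions, mirroring the earlier examples in this thread):

```shell
# On machine one: start L1, pointing at L2
# (hostnames and ports are assumptions)
gfsh -e "start locator --name=L1 --port=11235 --locators=hostTwo[12480]"

# On machine two: start L2, pointing back at L1
gfsh -e "start locator --name=L2 --port=12480 --locators=hostOne[11235]"
```

Each Locator lists the other in `--locators`, which is what lets them discover each other and form a single distributed system.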
One interesting note, if I start another peer, it shows up in L2 (because L1 is down) but the regions are NOT replicated. So at this point, the last peer (and peers started after) has its own replicated region. So I end up having two disjoint replicated regions which is problematic.
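A likely explanation for the disjoint replicated regions is that the peers were only configured with L1: once L1 is down, a newly started peer that only knows about L2 joins a separate distributed system. A hedged sketch of a server-side peer configured with both Locators, so it can join through either one (the name, hostnames, and ports are assumptions following the earlier examples):

```java
import org.springframework.data.gemfire.config.annotation.CacheServerApplication;

// Listing both Locators means this peer can join the cluster through
// either one, so losing L1 does not orphan newly started peers.
// Hostnames and ports are assumptions mirroring the examples above.
@CacheServerApplication(name = "ServerOne", locators = "hostOne[11235],hostTwo[12480]")
class MySpringConfiguredApacheGeodeServerApplication {
    // ...
}
```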