gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Georeplication : unpredictable mapping between primary nodes and secondary nodes #4300

Closed FSangouard closed 3 months ago

FSangouard commented 3 months ago

Description of problem:

When using georeplication between two clusters, one would expect that the mapping between nodes of the primary and the secondary could be deduced from the list of bricks for the replicated volume on each cluster. For example, I have a 3-node cluster A and a 3-node cluster B, and a replicated volume with one brick per node in each cluster. If the list of bricks for the volume in cluster A goes like this:

node1A
node2A
node3A

and the list in cluster B goes like this:

node1B
node2B
node3B

I would expect the georeplication session to open connections between the nodes like this:

node1A > node1B
node2A > node2B
node3A > node3B

However, that is not guaranteed, because in monitor.py a set is created from the list of secondary bricks, which may change the order of the items. According to my tests the order is not random: it only changed when I recreated the secondary cluster and remained the same across restarts of the georeplication session, so I think it is based on some hash of the values. That makes it hard to predict and, above all, not controllable by the user, since the values contain UUIDs generated during volume creation.
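
To make this concrete, here is a minimal standalone Python sketch (not the actual monitor.py code; the host names and UUIDs are placeholder values) showing that building a set from a list of (host, uuid) tuples makes the iteration order depend on the tuples' hashes rather than on the original brick order:

# Illustration only, not monitor.py: placeholder (host, uuid) pairs.
secondary_bricks = [
    ("node1B", "uuid-aaaa"),
    ("node2B", "uuid-bbbb"),
    ("node3B", "uuid-cccc"),
]

as_list = list(secondary_bricks)   # keeps the volume-info order
as_set = set(secondary_bricks)     # iteration order now depends on hashing

print(as_list)
print(list(as_set))                # may come out in a different, hash-dependent order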

The exact command to reproduce the issue: No single command in particular; just create a georeplication session as per the documentation and you should observe this. If the mapping happens to match, recreate the volume to reroll the hashes until you see it.

The full output of the command that failed: N/A

Expected results: The mapping between nodes matches what you get by putting the lists of bricks on both clusters, as returned by the volume info command, side by side.

Mandatory info:

- The output of the gluster volume info command:

Volume Name: test-georeplication
Type: Replicate
Volume ID: d1c484b0-c54f-424d-a47f-9659943d4aac
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: node1A:/applis/apsu/data/glusterfs/test-georeplication/brick1/brick
Brick2: node2A:/applis/apsu/data/glusterfs/test-georeplication/brick1/brick
Brick3: node3A:/applis/apsu/data/glusterfs/test-georeplication/brick1/brick
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
cluster.granular-entry-heal: on
cluster.data-self-heal: on
cluster.quorum-type: none
cluster.entry-self-heal: on
storage.owner-uid: 2000
cluster.metadata-self-heal: on
storage.owner-gid: 2000

- The output of the gluster volume status command:

Status of volume: test-georeplication
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick node1A:/applis/apsu/data/glusterfs/test-georeplication/brick1/brick    49153     0          Y       19755
Brick node2A:/applis/apsu/data/glusterfs/test-georeplication/brick1/brick    49153     0          Y       16988
Brick node3A:/applis/apsu/data/glusterfs/test-georeplication/brick1/brick    49153     0          Y       8048
Self-heal Daemon on localhost               N/A       N/A        Y       21433
Self-heal Daemon on node2A                 N/A       N/A        Y       16999
Self-heal Daemon on node3A                 N/A       N/A        Y       9408

Task Status of Volume test-georeplication
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal info command:

Brick node1A:/applis/apsu/data/glusterfs/test-georeplication/brick1/brick
Status: Connected
Number of entries: 0

Brick node2A:/applis/apsu/data/glusterfs/test-georeplication/brick1/brick
Status: Connected
Number of entries: 0

Brick node3A:/applis/apsu/data/glusterfs/test-georeplication/brick1/brick
Status: Connected
Number of entries: 0

- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/

I added some debugging statements in monitor.py to diagnose the problem; here's an excerpt showing what happens:

[2024-01-26 14:41:48.170089] D [monitor(monitor):304:distribute] : master bricks: [{'host': 'node1B', 'uuid': '00295709-488f-429c-bea3-bc6abaaf3c4e', 'dir': '/applis/apsu/data/glusterfs/test-georeplication/brick1/brick'}, {'host': 'node2B', 'uuid': 'd3f15caa-6f14-4934-86e2-48528d42db43', 'dir': '/applis/apsu/data/glusterfs/test-georeplication/brick1/brick'}, {'host': 'node3B', 'uuid': '8dcc448c-ad81-4285-8e50-c1a3e689f830', 'dir': '/applis/apsu/data/glusterfs/test-georeplication/brick1/brick'}]
[2024-01-26 14:41:48.170439] D [monitor(monitor):314:distribute] : slave SSH gateway: georep@node1A
[2024-01-26 14:41:48.437256] D [monitor(monitor):334:distribute] : slave bricks: [{'host': 'node1A', 'uuid': '27aa6ae7-b0dd-4636-8dad-e5dba7a58342', 'dir': '/applis/apsu/data/glusterfs/test-georeplication/brick1/brick'}, {'host': 'node2A', 'uuid': '9a0b50db-df17-4cd3-b765-9e4f378ab155', 'dir': '/applis/apsu/data/glusterfs/test-georeplication/brick1/brick'}, {'host': 'node3A', 'uuid': 'ec8d1f05-3a8b-4b86-9f07-99c570ffe07f', 'dir': '/applis/apsu/data/glusterfs/test-georeplication/brick1/brick'}]
[2024-01-26 14:41:48.437767] D [monitor(monitor):340:distribute] : slavenodes: set([('node3A', 'ec8d1f05-3a8b-4b86-9f07-99c570ffe07f'), ('node2A', '9a0b50db-df17-4cd3-b765-9e4f378ab155'), ('node1A', '27aa6ae7-b0dd-4636-8dad-e5dba7a58342')])
[2024-01-26 14:41:48.437854] D [monitor(monitor):342:distribute] : slaves: [('georep@node3A', 'ec8d1f05-3a8b-4b86-9f07-99c570ffe07f'), ('georep@node2A', '9a0b50db-df17-4cd3-b765-9e4f378ab155'), ('georep@node1A', '27aa6ae7-b0dd-4636-8dad-e5dba7a58342')] len(slaves): 3
[2024-01-26 14:41:48.437916] D [monitor(monitor):346:distribute] : idx: 0 brick: {'host': 'node1B', 'uuid': '00295709-488f-429c-bea3-bc6abaaf3c4e', 'dir': '/applis/apsu/data/glusterfs/test-georeplication/brick1/brick'}
[2024-01-26 14:41:48.437973] D [syncdutils(monitor):945:is_hot] Volinfo: brickpath: 'node1B:/applis/apsu/data/glusterfs/test-georeplication/brick1/brick'
[2024-01-26 14:41:48.438107] D [monitor(monitor):349:distribute] : ('georep@node3A', 'ec8d1f05-3a8b-4b86-9f07-99c570ffe07f')
[2024-01-26 14:41:48.438385] D [monitor(monitor):346:distribute] : idx: 1 brick: {'host': 'node2B', 'uuid': 'd3f15caa-6f14-4934-86e2-48528d42db43', 'dir': '/applis/apsu/data/glusterfs/test-georeplication/brick1/brick'}
[2024-01-26 14:41:48.438443] D [monitor(monitor):346:distribute] : idx: 2 brick: {'host': 'node3B', 'uuid': '8dcc448c-ad81-4285-8e50-c1a3e689f830', 'dir': '/applis/apsu/data/glusterfs/test-georeplication/brick1/brick'}
[2024-01-26 14:41:48.438497] D [monitor(monitor):354:distribute] : worker specs: [({'host': 'node1B', 'uuid': '00295709-488f-429c-bea3-bc6abaaf3c4e', 'dir': '/applis/apsu/data/glusterfs/test-georeplication/brick1/brick'}, ('georep@node3A', 'ec8d1f05-3a8b-4b86-9f07-99c570ffe07f'), '1', False)]
[2024-01-26 14:41:48.441704] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change [{status=Initializing...}]

- Is there any crash? Provide the backtrace and coredump: N/A

Additional info:

I tested replacing the set constructor with the list constructor in monitor.py and I got the expected results, so I think the fix could be quite simple, but maybe there are side effects I do not know about.
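
For reference, a minimal sketch of the kind of change described above (an illustration of the idea, not an actual monitor.py patch): the deduplication that set() provides can be kept while preserving the volume-info order, for example with dict.fromkeys:

# Sketch of the idea only, not the real monitor.py code: deduplicate the
# secondary (host, uuid) pairs while keeping their original order.
secondary_bricks = [
    ("node1B", "uuid-aaaa"),
    ("node2B", "uuid-bbbb"),
    ("node3B", "uuid-cccc"),
]

# dict preserves insertion order (Python 3.7+), so duplicates are dropped
# without reshuffling the bricks the way set() does.
unique_in_order = list(dict.fromkeys(secondary_bricks))
print(unique_in_order)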

The operating system / glusterfs version:

GlusterFS 9.4 (but the affected code is still there in devel)
CentOS 7.4

aravindavk commented 3 months ago

It is not required for geo-replication to connect to the secondary nodes in the same order. It can connect to any node in the secondary volume and sync. The primary volume type and the secondary volume type need not be the same: you can create the primary volume as Replica 3 and the secondary volume as Arbiter or Distributed-Replicate with more, smaller bricks.

Change detection happens on the primary nodes and syncing always happens through the Gluster mount, so which secondary node a worker connects to doesn't matter.

The georep monitor process checks whether any connection has failed; if so, it tries to connect to another available secondary node to continue syncing.
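
Conceptually (an illustrative sketch only, not the actual gsyncd/monitor code), the behaviour described above means a worker does not need a fixed peer: it just needs some reachable secondary node, and it falls back to another one when a connection fails:

# Illustrative sketch, not gsyncd code: pick any reachable secondary node.
def pick_secondary(candidates, is_reachable):
    """Return the first reachable (host, uuid) pair, trying others on failure."""
    for host, uuid in candidates:
        if is_reachable(host):
            return host, uuid
    raise RuntimeError("no reachable secondary node")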

FSangouard commented 3 months ago

Just to be sure I understand correctly: if, for example, node1A is connected to node1B and then node1B fails, will node1A try to connect to node2B or node3B?

I thought each node remained connected to a single node, that only one worker was active at a time, and that if the active worker couldn't sync, another worker would become active in its place.

aravindavk commented 3 months ago

Yes. Only one worker among the replica bricks will be Active and the other two will be Passive, since all the primary bricks have the same data. If the Active worker goes down, then one of the Passive workers will become Active.

If a worker is Active and failing to sync, check the respective worker's log file to see if there are any errors.

FSangouard commented 3 months ago

OK, thank you for the clarification!