gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

Problems with Geo-Replication and GlusterFS 10.4 #4265

Closed: PYZR closed this issue 2 months ago

PYZR commented 7 months ago

The operating system / glusterfs version: Ubuntu 20.04.5 LTS / glusterfs 10.4

Description of problem: I have the following setup:

+----------------------+
| [GlusterFS Server#1] |192.168.11.10
|       gluster1       +--------------+
|       storage1       |              |
+----------------------+              |
                                      |
+----------------------+              |             +----------------------+                            +----------------------+
| [GlusterFS Server#2] |192.168.11.11 | 192.168.11.1|                      |192.168.12.1   192.168.12.10| [GlusterFS Server#3] |
|       gluster2       +--------------+-------------+       gateway        +-------------+--------------+       gluster7       |
|       storage2       |              |             |                      |                            |       storage7       |
+----------------------+              |             +----------------------+                            +----------------------+
                                      |
+----------------------+              |
| [GlusterFS Server#3] |192.168.11.12 |
|       gluster3       +--------------+
|       storage3       |
+----------------------+

I have a volume called dev, which is replicated across gluster1, gluster2, and gluster3, and a volume called rep on gluster7.

Everything is configured so that geo-replication should work without any problems.

If I now add the following to the file /etc/glusterfs/glusterd.vol on each node:

option transport.socket.bind-address <node-ip>
option transport.tcp.bind-address <node-ip>
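
For reference, with these options set the management section of /etc/glusterfs/glusterd.vol on gluster1 (192.168.11.10) looks roughly like this (abridged; only the two bind-address lines are additions):

volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    [...]
    option transport.socket.bind-address 192.168.11.10
    option transport.tcp.bind-address 192.168.11.10
end-volume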

Now the volumes are only reachable via this IP, which means that for local FUSE mounts I have to use the corresponding <node-ip> instead of localhost to connect to the volumes. Right?
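
For example, a local FUSE mount of dev on gluster1 would then presumably have to look something like this (/mnt/dev is just an example mount point):

# works, because glusterd answers on the bind address
mount -t glusterfs 192.168.11.10:/dev /mnt/dev

# presumably no longer works, since glusterd no longer listens on 127.0.0.1
# mount -t glusterfs localhost:/dev /mnt/dev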

Now I try to create a geo-replication:

root@gluster1:~# gluster vol geo-replication dev storage7::rep create push-pem
Unable to mount and fetch primary volume details. Please check the log: /var/log/glusterfs/geo-replication/gverify-primarymnt.log
geo-replication command failed

Since that didn't work, I took a look at gverify-primarymnt.log:

[2023-11-07 09:18:36.045629 +0000] I [MSGID: 100030] [glusterfsd.c:2767:main] 0-glusterfs: Started running version [{arg=glusterfs}, {version=10.4}, {cmdlinestr=glusterfs -s localhost --xlator-option=*dht.lookup-unhashed=off --volfile-id dev -l /var/log/glusterfs/geo-replication/gverify-primarymnt.log /tmp/gverify.sh.keg6Z3}] 
[2023-11-07 09:18:36.049175 +0000] I [glusterfsd.c:2447:daemonize] 0-glusterfs: Pid of current running process is 65262
[2023-11-07 09:18:36.061073 +0000] I [MSGID: 101190] [event-epoll.c:667:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}] 
[2023-11-07 09:18:36.061230 +0000] I [MSGID: 101190] [event-epoll.c:667:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=1}] 
[2023-11-07 09:18:36.061226 +0000] I [glusterfsd-mgmt.c:2673:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost
[2023-11-07 09:18:39.062354 +0000] I [glusterfsd-mgmt.c:2712:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile servers
[2023-11-07 09:18:39.062913 +0000] W [glusterfsd.c:1458:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libgfrpc.so.0(+0xfef5) [0x7ff83c21fef5] -->glusterfs(+0x153a3) [0x5612e992b3a3] -->glusterfs(cleanup_and_exit+0x58) [0x5612e991f658] ) 0-: received signum (1), shutting down 
[2023-11-07 09:18:39.063490 +0000] I [fuse-bridge.c:7065:fini] 0-fuse: Unmounting '/tmp/gverify.sh.keg6Z3'.
[2023-11-07 09:18:39.064312 +0000] I [fuse-bridge.c:7069:fini] 0-fuse: Closing fuse connection to '/tmp/gverify.sh.keg6Z3'.
[2023-11-07 09:18:39.064508 +0000] W [glusterfsd.c:1458:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7ff83c1d9609] -->glusterfs(glusterfs_sigwaiter+0xcd) [0x5612e991f7dd] -->glusterfs(cleanup_and_exit+0x58) [0x5612e991f658] ) 0-: received signum (15), shutting down

Two things jump right out at me:

{cmdlinestr=glusterfs -s localhost --xlator-option=*dht.lookup-unhashed=off --volfile-id dev -l /var/log/glusterfs/geo-replication/gverify-primarymnt.log /tmp/gverify.sh.keg6Z3}

and

[2023-11-07 09:18:36.061226 +0000] I [glusterfsd-mgmt.c:2673:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: localhost

Am I wrong or is it trying to establish a connection to the volume dev via localhost?
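
This can be checked by hand: ss -tlnp | grep 24007 should only show the bind address (not 127.0.0.1), and the mount that gverify.sh performs can be reproduced directly (/mnt/gverify-test is just an example mount point):

# fails like in the log above, since -s localhost points at 127.0.0.1
glusterfs -s localhost --volfile-id dev -l /tmp/gverify-test.log /mnt/gverify-test

# should work, since glusterd listens on the bind address
glusterfs -s 192.168.11.10 --volfile-id dev -l /tmp/gverify-test.log /mnt/gverify-test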

I can only think of two ways to change this:

The first way is to modify /etc/hosts:

#127.0.0.1 localhost
<node-ip>  localhost
[...]
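
If I go this route, getent should confirm that localhost now resolves to the node IP, e.g. on gluster1:

getent hosts localhost
# expected: 192.168.11.10   localhost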

Or the second, somewhat more complex, way:

Modify /usr/libexec/glusterfs/gverify.sh:

[...]
function primary_stats()
{
    [...]
    if [ "$inet6" = "inet6" ]; then
        glusterfs -s localhost --xlator-option="*dht.lookup-unhashed=off" --xlator-option="transport.address-family=inet6" --volfile-id $PRIMARYVOL -l $primary_log_file $d;
    else
        # Modifications
        get_ip="$(cat /etc/glusterfs/glusterd.vol |grep -P "^[^#].+?((25[0-5]|(2[0-4]|1\d|[1-9]|)\d)\.?\b){4}$" | awk '{print $3}' |uniq)"
        volfile_server=""

        if ! [ -z $get_ip ]; then
            if [ "$(echo $get_ip |wc -l)" -eq 1 ]; then
                volfile_server=$get_ip
            else
                volfile_server="localhost"
            fi
        else
            volfile_server="localhost"
        fi
        glusterfs -s $volfile_server --xlator-option="*dht.lookup-unhashed=off" --volfile-id $PRIMARYVOL -l $primary_log_file $d;
        # Modifications END
#       glusterfs -s localhost --xlator-option="*dht.lookup-unhashed=off" --volfile-id $PRIMARYVOL -l $primary_log_file $d;
    fi
    [...] 
}
[...]
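
The idea of the modification is just to pull the bind address out of glusterd.vol (the option lines have the form option <name> <value>, hence awk's $3) and pass it to glusterfs as the volfile server, falling back to localhost otherwise. A shorter, untested variant that only looks at the transport.socket.bind-address option would be:

volfile_server="$(awk '/^[[:space:]]*option transport.socket.bind-address/ {print $3}' /etc/glusterfs/glusterd.vol | head -n1)"
glusterfs -s "${volfile_server:-localhost}" --xlator-option="*dht.lookup-unhashed=off" --volfile-id $PRIMARYVOL -l $primary_log_file $d;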

If I now try to create the geo-replication session, it works:

root@gluster1:~# gluster vol geo-replication dev storage7::rep create push-pem
Creating geo-replication session between dev & storage7::rep has been successful

But when I start (or try to start) the geo-replication:

root@gluster1:/usr/libexec/glusterfs/python/syncdaemon# gluster volume geo-replication dev root@storage7::rep start
Starting geo-replication session between dev & storage7::rep has been successful

the status remains stuck at Created:

root@gluster1:~# gluster volume geo-replication dev root@storage7::rep status

PRIMARY NODE    PRIMARY VOL    PRIMARY BRICK         SECONDARY USER    SECONDARY        SECONDARY NODE    STATUS     CRAWL STATUS    LAST_SYNCED          
-----------------------------------------------------------------------------------------------------------------------------------------------
storage1        dev            /storage/brick/dev    root              storage7::rep    N/A               Created    N/A             N/A                  
storage3        dev            /storage/brick/dev    root              storage7::rep    N/A               Created    N/A             N/A                  
storage2        dev            /storage/brick/dev    root              storage7::rep    N/A               Created    N/A             N/A 

If I now take a look at /var/log/glusterfs/geo-replication/dev_storage7_rep/gsyncd.log I see the following:

[2023-11-08 13:04:18.74454] E [syncdutils(monitor):845:errlog] Popen: command returned error [{cmd=/usr/sbin/gluster --xml --remote-host=localhost volume info dev}, {error=1}]
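
This looks like the same localhost problem again: gsyncd queries glusterd via gluster --xml --remote-host=localhost. Running the command by hand should confirm it (192.168.11.10 again being gluster1's bind address):

# fails, as in gsyncd.log
/usr/sbin/gluster --xml --remote-host=localhost volume info dev

# should succeed, since glusterd listens on the bind address
/usr/sbin/gluster --xml --remote-host=192.168.11.10 volume info dev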

It may have something to do with /usr/libexec/glusterfs/python/syncdaemon/subcmds.py.

But at this point the question arises: is there really no other way to get geo-replication working with transport.socket.bind-address and transport.tcp.bind-address set than messing around in the code?