gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0

qemu-img create via libgfapi is slow #4320

Open resposit opened 1 month ago

resposit commented 1 month ago

Description of problem:

Creating a raw/qcow2 image on a Gluster volume via libgfapi appears to be very slow compared to FUSE access.

# time qemu-img create gluster://10.10.81.16/vstor/10G-gfapi 10G
Formatting 'gluster://10.10.81.16/vstor/10G-gfapi', fmt=raw size=10737418240
[2024-03-20 09:50:35.293625 +0000] I [io-stats.c:3784:ios_sample_buf_size_configure] 0-vstor: Configure ios_sample_buf  size is 1024 because ios_sample_interval is 0
[2024-03-20 09:50:47.299242 +0000] I [io-stats.c:4190:fini] 0-vstor: io-stats translator unloaded

real    0m13.046s
user    0m0.018s
sys     0m0.028s

If I do the same operation directly on the FUSE mount point, it is almost instantaneous:

# mount | grep glust
cloud15-gl.na.infn.it:vstor on /mnt/pve/vstor type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

root@cloud15:/mnt/pve/vstor# time qemu-img create ./10G 10G
Formatting './10G', fmt=raw size=10737418240

real    0m0.033s
user    0m0.004s
sys     0m0.022s

Showing both files:

root@cloud15:/mnt/pve/vstor# ls -l 10G*
-rw-r--r-- 1 root root 10737418240 Mar 20 10:55 10G
-rw------- 1 root root 10737418240 Mar 20 10:30 10G-gfapi

Mandatory info:

- The output of the gluster volume info command:

Volume Name: vstor
Type: Disperse
Volume ID: c3d389ed-3d51-4484-bdfd-596d2629a0a1
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: cloud15-gl:/pool6/vstor/brick
Brick2: cloud16-gl:/pool6/vstor/brick
Brick3: cloud17-gl:/pool6/vstor/brick
Options Reconfigured:
features.shard-block-size: 256MB
transport.address-family: inet
storage.fips-mode-rchecksum: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: enable
features.shard: on
user.cifs: off
client.event-threads: 4
server.event-threads: 4
performance.client-io-threads: on
cluster.lookup-optimize: off

- The output of the gluster volume status command:

Status of volume: vstor
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick cloud15-gl:/pool6/vstor/brick         59554     0          Y       3111
Brick cloud16-gl:/pool6/vstor/brick         52485     0          Y       3104
Brick cloud17-gl:/pool6/vstor/brick         52170     0          Y       3116
Self-heal Daemon on localhost               N/A       N/A        Y       3165
Self-heal Daemon on cloud17-gl.na.infn.it   N/A       N/A        Y       3151
Self-heal Daemon on cloud16-gl.na.infn.it   N/A       N/A        Y       3142

Task Status of Volume vstor
------------------------------------------------------------------------------
There are no active volume tasks

Additional info:

- The operating system / glusterfs version: Debian 12.5 Glusterfs 11.1

Bricks are on ZFS file systems:

root@cloud15:/mnt/pve/vstor# zpool status
  pool: pool6
 state: ONLINE
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Mon Mar 18 11:42:02 2024
config:

        NAME                        STATE     READ WRITE CKSUM
        pool6                       ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            scsi-35000c500f64eeaf7  ONLINE       0     0     0
            scsi-35000c500f64f369b  ONLINE       0     0     0
            scsi-35000c500f65032fb  ONLINE       0     0     0
            scsi-35000c500c2254b77  ONLINE       0     0     0
            scsi-35000c500cf057b2b  ONLINE       0     0     0
        logs
          mirror-1                  ONLINE       0     0     0
            scsi-358ce38ee22c8e51d  ONLINE       0     0     0
            scsi-358ce38ee22c8e519  ONLINE       0     0     0

errors: No known data errors
aravindavk commented 1 month ago

The GFAPI-based command (qemu-img create gluster://...) mounts the volume and unmounts it in addition to creating the file. On the FUSE mount those two steps are skipped and only the file creation is done.
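For reference, a single qemu-img create gluster://... invocation goes through roughly these public gfapi calls (a minimal sketch, not qemu's actual code; error handling and the truncate to the 10G size are omitted, and the host/volume/path are the ones from this report):

/* Build: gcc create_sketch.c -o create_sketch $(pkg-config --cflags --libs glusterfs-api) */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    glfs_t *fs = glfs_new("vstor");                  /* volume name */
    if (!fs)
        return EXIT_FAILURE;

    glfs_set_volfile_server(fs, "tcp", "10.10.81.16", 24007);

    if (glfs_init(fs) != 0) {                        /* the "mount": fetch the volfile, connect to bricks */
        fprintf(stderr, "glfs_init failed\n");
        return EXIT_FAILURE;
    }

    glfs_fd_t *fd = glfs_creat(fs, "/10G-gfapi", O_WRONLY, 0600);
    if (fd)
        glfs_close(fd);                              /* the actual file creation, which is fast */

    glfs_fini(fs);                                   /* the "unmount": teardown of all connections */
    return EXIT_SUCCESS;
}

The FUSE mount, by contrast, is already established and stays up, so only the create itself is timed there.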

Please share the gfapi logs so that we can confirm the timings for each step.

resposit commented 1 month ago

Thank you, I increased diagnostics.client-log-level to DEBUG. Here is the log: gluster-log.txt

mohit84 commented 1 month ago

Can you share the profile after executing a command like the one below?

gluster v profile <vol-name> start; gluster v profile <vol-name> info clear; time qemu-img create /mnt/pve/vstor/10G 10G;  gluster v profile <vol-name> info
resposit commented 1 month ago

Hello, yes, see the attached log: gfapi.txt fuse.txt

aravindavk commented 1 month ago

GFAPI doesn't help for stateless operations like create. qemu-img create has to do init (mount) and fini (umount) along with the actual file creation. The mount state is not maintained across invocations: if you run qemu-img create 10 times, it has to mount and unmount the volume 10 times.

When the image is booted (with qemu), it is a long-running process and the mount persists for as long as the image is running. During this time all I/O operations go through GFAPI, so it can be faster than FUSE.

The FUSE equivalent of the GFAPI flow above is:

# qemu_fuse_create.sh
mount -t glusterfs cloud15-gl.na.infn.it:vstor /mnt/pve/vstor
qemu-img create /mnt/pve/vstor/10G 10G
umount /mnt/pve/vstor
time bash qemu_fuse_create.sh
mohit84 commented 1 month ago

The actual image creation via gfapi is fast: the connection came up around 12:47:24.708074 and the shutdown started around 12:47:24.713059, so the total time taken is around 4985 microseconds, which is faster than the FUSE run. The slowness shows up at the application level because the shutdown process is slow; we have to check why it is taking so long. I will post an update if I find something.

[2024-03-20 12:47:24.708074 +0000] I [MSGID: 122062] [ec.c:335:ec_up] 0-vstor-disperse-0: Going UP : Child UP = 111 Child Notify = 111

[2024-03-20 12:47:24.713059 +0000] I [socket.c:835:__socket_shutdown] 0-gfapi: intentional socket shutdown(11)
resposit commented 1 month ago

@aravindavk I tried what you suggested:

root@cloud15:~# mkdir vstor-test
root@cloud15:~# cat qemu_fuse_create.sh
mount -t glusterfs cloud15-gl.na.infn.it:vstor ./vstor-test
qemu-img create ./vstor-test/10G 10G
umount ./vstor-test

Execution is almost instantaneous:

root@cloud15:~# time ./qemu_fuse_create.sh
Formatting './vstor-test/10G', fmt=raw size=10737418240

real    0m0.105s
user    0m0.058s
sys     0m0.037s
mohit84 commented 1 month ago

I think I have found the RCA for why qemu-img takes time during connection shutdown. On the first fop initiation we set a 10s call_bail timer and take a reference on the rpc object; the timer thread calls call_bail every 10s. During disconnect the client's job is to cancel the timer event and unref the rpc object, but after this patch (https://review.gluster.org/#/c/glusterfs/+/22087/) gf_timer_call_cancel returns -1 if cleanup has started. Because -1 is returned to the parent function rpc_clnt_connection_cleanup, the rpc object is not unref'd, and the connection has to wait until call_bail is triggered by the timer thread even though the operation has already finished completely. I think we have to change the return code to avoid this issue. The patch was implemented as part of the shd multiplex feature, and even though shd mux was later reverted, the associated patch was not.
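To make the effect concrete, here is a small self-contained toy model (plain C, not GlusterFS code; timer_cancel and connection_cleanup below only mimic the roles of gf_timer_call_cancel and rpc_clnt_connection_cleanup): when the cancel reports failure, the cleanup path skips its unref, so the last reference is only dropped when the timer thread eventually fires, which is what stretches the shutdown.

/* Toy model: a connection object whose last reference is held on behalf of
 * a periodic "bail" timer.  If cancelling the timer reports failure, the
 * cleanup path leaves the reference in place and teardown has to wait for
 * the timer thread to fire. */
#include <stdio.h>
#include <stdatomic.h>
#include <unistd.h>

struct conn {
    atomic_int ref;
};

static void conn_unref(struct conn *c, const char *who)
{
    if (atomic_fetch_sub(&c->ref, 1) == 1)
        printf("%s: last ref dropped, connection destroyed\n", who);
}

/* Mimics gf_timer_call_cancel(): currently it returns -1 once cleanup has
 * started; the proposed change is to return 0 in that case as well. */
static int timer_cancel(int cleanup_started)
{
    return cleanup_started ? -1 : 0;
}

/* Mimics rpc_clnt_connection_cleanup(): only unrefs on a successful cancel. */
static void connection_cleanup(struct conn *c, int cleanup_started)
{
    if (timer_cancel(cleanup_started) == 0)
        conn_unref(c, "cleanup");       /* fast path: freed immediately */
    else
        printf("cleanup: cancel failed, leaving the ref for the timer thread\n");
}

int main(void)
{
    struct conn c = { .ref = 1 };       /* ref taken when the bail timer was armed */

    connection_cleanup(&c, 1);          /* cleanup_started == true, as during shutdown */
    printf("... shutdown now has to wait for the next call_bail tick ...\n");
    sleep(1);                           /* stand-in for the 10s call_bail interval */
    conn_unref(&c, "timer thread");     /* the timer finally fires and drops the ref */
    return 0;
}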

@rafikc30 Do you see any issue if we change the return code to 0 when ctx is valid in gf_timer_call_cancel, so that the parent function can unref the rpc object without waiting for call_bail?