Additional explanation of the problem: although the file size and md5 of /mnt/data6.img do not match /root/data6.img, the shard base file on the brick (node1:/export/heketi/node_d5071/device_65fa5/data_953e3/data6.img) is identical to /root/data6.img. The mismatch seen through the mount is therefore caused by an incorrect update of the trusted.glusterfs.shard.file-size extended attribute. The open question is how to prevent that incorrect update of trusted.glusterfs.shard.file-size.
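For reference, the size metadata that shard maintains can be inspected directly on the brick; a minimal check, using the brick path from this report (the xattr value is a binary blob, hence the hex dump):

```
# On node1, read the shard size xattr straight from the brick, bypassing the mount.
getfattr -n trusted.glusterfs.shard.file-size -e hex \
    /export/heketi/node_d5071/device_65fa5/data_953e3/data6.img

# Compare with the actual on-disk size of the base file on the brick.
stat -c %s /export/heketi/node_d5071/device_65fa5/data_953e3/data6.img
```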
After verification, the problem can be avoided by disabling shard_inode_ctx_invalidate during readdirp. Looking through the commit history, shard_inode_ctx_invalidate was added in https://review.gluster.org/#/c/glusterfs/+/12400/ to solve a geo-rep-related problem, and geo-rep is not used in our project. Would disabling shard_inode_ctx_invalidate during readdirp introduce any new problems?
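To confirm geo-rep really is unused before touching that code path, the standard CLI query can be run (nothing here is specific to this issue):

```
# Lists geo-replication sessions for the volume; "No active geo-replication
# sessions" means the feature is not in use on this deployment.
gluster volume geo-replication data status
```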
After shard is enabled, the size of a copied file sometimes does not match the original file.
The volume configuration is as follows (output of `gluster volume info data`):

```
Volume Name: data
Type: Replicate
Volume ID: 02c625c8-a097-46fd-b913-76a53f286ff7
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: node1:/export/heketi/node_d5071/device_65fa5/data_953e3
Brick2: node3:/export/heketi/node_9d13b/device_dfd0a/data_770ff
Brick3: node2:/export/heketi/node_016de/device_56b70/data_762bc
Options Reconfigured:
performance.write-behind: on
diagnostics.brick-log-level: INFO
diagnostics.client-log-level: INFO
features.shard: on
features.shard-block-size: 1024MB
user.heketi.id: 85b4bccd7ffd0c6d97658cb5badbe3ae
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
client.event-threads: 1
```
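For anyone reproducing this, the shard-related options above can be applied to an existing replica-3 volume with the standard CLI (volume name taken from this report):

```
gluster volume set data features.shard on
gluster volume set data features.shard-block-size 1024MB
gluster volume set data performance.write-behind on
```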
Mount volume data on node1:

```
mount -t glusterfs node1:/data /mnt
```
Generate a data6.img file under /root:

```
dd if=/dev/zero of=/root/data6.img bs=128k count=11
```
Check its md5 value and file size:

```
[root@node1 ~]# md5sum /root/data6.img
2aabc019f6b5d881028999f055f5ff14  /root/data6.img
[root@node1 ~]# ls -l /root/data6.img
-rw-r--r-- 1 root root 1441792 Aug 14 14:19 /root/data6.img
```
Copy the data6.img file into /mnt:

```
cp /root/data6.img /mnt/
```
Check the md5 and file size of data6.img under /mnt; with a certain probability, they do not match the original file:

```
[root@node1 ~]# md5sum /mnt/data6.img
b98f319ebcfe36f416c0b7d9281f85ff  /mnt/data6.img
[root@node1 ~]# ls -l /mnt/data6.img
-rw-r--r-- 1 root root 2359296 Aug 14 14:19 /mnt/data6.img
```
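Because the failure is probabilistic, a loop like the following sketch can be used to repeat the copy-and-compare steps above until a mismatch shows up (file names as in this report):

```
#!/bin/sh
# Repeat the copy until the checksum read back through the mount diverges.
orig_md5=$(md5sum /root/data6.img | awk '{print $1}')
i=0
while :; do
    i=$((i + 1))
    cp /root/data6.img /mnt/data6.img
    copy_md5=$(md5sum /mnt/data6.img | awk '{print $1}')
    if [ "$copy_md5" != "$orig_md5" ]; then
        echo "mismatch on iteration $i: got $copy_md5, expected $orig_md5"
        ls -l /root/data6.img /mnt/data6.img
        break
    fi
    rm -f /mnt/data6.img
done
```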
Through log and gdb tracing, we found that during the copy, when shard_common_inode_write_do_cbk->shard_get_delta_size_from_inode_ctx computed local->delta_size, ctx->stat.ia_size was significantly smaller than expected, so the computed local->delta_size was larger than the amount actually being written; since the delta is what gets added to trusted.glusterfs.shard.file-size, this inflates the size reported through the mount. Further tracing showed that, with a certain probability, ctx->refresh on the file's inode gets set to _gf_true during the copy, which causes the next write to go through shard_lookup_base_file_cbk->shard_inode_ctx_set. It is precisely this update that changed ctx->stat.ia_size and made shard_get_delta_size_from_inode_ctx miscalculate local->delta_size.
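For completeness, the tracing described above can be repeated by attaching gdb to the FUSE client and breaking on the functions named here (the pgrep pattern is an assumption about how the mount process shows up in the process list; debug symbols for the shard xlator are needed):

```
# Attach to the glusterfs client process serving /mnt.
gdb -p "$(pgrep -f 'glusterfs.*/mnt' | head -n1)"

# Then, inside gdb, stop where the delta is computed and where the cached
# stat gets overwritten, and inspect ctx->stat.ia_size at each stop:
#   (gdb) break shard_get_delta_size_from_inode_ctx
#   (gdb) break shard_inode_ctx_set
#   (gdb) continue
```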
Why is ctx->refresh on the file's inode set to _gf_true with a certain probability during the write? In our environment it is most likely because our upper-layer application frequently reads the contents of the /mnt folder: in the glusterfs shard_readdirp code, ctx->refresh is set to _gf_true under certain conditions.
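If that is right, the race should become much easier to hit by generating readdirp traffic against the mount while the copy loop above runs; a sketch:

```
# Hammer the directory with listings while the copy test runs; each
# readdirp can mark the base file's inode ctx for refresh.
while :; do
    ls -l /mnt > /dev/null
done &
READDIR_PID=$!

# ... run the copy/compare loop from above here ...

kill "$READDIR_PID"
```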
Another interesting observation: the problem does not seem to occur when performance.write-behind is turned off. I don't know whether there is any connection between the two.
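For comparison runs, write-behind can be toggled on the live volume with the standard CLI:

```
# Disable write-behind, rerun the copy test, then restore the setting.
gluster volume set data performance.write-behind off
# ... repeat the copy/compare test ...
gluster volume set data performance.write-behind on
```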