gluster / glusterfs

Gluster Filesystem : Build your distributed storage in minutes
https://www.gluster.org
GNU General Public License v2.0
4.73k stars 1.08k forks source link

[bug:1773476] gluster does not return correct filesize and blocksize after ftruncate #860

Closed gluster-ant closed 4 years ago

gluster-ant commented 4 years ago

URL: https://bugzilla.redhat.com/1773476 Creator: zz.sh.cynthia at gmail Time: 20191118T08:53:40

Created attachment 1637237 execute this file with a glusterfs file as argument

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce: 1. 2. 3.

Actual results:

Expected results:

Additional info:

gluster-ant commented 4 years ago

Time: 20191118T08:55:11 zz.sh.cynthia at gmail commented: During my recent test of glusterfs7 I find a very confusing issue.it is a permanent issue that with enclosed test.c test crash each time(not glusterfs but the test app). Also, ftruncate the fstat does not return correct file size and block size and I think this is a bug of glusterfs7, when I use glusterfs 3.12 there is no such issue.

Test steps Pre-codition: set volume cluster.consistent-metadata: on Just execute test.c choose a glusterfs folder file to be the argument Test.c logic: 1> create a file with fopen 2> fstat the file 3> ftruncate the file 4> mmap the file 5> write to file - crash

./test /mnt/export/testrty

ID of containing device: [0,2a] File type: regular file I-node number: -5612996300533715526 Mode: 100644 (octal) Link count: 1 Ownership: UID=0 GID=615 Preferred I/O block size: 131072 bytes File size: 0 bytes Blocks allocated: 0 Last status change: Mon Nov 18 09:53:40 2019 Last file access: Mon Nov 18 09:53:40 2019 Last file modification: Mon Nov 18 09:53:40 2019 ftruncate is called, size= 52

File block after truncate: 0 File size after truncate: 0 Bus error (core dumped)# = crash happen in snprintf((char *)m_cells, sizeof(struct DataStorageCell), "persistent data %ui", (unsigned int)sizeof(struct DataStorageCell));

[New LWP 7198] Core was generated by `./a.out /mnt/log/test'. Program terminated with signal SIGBUS, Bus error.

0 0x00007f66829c4c5d in vsnprintf () from /lib64/libc.so.6

Missing separate debuginfos, use: dnf debuginfo-install glibc-2.28-40.wf30.x86_64 (gdb) bt

0 0x00007f66829c4c5d in vsnprintf () from /lib64/libc.so.6

1 0x00007f66829a4263 in snprintf () from /lib64/libc.so.6

2 0x00000000004015f6 in main ()

(gdb) quit [root@mn-0:/var/lib/systemd/coredump]

cd /home/robot

[root@mn-0:/home/robot]

When I set cluster.consistent-metadata to “off”, this crash issue disappear, and the filesize and blocks is correct!! [root@mn-0:/home/robot]

./test /mnt/export/testrtyer

ID of containing device: [0,2a] File type: regular file I-node number: -5858193331384707222 Mode: 100644 (octal) Link count: 1 Ownership: UID=0 GID=615 Preferred I/O block size: 131072 bytes File size: 0 bytes Blocks allocated: 0 Last status change: Mon Nov 18 09:55:57 2019 Last file access: Mon Nov 18 09:55:57 2019 Last file modification: Mon Nov 18 09:55:57 2019 ftruncate is called, size= 52

File block after truncate: 1 File size after truncate: 52 [root@mn-0:/home/robot]

After I study the code I think the difference is when consistent metadata option is on __afr_inode_write_cbk will clean the prebuf and postbuf, this cause the fuse get all-zero prebuf and postbuf , so fstat get 0 filesize and blocks, and cause app crashed in snprintf . This issue is permanent in my test env(glusterfs7.0 community code), could you please try this test.c I think this is a bug, in glusterfs3.12 version there is no such issue.

Some config in my env: [root@mn-0:/home/robot]

gluster v info export

Volume Name: export Type: Replicate Volume ID: 69b29303-36e0-4146-b4d5-adf0dcbe3a48 Status: Started Snapshot Count: 0 Number of Bricks: 1 x 3 = 3 Transport-type: tcp Bricks: Brick1: mn-0.local:/mnt/bricks/export/brick Brick2: mn-1.local:/mnt/bricks/export/brick Brick3: dbm-0.local:/mnt/bricks/export/brick Options Reconfigured: cluster.entry-self-heal: on cluster.data-self-heal: on cluster.metadata-self-heal: on performance.client-io-threads: off cluster.heal-timeout: 60 cluster.favorite-child-policy: mtime network.ping-timeout: 42 server.allow-insecure: on cluster.consistent-metadata: on cluster.quorum-reads: true cluster.quorum-type: auto cluster.server-quorum-type: none transport.address-family: inet storage.fips-mode-rchecksum: on nfs.disable: on diagnostics.client-log-level: INFO cluster.server-quorum-ratio: 51%

gluster-ant commented 4 years ago

Time: 20191120T06:28:58 ravishankar at redhat commented: Looks like a fuse (kernel) caching issue to me. If I mount the volume with attribute-timeout and entry-timeout set to zero, I don't see the crash and I get the correct stat size Can you confirm if this works?

  1. mount -t glusterfs -o attribute-timeout=0 -o entry-timeout=0 IP:volname /mout-point
  2. Run your test.c program.
gluster-ant commented 4 years ago

Time: 20191120T06:40:09 ravishankar at redhat commented: (In reply to Ravishankar N from comment #2)

Looks like a fuse (kernel) caching issue to me. But since there is no crash on glusterfs-3.12 using the same fuse kernel bits, there might be something more to it in gluster itself that needs further investigation.

gluster-ant commented 4 years ago

Time: 20191125T05:32:13 zz.sh.cynthia at gmail commented: ok, i will test

gluster-ant commented 4 years ago

Time: 20191125T05:58:07 ravishankar at redhat commented: (In reply to zhou lin from comment #4)

ok, i will test

Thanks. I haven't found what the problem is yet but in case if this helps you debug yourself, I found a fairly simple reproducer: The problem is that if you immediately do a stat after truncate (to non-zero size), it is getting size as zero as you observed. But if you wait for 1 second, it gets the right size.

[root@vm1 fuse_mnt]# TIME=0 [root@vm1 fuse_mnt]# FILE=file1

[root@vm1 fuse_mnt]# truncate -s 52 $FILE && sleep $TIME && stat -c "%s" $FILE 0

[root@vm1 fuse_mnt]# TIME=1 [root@vm1 fuse_mnt]# truncate -s 52 $FILE && sleep $TIME && stat -c "%s" $FILE 52

I also found that if you disable 'readdir-ahead' translator, you get the correct size even if you mount without attribute-timeout=0 and entry-timeout=0.

gluster-ant commented 4 years ago

Time: 20191125T06:08:38 zz.sh.cynthia at gmail commented: hi, i've test mouting with options, this works, the coredump issue is gone. however, i am wondering why in glusterfs3.12.15 there is no such issue even when we set cluster.consisntent-metadata to "on" ? from user pov, we usually do not konw use extra mount options, so this seems like a bug for me.

gluster-ant commented 4 years ago

Time: 20191125T11:33:15 kkeithle at redhat commented: The commmon-ha component is for the ganesha HA bits.

gluster-ant commented 4 years ago

Time: 20191128T06:37:31 ravishankar at redhat commented: (In reply to Ravishankar N from comment #5)

So the reason it is not happening in 3.12.x is because it did not have the function rda_inode_ctx_update_iatts() introduced by the commit "*c9bde3021 - performance/readdir-ahead: keep stats of cached dentries in sync with modifications"

What is happening is this: -When consistent-metadata is enabled, afr calls afr_zero_fill_stat() in its truncate call back and sets ctime and nlink to zero in the postbuf. This behaviour is already there in glusterfs 3.12 This forces fuse to trigger a lookup which hits AFR (and thus the bricks) and the application gets the current ia_size when it does the stat. -But in the current code-base, when ctime is zero, the function rda_inode_ctx_update_iatts() zeroes out everything else also in the postbuf except gfid and ia_type. This somehow is not causing FUSE to trigger a lookup.

Raghavendra, is there some field in the postbuf that we need to retain and not zero out (in addition to the gfid and ia_type) in rda_inode_ctx_update_iatts() so that fuse sends a lookup when application does a stat?

gluster-ant commented 4 years ago

Time: 20191128T06:41:18 ravishankar at redhat commented: Example postbuf values from truncate cbks as seen in gdb:

Sent by AFR to readdir-ahead:

$6 = {ia_flags = 6143, ia_ino = 10578400756749423401, ia_dev = 64768, ia_rdev = 0, ia_size = 52, ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_blksize = 4096, ia_blocks = 0, ia_atime = 1574661108, ia_mtime = 1574853910, ia_ctime = 0, ia_btime = 0, ia_atime_nsec = 650755047, ia_mtime_nsec = 928005039, ia_ctime_nsec = 928005039, ia_btime_nsec = 0, ia_attributes = 0, ia_attributes_mask = 0, ia_gfid = "\004\343F\357\225sB\346\222\316\aq\005\305\323)", ia_type = IA_IFREG, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 1 '\001', write = 1 '\001', exec = 0 '\000'}, group = {read = 1 '\001', write = 0 '\000', exec = 0 '\000'}, other = {read = 1 '\001', write = 0 '\000', exec = 0 '\000'}}}

Sent by readdir-ahead to fuse-bridge:

$7 = {ia_flags = 0, ia_ino = 0, ia_dev = 0, ia_rdev = 0, ia_size = 0, ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_blksize = 0, ia_blocks = 0, ia_atime = 0, ia_mtime = 0, ia_ctime = 0, ia_btime = 0, ia_atime_nsec = 0, ia_mtime_nsec = 0, ia_ctime_nsec = 0, ia_btime_nsec = 0, ia_attributes = 0, ia_attributes_mask = 0, ia_gfid = "\004\343F\357\225sB\346\222\316\aq\005\305\323)", ia_type = IA_IFREG, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}}}

Sent by fuse-bridge to the fuse-kernel:

$9 = {ia_flags = 0, ia_ino = 0, ia_dev = 0, ia_rdev = 0, ia_size = 0, ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_blksize = 131072, ia_blocks = 0, ia_atime = 0, ia_mtime = 0, ia_ctime = 0, ia_btime = 0, ia_atime_nsec = 0, ia_mtime_nsec = 0, ia_ctime_nsec = 0, ia_btime_nsec = 0, ia_attributes = 0, ia_attributes_mask = 0, ia_gfid = "\004\343F\357\225sB\346\222\316\aq\005\305\323)", ia_type = IA_IFREG, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}}}

gluster-ant commented 4 years ago

Time: 20191129T04:08:01 rgowdapp at redhat commented: (In reply to Ravishankar N from comment #8)

(In reply to Ravishankar N from comment #5)

So the reason it is not happening in 3.12.x is because it did not have the function rda_inode_ctx_update_iatts() introduced by the commit "*c9bde3021 - performance/readdir-ahead: keep stats of cached dentries in sync with modifications"

What is happening is this: -When consistent-metadata is enabled, afr calls afr_zero_fill_stat() in its truncate call back and sets ctime and nlink to zero in the postbuf. This behaviour is already there in glusterfs 3.12 This forces fuse to trigger a lookup which hits AFR (and thus the bricks) and the application gets the current ia_size when it does the stat. -But in the current code-base, when ctime is zero, the function rda_inode_ctx_update_iatts() zeroes out everything else also in the postbuf except gfid and ia_type. This somehow is not causing FUSE to trigger a lookup.

Raghavendra, is there some field in the postbuf that we need to retain and not zero out (in addition to the gfid and ia_type) in rda_inode_ctx_update_iatts() so that fuse sends a lookup when application does a stat?

comment #5 says things work fine if we wait for 1s which is incidentally 'attribute-timeout' default value. When we wrote that patch in readdir-ahead, we were expecting kernel to only refresh the 'entry' (path to inode resolution) without touching the 'attributes' (rest of the stat). It looks like something is broken there.

Just to get more details, when was the last stat sent (before zeroed out postbuf was sent)? Is the 1s you mentioned was,

  1. after the lookup/readdirp response that returned zeroed out stat (except for gfid and type)?
  2. after a valid lookup/readdirp/stat/fstat response that returned a completely valid stat (all fields are valid)?

A more correct way to make sure kernel forgets the inode and initiates a lookup (without being subjected to various timeouts like attribute-timeout, entry-timeout, md-cache-timeout) is to call inode_invalidate on the inode.

gluster-ant commented 4 years ago

Time: 20191129T04:11:58 rgowdapp at redhat commented:

Just to get more details, when was the last stat sent (before zeroed out postbuf was sent)? Is the 1s you mentioned was,

... when was the last stat (with all fields valid) sent...

gluster-ant commented 4 years ago

Time: 20191129T04:55:09 ravishankar at redhat commented: (In reply to Raghavendra G from comment #10) (In reply to Raghavendra G from comment #11)

Just to get more details, when was the last stat (with all fields valid) sent? Is the 1s you mentioned was,

  1. after the lookup/readdirp response that returned zeroed out stat (except for gfid and type)?
  2. after a valid lookup/readdirp/stat/fstat response that returned a completely valid stat (all fields are valid)?

For the test "[root@vm1 fuse_mnt]# truncate -s 52 $FILE && stat -c "%s" $FILE", here are the fops and their counts from profile info: Note: there is no readdir(p) involved. With readdir-ahead on:

1 RELEASE 1 FLUSH 2 FINODELK 1 OPEN 1 FTRUNCATE 2 FXATTROP 2 LOOKUP

With readdir-ahead disabled:

1 RELEASE 2 FINODELK 1 FLUSH 1 OPEN 1 FTRUNCATE 2 FXATTROP 1 GETXATTR 4 LOOKUP

So with readdir-ahead disabled (i.e. rda_inode_ctx_update_iatts() is not involved), there are 2 extra lookups after the ftruncate that helps getting the correct size. With readdir ahead enabled, those 2 lookup do not seem to be coming which means the kernel is still considering the values returned during the truncate cbk as valid.

stale[bot] commented 4 years ago

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot] commented 4 years ago

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.