Closed ryao closed 11 years ago
Why in particular do you suspect mmap() rather than something like the inode updates? It's true that when using mmap() zfs keeps a copy of the data in the page cache and one in the arc. However, we very carefully keep them in sync and this is actually something which xfstests does stress quite heavily.
Anthony G. Basile, one of the Gentoo Hardened developers, and I spent some time debugging this. The following program should be equivalent to scanelf -Xxz -r a.out
, but if I replace scanelf -Xxz -r a.out
with it, I cannot reproduce this issue:
include <sys/types.h>
include <sys/stat.h>
include
int main() { int fd = open("a.out", O_RDWR); lseek(fd, 0x10, SEEK_CUR); write(fd, "A", 1); close(fd); }
The main difference appears to be the use of mmap() according to strace, which caused me to suspect the mmap() code. I did not consider inode updates when debugging this.
I am not able to give ZFS the attention I would like to give it at the present time. I will look at inode updates when I get a chance.
There's something worse: (the problem "traverses" sync and sleep )
sigmaii tmp # echo 'int main(){}' >hello.c
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && tar -cf hello.tgz a.out
TYPE PAX FILE
ET_EXEC ---xer a.out
tar: a.out: file changed as we read it
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sleep 10 && tar -cf hello.tgz a.out
TYPE PAX FILE
ET_EXEC ---xer a.out
tar: a.out: file changed as we read it
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && sleep 10 && sync && tar -cf hello.tgz a.out
TYPE PAX FILE
ET_EXEC ---xer a.out
tar: a.out: file changed as we read it
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && sleep 10 && sync && sleep 10 &&tar -cf hello.tgz a.out
TYPE PAX FILE
ET_EXEC ---xer a.out
tar: a.out: file changed as we read it
Those results are systematical,not the result of a single unlucky run.
The system was idle during this test. The pool has a log device on a OCZ Vertex 3, so transaction commits should be a lot faster than 10 seconds. I had 13GB of RAM free. (tot. 16GB)
Kernel is 3.1.10 and ZFS version is very old, around commit zfsonlinux/zfs@ab26409db753bb087842ab6f1af943f3386c764f, but it is so stable and fast that I'm not minimally spurred into upgrading.
As an additional data point, gcc hello.c && scanelf -Xxz -r a.out && sync && tar -cf hello.tgz a.out
also suffers from this issue. :/
This may be relatively harmless (but still a bug we need to fix). It's my understanding that the "file changed as we read it" tar warning message occurs when the mtime of the source file changes between the start and end of the file read. It's possible we're not strictly updating the mtime when the file is written via mmap(2), however the contents of the file are kept properly in sync. When you replace scanelf
with your custom app now get a proper mtime update and the warning goes away.
Frequently {m,c}time goes backwards!!
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && stat a.out &&tar -cf hello.tgz a.out ; stat a.out
TYPE PAX FILE
ET_EXEC ---xer a.out
File: 'a.out'
Size: 7831 Blocks: 7 IO Block: 8192 regular file
Device: 10h/16d Inode: 8927439 Links: 1
Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2012-05-31 20:25:50.305634000 +0200
Modify: 2012-05-31 20:25:50.302329865 +0200
Change: 2012-05-31 20:25:50.302329865 +0200
Birth: -
tar: a.out: file changed as we read it
File: 'a.out'
Size: 7831 Blocks: 7 IO Block: 8192 regular file
Device: 10h/16d Inode: 8927439 Links: 1
Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2012-05-31 20:25:51.755515000 +0200
Modify: 2012-05-31 20:25:50.300682000 +0200
Change: 2012-05-31 20:25:50.300773000 +0200
Birth: -
This affects generation of gcc binary packages on Gentoo. I have known about this for a while, but I could not reproduce it without building GCC on Gentoo until recently.
This is still an issue as of 0.6.0-rc11.
A few observations.
gcc hello.c && scanelf -Xxz -r a.out && bsdtar -cf hello.tgz a.out
does not exhibit any issues.gcc hello.c && scanelf -Xxz -r a.out && touch a.out && tar -cf hello.tgz a.out
This suggests that this issue is a ZFSOnLinux-specific regression in how mtime updates are handled. It seems that GNU Tar is the only software sensitive to this.
Rick Farina informed me in IRC that this behavior breaks Gentoo's catalyst software when used on ZFS. That software is used by Gentoo's Release Engineering team to generate LiveCDs and stage3 tarballs and breaking it is a fairly serious thing.
Since FreeBSD is unaffected, I took a look at their code. They wrote zfs_delmap() to handle this. Unfortunately, Linux's memory management code does not provide similarity functionality, so it is not portable to Linux.
When munmap() is called, Linux will tear down the mapping, but it will not notify the filesystem until the last open file handle is closed. It should be possible to address this by updating mtime on dirty files when release() is called.
I wrote a patch based on that idea. It made the minimal test case work:
diff --git a/module/zfs/zfs_vnops.c b/module/zfs/zfs_vnops.c
index 89f0f60..2b4ddcd 100644
--- a/module/zfs/zfs_vnops.c
+++ b/module/zfs/zfs_vnops.c
@@ -222,6 +222,13 @@ zfs_close(struct inode *ip, int flag, cred_t *cr)
ZFS_ENTER(zsb);
ZFS_VERIFY_ZP(zp);
+ /*
+ * Update mtime
+ * FIXME: This should only be done on dirty znodes
+ */
+ if ( 1 == zp->z_is_mapped )
+ zfs_inode_update(zp);
+
/*
* Zero the synchronous opens in the znode. Under Linux the
* zfs_close() hook is not symmetric with zfs_open(), it is
Unfortunately, I am only checking that the file is mmapped when I also need to check that it is dirty. The consequence is that any files that are mmapped, but not modified, will have their mtimes updated. I will look into how to check to see if the file is dirty later. I appreciate any comments that people have on this.
@ryao Your patch seems to fix the issue, but I don't think it quite gets at the heart of the problem. The issue I thought was that zfs_putpage()
was only updating the znode and never the inode which resulted in lazy inode mtime update which upset tar.
So I fixed this with the following debug patch zfs_putpage()->zfs_tstamp_update_setup()
and to my surprise this didn't resolve the issue.
@@ -1095,6 +1095,7 @@ zfs_tstamp_update_setup(znode_t *zp, uint_t flag, uint64_t
if (flag & ATTR_MTIME) {
ZFS_TIME_ENCODE(&now, mtime);
+ ZFS_TIME_DECODE(&ZTOI(zp)->i_mtime, mtime);
if (ZTOZSB(zp)->z_use_fuids) {
zp->z_pflags |= (ZFS_ARCHIVE |
ZFS_AV_MODIFIED);
However your patch does seem to solve the problem and I don't quite see why. It just ensures that the znode and inode mtime (and all the other attributes) are in sync at the final close. The question is why isn't that already the case with the above patch in place? We must be missing another update somewhere in the code.
@ryao Can you explain what's going on here? What am I missing?
The consequence is that any files that are mmapped, but not modified, will have their mtimes updated.
Yes and no. Their mtimes will be updated but only to reflect the mtime stored in the znode which should be the same as the current inode mtimes.
This whole znode/inode->mtime issue is a lingering bit of cleanup for the Linux port. Under Solaris I believe this information wasn't stored in their generic vnode structure so they instead put it in their znode structure. Now under Linux we have an inode which should be the authoritative location for this but we always have the existing code which expects it to be in the znode. Therefore, as a merium term fix I added a zfs_inode_update()
function to sync the znode fields with the inode fields.
What I've always meant to do is go back and unify the two locations in to a single one where possible.
@behlendorf zfs_putpage is invoked when the flush occurs. That is sometime after final close and applications expect to see an update on close. Since known cases where this is a problem only involve a single open file handle, updating mtime at final close makes things work. @maxximino commented that the time is going backward, which implies a cached time is being set on flush. I have not spotted the code responsible for that, but it seems that it will not move mtime to be some time in the past if we update it on the final close, which makes userland happy.
On second thought, the backward time is likely still occurring, but my minimal test case does not detect it. That would explain a report by Rick Farina (Zero_Chaos in IRC) that my patch didn't solve the problem for him.
I have verified that this patch solves the original problem that was detected in Gentoo. It appears that Zero_Chaos had applied my patch to sys-fs/zfs instead of sys-fs/zfs-kmod, which prevented it from having any effect on his system.
I suspect that there is a race condition in the mmap code.
gcc hello.c && scanelf -Xxz -r a.out && tar -cf hello.tgz a.out
will work with a simple hello world program on any filesystem, but ZFS, where we see 'tar: a.out: file changed as we read it' printed as an error message.There is a downstream Gentoo bug tracking this issue:
https://bugs.gentoo.org/show_bug.cgi?id=411555