openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.56k stars 1.74k forks source link

tar: a.out: file changed as we read it #764

Closed ryao closed 11 years ago

ryao commented 12 years ago

I suspect that there is a race condition in the mmap code. gcc hello.c && scanelf -Xxz -r a.out && tar -cf hello.tgz a.out will work with a simple hello world program on any filesystem, but ZFS, where we see 'tar: a.out: file changed as we read it' printed as an error message.

There is a downstream Gentoo bug tracking this issue:

https://bugs.gentoo.org/show_bug.cgi?id=411555

behlendorf commented 12 years ago

Why in particular do you suspect mmap() rather than something like the inode updates? It's true that when using mmap() zfs keeps a copy of the data in the page cache and one in the arc. However, we very carefully keep them in sync and this is actually something which xfstests does stress quite heavily.

ryao commented 12 years ago

Anthony G. Basile, one of the Gentoo Hardened developers, and I spent some time debugging this. The following program should be equivalent to scanelf -Xxz -r a.out, but if I replace scanelf -Xxz -r a.out with it, I cannot reproduce this issue:

include <sys/types.h>

include <sys/stat.h>

include

int main() { int fd = open("a.out", O_RDWR); lseek(fd, 0x10, SEEK_CUR); write(fd, "A", 1); close(fd); }

The main difference appears to be the use of mmap() according to strace, which caused me to suspect the mmap() code. I did not consider inode updates when debugging this.

I am not able to give ZFS the attention I would like to give it at the present time. I will look at inode updates when I get a chance.

maxximino commented 12 years ago

There's something worse: (the problem "traverses" sync and sleep )

sigmaii tmp # echo 'int main(){}' >hello.c
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && tar -cf hello.tgz a.out
 TYPE    PAX   FILE 
ET_EXEC ---xer a.out 
tar: a.out: file changed as we read it
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sleep 10 && tar -cf hello.tgz a.out
 TYPE    PAX   FILE 
ET_EXEC ---xer a.out 
tar: a.out: file changed as we read it
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && sleep 10 && sync && tar -cf hello.tgz a.out
 TYPE    PAX   FILE 
ET_EXEC ---xer a.out 
tar: a.out: file changed as we read it
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && sleep 10 && sync && sleep 10 &&tar -cf hello.tgz a.out
 TYPE    PAX   FILE 
ET_EXEC ---xer a.out 
tar: a.out: file changed as we read it

Those results are systematical,not the result of a single unlucky run.

The system was idle during this test. The pool has a log device on a OCZ Vertex 3, so transaction commits should be a lot faster than 10 seconds. I had 13GB of RAM free. (tot. 16GB)

Kernel is 3.1.10 and ZFS version is very old, around commit zfsonlinux/zfs@ab26409db753bb087842ab6f1af943f3386c764f, but it is so stable and fast that I'm not minimally spurred into upgrading.

ryao commented 12 years ago

As an additional data point, gcc hello.c && scanelf -Xxz -r a.out && sync && tar -cf hello.tgz a.out also suffers from this issue. :/

behlendorf commented 12 years ago

This may be relatively harmless (but still a bug we need to fix). It's my understanding that the "file changed as we read it" tar warning message occurs when the mtime of the source file changes between the start and end of the file read. It's possible we're not strictly updating the mtime when the file is written via mmap(2), however the contents of the file are kept properly in sync. When you replace scanelf with your custom app now get a proper mtime update and the warning goes away.

maxximino commented 12 years ago

Frequently {m,c}time goes backwards!!

sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && stat a.out &&tar -cf hello.tgz a.out ; stat a.out
 TYPE    PAX   FILE 
ET_EXEC ---xer a.out 
  File: 'a.out'
  Size: 7831            Blocks: 7          IO Block: 8192   regular file
Device: 10h/16d Inode: 8927439     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-05-31 20:25:50.305634000 +0200
Modify: 2012-05-31 20:25:50.302329865 +0200
Change: 2012-05-31 20:25:50.302329865 +0200
 Birth: -
tar: a.out: file changed as we read it
  File: 'a.out'
  Size: 7831            Blocks: 7          IO Block: 8192   regular file
Device: 10h/16d Inode: 8927439     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-05-31 20:25:51.755515000 +0200
Modify: 2012-05-31 20:25:50.300682000 +0200
Change: 2012-05-31 20:25:50.300773000 +0200
 Birth: -
ryao commented 12 years ago

This affects generation of gcc binary packages on Gentoo. I have known about this for a while, but I could not reproduce it without building GCC on Gentoo until recently.

ryao commented 12 years ago

This is still an issue as of 0.6.0-rc11.

ryao commented 12 years ago

A few observations.

  1. This issue cannot be reproduced using BSD Tar in place of GNU Tar. i.e. gcc hello.c && scanelf -Xxz -r a.out && bsdtar -cf hello.tgz a.out does not exhibit any issues.
  2. Doing a touch operation on the file before executing tar will prevent this issue from occurring. e.g. gcc hello.c && scanelf -Xxz -r a.out && touch a.out && tar -cf hello.tgz a.out
  3. The md5sum of a.out is always consistent.
  4. I cannot reproduce this issue on FreeBSD 9.1-BETA1.

This suggests that this issue is a ZFSOnLinux-specific regression in how mtime updates are handled. It seems that GNU Tar is the only software sensitive to this.

ryao commented 12 years ago

Rick Farina informed me in IRC that this behavior breaks Gentoo's catalyst software when used on ZFS. That software is used by Gentoo's Release Engineering team to generate LiveCDs and stage3 tarballs and breaking it is a fairly serious thing.

Since FreeBSD is unaffected, I took a look at their code. They wrote zfs_delmap() to handle this. Unfortunately, Linux's memory management code does not provide similarity functionality, so it is not portable to Linux.

https://github.com/freebsd/freebsd-head/blob/master/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c

When munmap() is called, Linux will tear down the mapping, but it will not notify the filesystem until the last open file handle is closed. It should be possible to address this by updating mtime on dirty files when release() is called.

I wrote a patch based on that idea. It made the minimal test case work:

diff --git a/module/zfs/zfs_vnops.c b/module/zfs/zfs_vnops.c
index 89f0f60..2b4ddcd 100644
--- a/module/zfs/zfs_vnops.c
+++ b/module/zfs/zfs_vnops.c
@@ -222,6 +222,13 @@ zfs_close(struct inode *ip, int flag, cred_t *cr)
    ZFS_ENTER(zsb);
    ZFS_VERIFY_ZP(zp);

+   /* 
+    * Update mtime
+    * FIXME: This should only be done on dirty znodes
+    */
+   if ( 1 == zp->z_is_mapped )
+       zfs_inode_update(zp);
+
    /*
     * Zero the synchronous opens in the znode.  Under Linux the
     * zfs_close() hook is not symmetric with zfs_open(), it is

Unfortunately, I am only checking that the file is mmapped when I also need to check that it is dirty. The consequence is that any files that are mmapped, but not modified, will have their mtimes updated. I will look into how to check to see if the file is dirty later. I appreciate any comments that people have on this.

behlendorf commented 12 years ago

@ryao Your patch seems to fix the issue, but I don't think it quite gets at the heart of the problem. The issue I thought was that zfs_putpage() was only updating the znode and never the inode which resulted in lazy inode mtime update which upset tar.

So I fixed this with the following debug patch zfs_putpage()->zfs_tstamp_update_setup() and to my surprise this didn't resolve the issue.

@@ -1095,6 +1095,7 @@ zfs_tstamp_update_setup(znode_t *zp, uint_t flag, uint64_t

        if (flag & ATTR_MTIME) {
                ZFS_TIME_ENCODE(&now, mtime);
+               ZFS_TIME_DECODE(&ZTOI(zp)->i_mtime, mtime);
                if (ZTOZSB(zp)->z_use_fuids) {
                        zp->z_pflags |= (ZFS_ARCHIVE |
                            ZFS_AV_MODIFIED);

However your patch does seem to solve the problem and I don't quite see why. It just ensures that the znode and inode mtime (and all the other attributes) are in sync at the final close. The question is why isn't that already the case with the above patch in place? We must be missing another update somewhere in the code.

@ryao Can you explain what's going on here? What am I missing?

The consequence is that any files that are mmapped, but not modified, will have their mtimes updated.

Yes and no. Their mtimes will be updated but only to reflect the mtime stored in the znode which should be the same as the current inode mtimes.

This whole znode/inode->mtime issue is a lingering bit of cleanup for the Linux port. Under Solaris I believe this information wasn't stored in their generic vnode structure so they instead put it in their znode structure. Now under Linux we have an inode which should be the authoritative location for this but we always have the existing code which expects it to be in the znode. Therefore, as a merium term fix I added a zfs_inode_update() function to sync the znode fields with the inode fields.

What I've always meant to do is go back and unify the two locations in to a single one where possible.

ryao commented 12 years ago

@behlendorf zfs_putpage is invoked when the flush occurs. That is sometime after final close and applications expect to see an update on close. Since known cases where this is a problem only involve a single open file handle, updating mtime at final close makes things work. @maxximino commented that the time is going backward, which implies a cached time is being set on flush. I have not spotted the code responsible for that, but it seems that it will not move mtime to be some time in the past if we update it on the final close, which makes userland happy.

ryao commented 12 years ago

On second thought, the backward time is likely still occurring, but my minimal test case does not detect it. That would explain a report by Rick Farina (Zero_Chaos in IRC) that my patch didn't solve the problem for him.

ryao commented 11 years ago

I have verified that this patch solves the original problem that was detected in Gentoo. It appears that Zero_Chaos had applied my patch to sys-fs/zfs instead of sys-fs/zfs-kmod, which prevented it from having any effect on his system.