python / cpython

The Python programming language
https://www.python.org
Other
63.62k stars 30.47k forks source link

test.test_pathlib.PosixPathTest.test_touch_common fails on FreeBSD with ZFS #64037

Closed 5df057ac-c83d-447e-8c45-910556b17608 closed 2 years ago

5df057ac-c83d-447e-8c45-910556b17608 commented 10 years ago
BPO 19838
Nosy @jcea, @pitrou, @larryhastings, @ezio-melotti, @PCManticore, @koobs, @iritkatriel
Files
  • issue19838.patch
  • c_test_case_for_issue_19838.c: Simple test case written in C reproducing the behavior
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = created_at = labels = ['type-bug', 'tests'] title = 'test.test_pathlib.PosixPathTest.test_touch_common fails on FreeBSD with ZFS' updated_at = user = 'https://github.com/PCManticore' ``` bugs.python.org fields: ```python activity = actor = 'larry' assignee = 'none' closed = True closed_date = closer = 'larry' components = ['Tests'] creation = creator = 'Claudiu.Popa' dependencies = [] files = ['35553', '36262'] hgrepos = [] issue_num = 19838 keywords = ['patch'] message_count = 38.0 messages = ['204786', '204794', '204795', '204813', '204814', '204815', '220143', '220144', '220167', '224056', '224063', '224064', '224066', '224625', '224627', '224628', '224640', '224642', '224656', '224664', '224679', '224683', '224684', '224685', '224686', '224687', '224698', '224703', '224705', '224706', '224750', '224751', '224770', '224779', '224789', '224810', '408377', '408426'] nosy_count = 10.0 nosy_names = ['jcea', 'pitrou', 'larry', 'ezio.melotti', 'Claudiu.Popa', 'francismb', 'koobs', 'puppet', 'larstiq', 'iritkatriel'] pr_nums = [] priority = 'normal' resolution = 'out of date' stage = 'resolved' status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue19838' versions = ['Python 3.4'] ```

    5df057ac-c83d-447e-8c45-910556b17608 commented 10 years ago

    Hi!

    test_touch_common fails when using 8.3-STABLE FreeBSD 8.3-STABLE and Python 3.4.0b1 (default:a0ec33efa743+, Nov 30 2013, 10:36:58).

    Here are the tracebacks:

    \====================================================================== FAIL: test_touch_common (test.test_pathlib.PathTest) ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/tank/libs/cpython/Lib/test/test_pathlib.py", line 1402, in test_touch_common
        self.assertGreaterEqual(st.st_mtime_ns, old_mtime_ns)
    AssertionError: 1385800632000000000 not greater than or equal to 1385800632871814968

    ====================================================================== FAIL: test_touch_common (test.test_pathlib.PosixPathTest) ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/tank/libs/cpython/Lib/test/test_pathlib.py", line 1402, in test_touch_common
        self.assertGreaterEqual(st.st_mtime_ns, old_mtime_ns)
    AssertionError: 1385800633000000000 not greater than or equal to 1385800633042814928

    Ran 319 tests in 0.368s

    FAILED (failures=2, skipped=85) test test_pathlib failed 1 test failed: test_pathlib

    This issue seems to be related with bpo-15745.

    pitrou commented 10 years ago

    I don't really know what to do with this. I think you'll have to investigate a bit and find out exactly what happens during the test.

    pitrou commented 10 years ago

    As a data point, if not for ZFS, test_pathlib passes on FreeBSD 6.4: http://buildbot.python.org/all/builders/x86%20FreeBSD%206.4%203.x/builds/4261/steps/test/logs/stdio and FreeBSD 7.2: http://buildbot.python.org/all/builders/x86%20FreeBSD%207.2%203.x/builds/4731/steps/test/logs/stdio

    koobs commented 10 years ago

    Is this similar/related to bpo-15745?

    I took both of my buildbots (koobs-freebsd9, koobs-freebsd10) off ZFS until it could be resolved

    koobs commented 10 years ago

    Sorry Claudiu I missed the issue reference in your comment

    5df057ac-c83d-447e-8c45-910556b17608 commented 10 years ago

    I believe it's similar, both test_os and test_pathlib fails when executed from within a ZFS container. I checked, I did a fresh checkout of Python inside a normal directory and run the tests there, they ran without problems.

    5df057ac-c83d-447e-8c45-910556b17608 commented 10 years ago

    Since bpo-15745 hasn't been fixed yet, would be okay to skip these tests when the test suite runs from a ZFS container? Currently, these failures are a nuissance when running the test suite.

    koobs commented 10 years ago

    I'd like to put the buildbot slave instances back onto ZFS for broader Disk/IO test coverage for Python and other projects as well as to gain some administrative disk utilisation flexibility.

    These two issues have unfortunately precluded that, and there's much more value to projects with a reliably green buildbot than one that is red, ending up ignored, and hiding other issues or regressions in the meantime.

    +1 on disabling these tests and leaving the issues open so someone can pick them up at a later date and have a ZFS environment in which to reproduce/resolve them in a custom builder

    5df057ac-c83d-447e-8c45-910556b17608 commented 10 years ago

    Here's a patch that marks those tests as expected failures if the source checkout is inside a zfs container. It uses df -t zfs, it was the easiest way I could find to detect that we are running from a zfs container.

    larryhastings commented 10 years ago

    This test also fails on Linux when using ZFS.

    larryhastings commented 10 years ago

    Sorry to have such an awful configuration, but this approach won't fix the problem for me.

    I use a Linux "encrypted home directory", which uses crazy loopback mount logic to create an on-the-fly decrypted representation of my home directory. So my home directory is actually on an "ecryptfs" device:

    % df -T
    Filesystem            Type      Mounted on
    ...
    home                  zfs       /home
    /home/larry/.Private  ecryptfs  /home/larry

    Maybe we could ignore deltas below a certain race-condition threshold? Perhaps a millisecond? I think there are already tests like that.

    larryhastings commented 10 years ago

    (By "this approach" I meant "the approach employed in the first patch posted". Sorry for the ambiguity.)

    pitrou commented 10 years ago

    So, I'm not against the patch, but it would be nice to diagnose where exactly the issue comes from.

    951fbf2b-0f92-4afc-bb66-f8bf4ae297f3 commented 10 years ago

    @Antoine: We have investigated a bit here and it seems that ZFS just keeps a timestamp in seconds in the memory.

    See function uberblock_update in uberblock.c ( http://people.freebsd.org/~gibbs/zfs_doxygenation/html/d3/d65/uberblock_8c_source.html - Row 57) ub->ub_timestamp = gethrestime_sec();

    951fbf2b-0f92-4afc-bb66-f8bf4ae297f3 commented 10 years ago

    After thinking and looking a bit on it, this patch isn't the best to put into the code what I can see.

    If any of these things really break, because of something completely different - we won't notice it.

    A more accurate solution is to maybe look on a different way to compare the timestamps that works on systems that just support seconds too?

    5df057ac-c83d-447e-8c45-910556b17608 commented 10 years ago

    I agree. The patch was merely a temporary solution to the real problem.

    951fbf2b-0f92-4afc-bb66-f8bf4ae297f3 commented 10 years ago

    bpo-15745 is probably related to this issue

    951fbf2b-0f92-4afc-bb66-f8bf4ae297f3 commented 10 years ago

    So me and @larstiq have been sitting with this during the sprint here and we have figured out some things but haven't had the time to put together a patch.

    If you put two p.touch() before starting the testing, it will not fail on BSD.

    What we have found out so far is that during file creation the resolution of the timestamp is higher then at the touch attempt when a file exists.

    The main issue as we see it is that os.stat can return a timestamp with more granularity than os.utime can set.

    larryhastings commented 10 years ago

    os.stat can return something more precise than nanosecond resolution?

    fa53c3d6-e97b-4089-8725-a7655c5e9dde commented 10 years ago

    What we have found out so far is that during file creation the resolution of the timestamp is higher then at the touch attempt when a file exists.

    Could it help to create 2 files (file 1, wait a bit, file 2) and then do the checks only with file 1?

    951fbf2b-0f92-4afc-bb66-f8bf4ae297f3 commented 10 years ago

    @Larry

    The issue is the following:

    In the test after the first p.touch (Pdb) old_mtime 1407128672.4133856 (Pdb) old_mtime_ns 1407128672413385711

    After second p.touch (Pdb) st.st_mtime_ns 1407128689000000000 (Pdb) st.st_mtime 1407128689.0

    So the issue is that when utime is used it can't set the time with nanosecond, but just with second resolution.

    @Francis - A quick fix that would work, but it is a bit dirty but it will work correctly is to just add two p.touch() before doing anything.

    Not really the best solution but could work.

    larryhastings commented 10 years ago

    So why can't you use the "ns" parameter for os.utime()?

    951fbf2b-0f92-4afc-bb66-f8bf4ae297f3 commented 10 years ago

    @Larry - the problem arrives after the second touch, which does os.utime(FILE, None) if the file exists.

    The os.utime that is in the test-code doesn't really break anything, what breaks the test is the second touch.

    larryhastings commented 10 years ago

    A cursory read of the source code suggests everybody is doing the right thing. Can you run this with a debug build of CPython, put a breakpoint in posix_utime, and check that when it's called for the second p.touch that it's correctly calling the underlying function to say "set the time of this file to the current time"?

    951fbf2b-0f92-4afc-bb66-f8bf4ae297f3 commented 10 years ago

    @Larry - we did that yesterday and on the FreeBSD-machine the regular utimes was used and not utimensat. utimensat is not available on FreeBSD 10 so it fall backs to regular utimes to be used in os.utime. But when a file is created it is the file system that sets the utime, so it will have nanoseconds.

    larryhastings commented 10 years ago

    And was the "time" argument passed in to utimes() set to NULL?

    951fbf2b-0f92-4afc-bb66-f8bf4ae297f3 commented 10 years ago

    I'm no expert at all at GDB, but I will give it a go and if this is wrong I can try a lot more.

    Breakpoint 1, posix_utime (self=0xf00918ed8, args=0xf014459f8, kwargs=0x0) at ./Modules/posixmodule.c:4838 4838 PyObject *times = NULL; Current language: auto; currently minimal 4839 PyObject *ns = NULL; 4840 int dir_fd = DEFAULT_DIR_FD; 4841 int follow_symlinks = 1; 4843 "follow_symlinks", NULL}; 4856 memset(&path, 0, sizeof(path)); 4857 path.function_name = "utime"; 4858 memset(&utime, 0, sizeof(utime_t)); 4860 path.allow_fd = 1; 4862 if (!PyArg_ParseTupleAndKeywords(args, kwargs, 4875 if (times && (times != Py_None) && ns) { 4903 else if (ns) { 4919 utime.now = 1; 4927 if (path_and_dir_fd_invalid("utime", &path, dir_fd) || 963 if (!path->narrow && !path->wide && (dir_fd != DEFAULT_DIR_FD)) { 4928 dir_fd_and_fd_invalid("utime", dir_fd, path.fd) || 974 if ((dir_fd != DEFAULT_DIR_FD) && (fd != -1)) { 4929 fd_and_follow_symlinks_invalid("utime", path.fd, follow_symlinks)) 986 if ((fd > 0) && (!follow_symlinks)) { 4933 if ((dir_fd != DEFAULT_DIR_FD) && (!follow_symlinks)) { 4974 Py_BEGIN_ALLOW_THREADS 4977 if ((!follow_symlinks) && (dir_fd == DEFAULT_DIR_FD)) 4989 if (path.fd != -1) 4994 result = utime_default(&utime, path.narrow); 4800 UTIME_TO_TIMEVAL; 4801 return utimes(path, time); (gdb) print time $1 = {\<text variable, no debug info>} 0xf01191d20 \<time> (gdb) print path $2 = {function_name = 0x58bb79 "utime", argument_name = 0x0, nullable = 0, allow_fd = 1, wide = 0x0, narrow = 0xf0099c0d0 "/tmp/test", fd = -1, length = 9, object = 0xf01445f90, cleanup = 0xf0099c0a0} 4996 Py_END_ALLOW_THREADS 4998 if (result \< 0) { 5006 Py_INCREF(Py_None); 787 if (path->cleanup) { 788 Py_CLEAR(path->cleanup); 5016 }

    larryhastings commented 10 years ago

    The "no debug info" makes me think you're using a release build. Please recompile making sure that "-O0" and "-g" are both set. By default the makefile has "-O3" and "-g", so normally you just change the "-O3" to "-O0", then "make clean" and "make".

    951fbf2b-0f92-4afc-bb66-f8bf4ae297f3 commented 10 years ago

    I used configure with --with-pydebug, will look closer on it when I get to the office

    951fbf2b-0f92-4afc-bb66-f8bf4ae297f3 commented 10 years ago

    To confirm, my OPT-line is OPT= -g -O0 -Wall -Wstrict-prototypes

    larryhastings commented 10 years ago

    Yup, you've got a debug build there. I'm baffled by that gdb "no debug info" line then.

    At line 4801 could you also print *utime print tv

    951fbf2b-0f92-4afc-bb66-f8bf4ae297f3 commented 10 years ago

    Larry: Result of that: (gdb) print *utime Structure has no component named operator*. (gdb) print tv No symbol "tv" in current context.

    You can also catch me on IRC with the same nickname as here

    larryhastings commented 10 years ago

    I talked to puppet on IRC for a while and we figured out the following about his OS:

    Just to confirm, he ran this script:

    import os b = '/tmp/test2' open(b,'a').close() before = os.stat(b) os.utime(b, None) after = os.stat(b) os.unlink(b) print("before:", before.st_mtime_ns) print(" after:", after.st_mtime_ns) print("before \<= after", before.st_mtime_ns \<= after.st_mtime_ns)

    and it consistently prints "before \<= after False".

    *facepalm*

    Since utimes supports microsecond resolution, we could in theory work around this problem by explicitly specifying the current time when using utimes(). If we did that, we might want to also see if this behavior affects futimes() and lutimes().

    Functions in posixmodule are expected to be atomic, and this would mean two system calls instead of one--so maybe we should only use this workaround on platforms with the bug?

    larryhastings commented 10 years ago

    And to further confirm, on my Linux machine with a ZFS partition:

    It still fails, but here the race is much closer:

    before: 1407191476502407010
     after: 1407191476502090829
    before <= after False

    If I run the test on a different filesystem (e.g. on /tmp which is ext4) the problem doesn't occur.

    My guess: it's ZFS's fault in both cases, it's not using the highest-precision timer available.

    larryhastings commented 10 years ago

    Attached is a simple C test case for the same behavior. On my Linux machine this prints "This is insane!" on a ZFS partition and "Everything is okay" on an ext4 partition.

    The test case won't work unmodified on FreeBSD. Try commenting out the call to "utimensat", and uncommenting the call to "utimes". I don't guarantee that's sufficient.

    If nothing else, we can be certain it's not a Python bug.

    951fbf2b-0f92-4afc-bb66-f8bf4ae297f3 commented 10 years ago

    Result:

    Before: 1407219004 988716892 After: 1407219004 0 Before \<= After 0 This is insane!

    iritkatriel commented 2 years ago

    bpo-15745 has been fixed in the meantime (about 6 years ago).

    If nobody objects I will close this as out of date. If there is still a problem with the test is will be reported for a current python version.

    larryhastings commented 2 years ago

    I can confirm that the behavior is fixed in ZFS on Linux. My test case C program now prints "Everything is okay." when run on a ZFS partition on Linux, and test_touch_common from the current tree passes every time. ZFS fixing this was the best possible outcome.

    I'll go ahead and just close it now--why wait! If somebody confirms that the test still fails on FreeBSD, please open a new issue.

    Thanks for checking in, Irit!