houseabsolute / Log-Dispatch

Dispatches messages to one or more outputs
https://metacpan.org/release/Log-Dispatch/
Other
12 stars 29 forks source link

Tests fail on NFS-mounted temp drive #60

Open jmaslak opened 4 years ago

jmaslak commented 4 years ago

I'm seeing an issue on a NFS-mounted temp directory with t/file-locked.t failing. It does not fail with local disks. I wouldn't be surprised if my locking configuration isn't quite right on this host (nfs4, Linux kernel 4.4.0).

Interesting output (I set the tmp directory manually to an NFS-mounted subdir under my home directory):

t/file-locked.t ..
# Subtest: close_after_write = 0
    ok 1 - no exception forking children and writing to file
    not ok 2 - 10 children exited

    #   Failed test '10 children exited'
    #   at t/file-locked.t line 109.
    #          got: '9'
    #     expected: '10'
    ok 3 - 10812 exited with 0
    ok 4 - 10813 exited with 0
    ok 5 - 10808 exited with 0
    ok 6 - 10806 exited with 0
    ok 7 - 10815 exited with 0
    ok 8 - 10809 exited with 0
    ok 9 - 10807 exited with 0
    ok 10 - 10810 exited with 0
    ok 11 - 10811 exited with 0
    not ok 12 - file contains expected content

    #   Failed test 'file contains expected content'
    #   at t/file-locked.t line 138.
    #     Structures begin differing at:
    #          $got->[9] = '2'
    #     $expected->[9] = '1'
    # 1
    # 1
    # 1
    # 1
    # 1
    # 1
    # 2
    # 2
    # 2
    # 3
    # 2
    # 2
    # 3
    # 3
    # 2
    # 3
    # 3
    # 1
    # 1
    # 2
    # 3
    # 3
    # 2
    # 3
    # 1
    # 2
    # 3
    1..12
    # Looks like you failed 2 tests of 12.
not ok 1 - close_after_write = 0
autarch commented 4 years ago

I'm not surprised that there are issues with locking on NFS. IIRC the normal flock call will not work on NFS. I think the best thing for this particular module is just to add a note to the docs.

jmaslak commented 4 years ago

Yes, although my understanding is the kernel version I'm using is supposed to emulate local locking behavior using fcntl(2) byte range locks when flock() is called, at least for nfs4. I wrote some test programs that seem to confirm that flock() (via perl) does work, at least for simple cases (one process appending to a file and grabbing an exclusive lock will block a second process trying to grab an exclusive lock on the same file).

I was able to get this work going back to the old style local-only flock locking behavior (I.E. local_lock=flock, which also requires downgrading to nfs3). I'm still kind of baffled, I'll have to read up on how locks work with nfs4 sometime. 🤷🏻‍♀️ For what I'm doing, I don't need network-aware flock(), so I guess it'll work.

autarch commented 4 years ago

Hmm, I really don't know what's going on here. One of the child processes appears to either have not run or to not have exited after 10 seconds.

So there's a couple possibilities:

All of which is to say I'm confused.

jmaslak commented 4 years ago

The processes are just taking a long time. I was seeing sometimes 8, sometimes 9 processes running right.

The locking is working fine on nfs4...sort of. The locks are properly excluding concurrent access. But they are SLOW. I also validated all the processes start, but it takes a long time sometimes to do the write to the file. If I bump the 10s alarm to 30s in t/file-locked.t, it runs just fine for me (albeit slowly). 20s was usually good enough, but every once in a while caused an issue.

I am not sure why it is so slow - this is a NAS connected via 10G to the server, in the same rack, so there isn't much latency or the like, and - but something makes these locks slow (I'll dig into that sometime, it seems slower than it should be - I wonder how the server manages multiple lock requests). Regardless, the speed issue isn't the module's fault. But in the meantime, maybe just bump the alarm time up? I also tried to drop the number of simultaneous processes down, but that didn't help (I.E. 2 children instead of 10 children still needed the 30s alarm).

autarch commented 4 years ago

Maybe the locking emulation that NFS provides is slow? And yes, we could increase the alarm time a bit.