hpc / mpifileutils

File utilities designed for scalability and performance.
https://hpc.github.io/mpifileutils
BSD 3-Clause "New" or "Revised" License
162 stars 64 forks source link

mknod() failed (errno=2 No such file or directory) #574

Open bdevcich opened 1 month ago

bdevcich commented 1 month ago

Hello,

I have been seeing this issue when using dcp to recursive copy a directory from one lustre filesystem to another. This is being ran via mpirun and over multiple hosts (e.g. mpirun -H host1,host2). I have not seen this when ran with a single host (or when the launcher is on the same host as the 1 worker/host).

The error in the dcp output suggests no such file or directory, which seems like the parent directory (e.g. /mnt/nnf/.../job2/) is not being created first. I originally thought the Original directory exists, skip the creation output might suggest that none of the directories are being created, but that output shows up the case where this issue does not surface.

I don't have a way to reproduce this directly, but if I run our tests enough times this will be hit.

Each test is copying the same source directory to essentially the same location on an ephemeral lustre file system. Here is the source directory and its contents that dcp is attempting to copy:

$ tree /lus/global/blake/dm-system-test/src
/lus/global/blake/dm-system-test/src
├── job
│   ├── data2.out
│   ├── data.out
│   └── job2
│       └── data3.out
└── job2
    ├── data2.out
    └── data.out

Here is the error output from dcp:

...
[2024-05-28T17:00:04] Walking /lus/global/blake/dm-system-test/src
[2024-05-28T17:00:04] Walked 9 items in 0.011 secs (807.850 items/sec) ...
[2024-05-28T17:00:04] Walked 9 items in 0.012 seconds (747.192 items/sec)
[2024-05-28T17:00:04] Copying to /mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0
[2024-05-28T17:00:04] Items: 9
[2024-05-28T17:00:04]   Directories: 4
[2024-05-28T17:00:04]   Files: 5
[2024-05-28T17:00:04]   Links: 0
[2024-05-28T17:00:04] Data: 60.000 B (12.000 B per file)
[2024-05-28T17:00:04] Creating 4 directories
[2024-05-28T17:00:04] Original directory exists, skip the creation: `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0' (errno=17 File exists)
[2024-05-28T17:00:04] Creating 5 files.
[2024-05-28T17:00:04] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data.out' mknod() failed (errno=2 No such file or directory)
[2024-05-28T17:00:04] [14] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data2.out' mknod() failed (errno=2 No such file or directory)
[2024-05-28T17:00:04] Copying data.
[2024-05-28T17:00:04] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data.out' (errno=2 No such file or directory)
[2024-05-28T17:00:04] [4] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data2.out' (errno=2 No such file or directory)
[2024-05-28T17:00:04] [14] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job2/data2.out' to `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data2.out'
[2024-05-28T17:00:04] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job2/data.out' to `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data.out'
[2024-05-28T17:00:04] Copy data: 60.000 B (60 bytes)
[2024-05-28T17:00:04] Copy rate: 2.796 KiB/s (60 bytes in 0.021 seconds)
[2024-05-28T17:00:04] Syncing data to disk.
[2024-05-28T17:00:04] Sync completed in 0.038 seconds.
[2024-05-28T17:00:04] Fixing permissions.
[2024-05-28T17:00:04] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data.out' chmod() (errno=2 No such file or directory)
[2024-05-28T17:00:04] [14] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data2.out' chmod() (errno=2 No such file or directory)
[2024-05-28T17:00:04] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2' chmod() (errno=2 No such file or directory)
[2024-05-28T17:00:04] Updated 9 items in 0.004 seconds (2140.376 items/sec)
[2024-05-28T17:00:04] Syncing directory updates to disk.
[2024-05-28T17:00:04] Sync completed in 0.001 seconds.
[2024-05-28T17:00:04] Started: May-28-2024,17:00:04
[2024-05-28T17:00:04] Completed: May-28-2024,17:00:04
[2024-05-28T17:00:04] Seconds: 0.105
[2024-05-28T17:00:04] Items: 7
[2024-05-28T17:00:04]   Directories: 4
[2024-05-28T17:00:04]   Files: 3
[2024-05-28T17:00:04]   Links: 0
[2024-05-28T17:00:04] Data: 36.000 B (36 bytes)
[2024-05-28T17:00:04] Rate: 343.790 B/s (036 bytes in 0.105 seconds)
[2024-05-28T17:00:04] [0] [/deps/mpifileutils/src/dcp/dcp.c:592] ERROR: One or more errors were detected while copying: MFU_ERR(-1101)
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

 Local host:  nnf-dm-worker-7c96s
 System call: unlink(2) /dev/shm/vader_segment.nnf-dm-worker-7c96s.c7370001.3
 Error:       Operation not permitted (errno 1)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

 Process name: [[50999,1],15]
 Exit code:    1
--------------------------------------------------------------------------

The system call failed during shared memory initialization is something that I see when the copy is successful so I don't think it contributes to the problem here.

Is there anything I can enable to trace this further?

bdevcich commented 1 month ago

Here's another case with dcp -d dbg enabled. According to the debug output, mkdir appears to have worked.

[2024-05-29T17:58:39] Walking /lus/global/blake/dm-system-test/src
   [2024-05-29T17:58:39] Walked 9 items in 0.013 secs (706.261 items/sec) ...
   [2024-05-29T17:58:39] Walked 9 items in 0.014 seconds (660.435 items/sec)
   [2024-05-29T17:58:39] Copying to /mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0
   [2024-05-29T17:58:39] Items: 9
   [2024-05-29T17:58:39]   Directories: 4
   [2024-05-29T17:58:39]   Files: 5
   [2024-05-29T17:58:39]   Links: 0
   [2024-05-29T17:58:39] Data: 60.000 B (12.000 B per file)
   [2024-05-29T17:58:39] Creating 4 directories
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0'
   [2024-05-29T17:58:39] Original directory exists, skip the creation: `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0' (errno=17 File exists)
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job'
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2'
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job/job2'
   [2024-05-29T17:58:39] Creating 5 files.
   [2024-05-29T17:58:39] [7] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data2.out' mknod() failed (errno=2 No such file or directory)
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data.out' mknod() failed (errno=2 No such file or directory)
   [2024-05-29T17:58:39] Copying data.
   [2024-05-29T17:58:39] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data.out' (errno=2 No such file or directory)
   [2024-05-29T17:58:39] [3] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data2.out' (errno=2 No such file or directory)
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job2/data.out' to `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data.out'
   [2024-05-29T17:58:39] [7] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job2/data2.out' to `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data2.out'
   [2024-05-29T17:58:39] Copy data: 60.000 B (60 bytes)
   [2024-05-29T17:58:39] Copy rate: 3.881 KiB/s (60 bytes in 0.015 seconds)
   [2024-05-29T17:58:39] Syncing data to disk.
   [2024-05-29T17:58:39] Sync completed in 0.037 seconds.
   [2024-05-29T17:58:39] Fixing permissions.
   [2024-05-29T17:58:39] [7] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data2.out' chmod() (errno=2 No such file or directory)
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data.out' chmod() (errno=2 No such file or directory)
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2' chmod() (errno=2 No such file or directory)
   [2024-05-29T17:58:39] Updated 9 items in 0.004 seconds (2078.435 items/sec)
   [2024-05-29T17:58:39] Syncing directory updates to disk.
   [2024-05-29T17:58:39] Sync completed in 0.001 seconds.
   [2024-05-29T17:58:39] Started: May-29-2024,17:58:39
   [2024-05-29T17:58:39] Completed: May-29-2024,17:58:39
   [2024-05-29T17:58:39] Seconds: 0.099
   [2024-05-29T17:58:39] Items: 7
   [2024-05-29T17:58:39]   Directories: 4
   [2024-05-29T17:58:39]   Files: 3
   [2024-05-29T17:58:39]   Links: 0
   [2024-05-29T17:58:39] Data: 36.000 B (36 bytes)
   [2024-05-29T17:58:39] Rate: 363.605 B/s (036 bytes in 0.099 seconds)
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/dcp/dcp.c:592] ERROR: One or more errors were detected while copying: MFU_ERR(-1101)
daltonbohning commented 1 month ago

I have not seen this when ran with a single host (or when the launcher is on the same host as the 1 worker/host)

Does that mount have some kind of caching or delayed update? I'm more familiar with DAOS than Lustre, but I've seen similar behavior with DAOS and certain cache settings where a rank creates the directory on one host, then a rank on a different host doesn't see that directory immediately

behlendorf commented 1 month ago

That's an good thought and could explain the raciness. Lustre is fully coherent across nodes but I'm not sure how dcp itself coordinates this internally. Perhaps it's possible that dcp could issue the file create on host-B before the mkdir(2) syscall on host-A completes?

daltonbohning commented 1 month ago

Perhaps it's possible that dcp could issue the file create on host-B before the mkdir(2) syscall on host-A completes?

I don't think so, unless there's a bug somewhere. This is the function that creates the directories https://github.com/hpc/mpifileutils/blob/0a4c530bca5dc2476418a3833a15ab812d636e78/src/common/mfu_flist_copy.c#L1042 And notice there is a barrier https://github.com/hpc/mpifileutils/blob/0a4c530bca5dc2476418a3833a15ab812d636e78/src/common/mfu_flist_copy.c#L1123 And that function is called before creating the files https://github.com/hpc/mpifileutils/blob/0a4c530bca5dc2476418a3833a15ab812d636e78/src/common/mfu_flist_copy.c#L2490-L2495

As far as I remember (I'm not the primary author!) dcp doesn't directly handle coherency. It creates all directories upfront, barrier, then creates files, barrier, then copies file data

ofaaland commented 1 month ago

Does the job2 directory ever get created at all? In the example above, I see that not only did the file copies to job2 fail, but the chmod on job2 itself also failed. That chmod is a third phase (mkdirs, then file copies, then set metadata), so multiple barriers later.

ofaaland commented 1 month ago

[2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2' chmod() (errno=2 No such file or directory)

This is what I'm referring to WRT to the chmod failing. I see this in both the first and second examples you posted. So, I wonder if the mkdir() failed, but for some reason dcp didn't know it failed.

Can you instrument your test to see what each node sees under the destination root directory after dcp finishes?

daltonbohning commented 1 month ago

So, I wonder if the mkdir() failed, but for some reason dcp didn't know it failed.

Good point. If mkdir fails, it looks like the copy continues anyway https://github.com/hpc/mpifileutils/blob/0a4c530bca5dc2476418a3833a15ab812d636e78/src/common/mfu_flist_copy.c#L2490-L2499

bdevcich commented 1 month ago

I was able to add a pause in our code after the dcp failure occurs to go take a look at the lustre filesystem on each of the nodes. Obviously this isn't in real time and any sort of caching issue would have time to catch up since I have to do this manually right now, but I was able to take a peek at the two nodes and the files:

node0:

[root@rabbit-node-1 ~]# ls -laR /mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0
/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0:
total 98
drwxr-xr-x 5 blake users 33280 May 30 12:44 .
drwx------ 2 blake users 33280 May 30 12:44 job
drwxr-xr-x 2 blake users 33280 May 30 12:44 job2

/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job:
total 65
drwx------ 2 blake users 33280 May 30 12:44 .
drwxr-xr-x 5 blake users 33280 May 30 12:44 ..

/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job2:
total 90
drwxr-xr-x 2 blake users 33280 May 30 12:44 .
drwxr-xr-x 5 blake users 33280 May 30 12:44 ..
-rw-r--r-- 1 blake users    12 May 30 12:44 data2.out
-rw-r--r-- 1 blake users    12 May 30 12:44 data.out

node1:

[root@rabbit-node-2 ~]# ls -laR /mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0
/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0:
total 294
drwxr-xr-x    5 blake users  33280 May 30 12:44 .
drwxr-xr-x 2772 root  root  159744 May 30 12:43 ..
drwx------    2 blake users  33280 May 30 12:44 job
drwxr-xr-x    2 blake users  33280 May 30 12:44 job2

/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job:
total 65
drwx------ 2 blake users 33280 May 30 12:44 .
drwxr-xr-x 5 blake users 33280 May 30 12:44 ..

/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job2:
total 90
drwxr-xr-x 2 blake users 33280 May 30 12:44 .
drwxr-xr-x 5 blake users 33280 May 30 12:44 ..
-rw-r--r-- 1 blake users    12 May 30 12:44 data2.out
-rw-r--r-- 1 blake users    12 May 30 12:44 data.out

You can see that the job/ directory does get created in the end result.

However, in this run, I'm seeing an actual error message for a failed mkdir(). Here is the output:

...
[2024-05-30T17:44:03] Debug level set to: debug
...
[2024-05-30T17:44:03] Walking /lus/global/blake/dm-system-test/src
[2024-05-30T17:44:03] Walked 9 items in 0.024 secs (373.053 items/sec) ...
[2024-05-30T17:44:03] Walked 9 items in 0.025 seconds (354.708 items/sec)
[2024-05-30T17:44:03] Copying to /mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0
[2024-05-30T17:44:03] Items: 9
[2024-05-30T17:44:03]   Directories: 4
[2024-05-30T17:44:03]   Files: 5
[2024-05-30T17:44:03]   Links: 0
[2024-05-30T17:44:03] Data: 60.000 B (12.000 B per file)
[2024-05-30T17:44:03] Creating 4 directories
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0'
[2024-05-30T17:44:03] Original directory exists, skip the creation: `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0' (errno=17 File exists)
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job2'
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job'
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2'
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1010] ERROR: Create `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2' mkdir() failed (errno=2 No such file or directory)
[2024-05-30T17:44:03] Creating 5 files.
[2024-05-30T17:44:03] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data2.out' mknod() failed (errno=2 No such file or directory)
[2024-05-30T17:44:03] [15] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data.out' mknod() failed (errno=2 No such file or directory)
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2/data3.out' mknod() failed (errno=2 No such file or directory)
[2024-05-30T17:44:03] Copying data.
[2024-05-30T17:44:03] [2] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data2.out' (errno=2 No such file or directory)
[2024-05-30T17:44:03] [4] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data.out' (errno=2 No such file or directory)
[2024-05-30T17:44:03] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2/data3.out' (errno=2 No such file or di
rectory)
[2024-05-30T17:44:03] [15] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/data.out' to `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data.out'
[2024-05-30T17:44:03] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/data2.out' to `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data2.out'
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/job2/data3.out' to `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2/data3.out'
[2024-05-30T17:44:03] Copy data: 60.000 B (60 bytes)
[2024-05-30T17:44:03] Copy rate: 2.696 KiB/s (60 bytes in 0.022 seconds)
[2024-05-30T17:44:03] Syncing data to disk.
[2024-05-30T17:44:03] Sync completed in 0.037 seconds.
[2024-05-30T17:44:03] Fixing permissions.
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2/data3.out' chmod() (errno=2 No such file or directory)
[2024-05-30T17:44:03] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data2.out' chmod() (errno=2 No such file or directory)
[2024-05-30T17:44:03] [15] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data.out' chmod() (errno=2 No such file or directory)
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2' chmod() (errno=2 No such file or directory)
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job' chmod() (errno=2 No such file or directory)
[2024-05-30T17:44:03] Updated 9 items in 0.009 seconds (1052.615 items/sec)
[2024-05-30T17:44:03] Syncing directory updates to disk.
[2024-05-30T17:44:03] Sync completed in 0.001 seconds.
[2024-05-30T17:44:03] Started: May-30-2024,17:44:03
[2024-05-30T17:44:03] Completed: May-30-2024,17:44:03
[2024-05-30T17:44:03] Seconds: 0.121
[2024-05-30T17:44:03] Items: 5
[2024-05-30T17:44:03]   Directories: 3
[2024-05-30T17:44:03]   Files: 2
[2024-05-30T17:44:03]   Links: 0
[2024-05-30T17:44:03] Data: 24.000 B (24 bytes)
[2024-05-30T17:44:03] Rate: 198.718 B/s (024 bytes in 0.121 seconds)
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/dcp/dcp.c:592] ERROR: One or more errors were detected while copying: MFU_ERR(-1101)

So even though job/ does show up in the end result, any sort of action in that directory fails. dcp fails to make job/job2/, job/data.out, and job/data2.out.

adilger commented 1 month ago

What version of Lustre is in use on the servers? There was a race in the server request processing code that resulted in their-created files/directories having permission 0000, and was fixed in https://jira.whamcloud.com/browse/LU-16056 and backported to 2.15.2 servers.

behlendorf commented 1 month ago

@adilger we're seeing this with 2.15.4_6.llnl which does include the fix for LU-16056. For what it's worth we didn't notice this issue with 2.15.3_5.llnl but haven't yet tried rolling back to the servers. We do have a small patch stack applied but we had most of those changes applied to both Lustre versions and nothing in there really jumps out at me. It's mostly kfilnd/lnet fixes.

adilger commented 1 month ago

@behlendorf is this reproducible enough that it could be bisected between 2.15.3 and 2.15.4?

ofaaland commented 1 month ago

@bdevcich can you modify your build of mpifileutils so dcp does a stat() and reports ownership + permissions on those directories after the mkdir returns?

bdevcich commented 1 month ago

@bdevcich can you modify your build of mpifileutils so dcp does a stat() and reports ownership + permissions on those directories after the mkdir returns?

I'm back from vacation and I will start on this.

bdevcich commented 1 month ago

Here is the output with some added debugging lines from the stat():

   [2024-06-05T15:00:57] Walking /lus/global/blake/dm-system-test/src
   [2024-06-05T15:00:57] Walked 9 items in 0.013 secs (702.758 items/sec) ...
   [2024-06-05T15:00:57] Walked 9 items in 0.014 seconds (660.454 items/sec)
   [2024-06-05T15:00:57] Copying to /mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0
   [2024-06-05T15:00:57] Items: 9
   [2024-06-05T15:00:57]   Directories: 4
   [2024-06-05T15:00:57]   Files: 5
   [2024-06-05T15:00:57]   Links: 0
   [2024-06-05T15:00:57] Data: 60.000 B (12.000 B per file)
   [2024-06-05T15:00:57] Creating 4 directories
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1016] Creating directory `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0'
   [2024-06-05T15:00:57] Original directory exists, skip the creation: `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0' (errno=17 File exists)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1040] TEST: Running stat on '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1044] TEST: statbuf: '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1045]    mode: 40755
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:951]    Permissions (octal): 755
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:958]    Non-permission bits (octal): 40000
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1047]    UID: 1060
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1048]    GID: 100
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1049]    numLinks: 3
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1016] Creating directory `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job2'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1040] TEST: Running stat on '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job2'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1044] TEST: statbuf: '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job2'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1045]    mode: 40700
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:951]    Permissions (octal): 700
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:958]    Non-permission bits (octal): 40000
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1047]    UID: 1060
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1048]    GID: 100
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1049]    numLinks: 2
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1016] Creating directory `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1040] TEST: Running stat on '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1042] ERROR: TEST: could not stat '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job' (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1016] Creating directory `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1028] ERROR: Create `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2' mkdir() failed (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1040] TEST: Running stat on '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1042] ERROR: TEST: could not stat '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2' (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1054] ERROR: TEST: mkdir failed, done
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1153] ERROR: TEST: Error creating directory, rc: -1, destpath: �3�V
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2537] ERROR: Error creating directories, rc: -1
   [2024-06-05T15:00:57] Creating 5 files.
   [2024-06-05T15:00:57] [2] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1296] ERROR: File `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data.out' mknod() failed (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [5] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1296] ERROR: File `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data2.out' mknod() failed (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1296] ERROR: File `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2/data3.out' mknod() failed (errno=2 No such file or directory)
   [2024-06-05T15:00:57] Copying data.
   [2024-06-05T15:00:57] [4] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2124] ERROR: Failed to open output file `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data2.out' (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2124] ERROR: Failed to open output file `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2/data3.out' (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [2] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2124] ERROR: Failed to open output file `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data.out' (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2279] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/job2/data3.out' to `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2/data3.out'
   [2024-06-05T15:00:57] [2] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2279] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/data.out' to `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data.out'
   [2024-06-05T15:00:57] [5] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2279] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/data2.out' to `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data2.out'
   [2024-06-05T15:00:57] Copy data: 60.000 B (60 bytes)
   [2024-06-05T15:00:57] Copy rate: 3.742 KiB/s (60 bytes in 0.016 seconds)
   [2024-06-05T15:00:57] Syncing data to disk.
   [2024-06-05T15:00:57] Sync completed in 0.036 seconds.
   [2024-06-05T15:00:57] Fixing permissions.
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2/data3.out' chmod() (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [5] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data2.out' chmod() (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [2] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data.out' chmod() (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2' chmod() (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job' chmod() (errno=2 No such file or directory)
   [2024-06-05T15:00:57] Updated 9 items in 0.003 seconds (3391.414 items/sec)
   [2024-06-05T15:00:57] Syncing directory updates to disk.
   [2024-06-05T15:00:57] Sync completed in 0.001 seconds.
   [2024-06-05T15:00:57] Started: Jun-05-2024,15:00:57
   [2024-06-05T15:00:57] Completed: Jun-05-2024,15:00:57
   [2024-06-05T15:00:57] Seconds: 0.092
   [2024-06-05T15:00:57] Items: 5
   [2024-06-05T15:00:57]   Directories: 3
   [2024-06-05T15:00:57]   Files: 2
   [2024-06-05T15:00:57]   Links: 0
   [2024-06-05T15:00:57] Data: 24.000 B (24 bytes)
   [2024-06-05T15:00:57] Rate: 261.025 B/s (024 bytes in 0.092 seconds)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/dcp/dcp.c:592] ERROR: One or more errors were detected while copying: MFU_ERR(-1101)
ofaaland commented 1 month ago

@bdevcich I see the stat() of the job directory returned ENOENT

   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1016] Creating directory `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1040] TEST: Running stat on '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1042] ERROR: TEST: could not stat '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job' (errno=2 No such file or directory)

It's concerning that mkdir would fail. @behlendorf tells me you're able to reproduce this on elcap, so I'll try myself once the nodes are back up post firmware update.

bdevcich commented 1 month ago

What version of Lustre is in use on the servers? There was a race in the server request processing code that resulted in their-created files/directories having permission 0000, and was fixed in https://jira.whamcloud.com/browse/LU-16056 and backported to 2.15.2 servers.

On our internal system where I see this, we're using lustre-2.15.4_4.llnl-1.t4.x86_64

bdevcich commented 1 month ago

As I mentioned in the rabbit issue, if I put a 30s pause after the mount of ephemeral lustre, I cannot reproduce this issue.

I've removed the pause and did my best to capture the lustre logs from the two rabbit nodes when this occurs. I did a lctl dk after the issue occurred and I did my best to try to do a lctl clear after the successful runs to try to limit the size of the logs.

mknod-issue.rabbit-node-1.log mknod-issue.rabbit-node-2.log