Closed bdevcich closed 1 month ago
Here's another case with dcp -d dbg
enabled. According to the debug output, mkdir
appears to have worked.
[2024-05-29T17:58:39] Walking /lus/global/blake/dm-system-test/src
[2024-05-29T17:58:39] Walked 9 items in 0.013 secs (706.261 items/sec) ...
[2024-05-29T17:58:39] Walked 9 items in 0.014 seconds (660.435 items/sec)
[2024-05-29T17:58:39] Copying to /mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0
[2024-05-29T17:58:39] Items: 9
[2024-05-29T17:58:39] Directories: 4
[2024-05-29T17:58:39] Files: 5
[2024-05-29T17:58:39] Links: 0
[2024-05-29T17:58:39] Data: 60.000 B (12.000 B per file)
[2024-05-29T17:58:39] Creating 4 directories
[2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0'
[2024-05-29T17:58:39] Original directory exists, skip the creation: `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0' (errno=17 File exists)
[2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job'
[2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2'
[2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job/job2'
[2024-05-29T17:58:39] Creating 5 files.
[2024-05-29T17:58:39] [7] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data2.out' mknod() failed (errno=2 No such file or directory)
[2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data.out' mknod() failed (errno=2 No such file or directory)
[2024-05-29T17:58:39] Copying data.
[2024-05-29T17:58:39] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data.out' (errno=2 No such file or directory)
[2024-05-29T17:58:39] [3] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data2.out' (errno=2 No such file or directory)
[2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job2/data.out' to `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data.out'
[2024-05-29T17:58:39] [7] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job2/data2.out' to `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data2.out'
[2024-05-29T17:58:39] Copy data: 60.000 B (60 bytes)
[2024-05-29T17:58:39] Copy rate: 3.881 KiB/s (60 bytes in 0.015 seconds)
[2024-05-29T17:58:39] Syncing data to disk.
[2024-05-29T17:58:39] Sync completed in 0.037 seconds.
[2024-05-29T17:58:39] Fixing permissions.
[2024-05-29T17:58:39] [7] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data2.out' chmod() (errno=2 No such file or directory)
[2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data.out' chmod() (errno=2 No such file or directory)
[2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2' chmod() (errno=2 No such file or directory)
[2024-05-29T17:58:39] Updated 9 items in 0.004 seconds (2078.435 items/sec)
[2024-05-29T17:58:39] Syncing directory updates to disk.
[2024-05-29T17:58:39] Sync completed in 0.001 seconds.
[2024-05-29T17:58:39] Started: May-29-2024,17:58:39
[2024-05-29T17:58:39] Completed: May-29-2024,17:58:39
[2024-05-29T17:58:39] Seconds: 0.099
[2024-05-29T17:58:39] Items: 7
[2024-05-29T17:58:39] Directories: 4
[2024-05-29T17:58:39] Files: 3
[2024-05-29T17:58:39] Links: 0
[2024-05-29T17:58:39] Data: 36.000 B (36 bytes)
[2024-05-29T17:58:39] Rate: 363.605 B/s (036 bytes in 0.099 seconds)
[2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/dcp/dcp.c:592] ERROR: One or more errors were detected while copying: MFU_ERR(-1101)
I have not seen this when ran with a single host (or when the launcher is on the same host as the 1 worker/host)
Does that mount have some kind of caching or delayed update? I'm more familiar with DAOS than Lustre, but I've seen similar behavior with DAOS and certain cache settings where a rank creates the directory on one host, then a rank on a different host doesn't see that directory immediately
That's an good thought and could explain the raciness. Lustre is fully coherent across nodes but I'm not sure how dcp
itself coordinates this internally. Perhaps it's possible that dcp
could issue the file create on host-B before the mkdir(2)
syscall on host-A completes?
Perhaps it's possible that
dcp
could issue the file create on host-B before themkdir(2)
syscall on host-A completes?
I don't think so, unless there's a bug somewhere. This is the function that creates the directories https://github.com/hpc/mpifileutils/blob/0a4c530bca5dc2476418a3833a15ab812d636e78/src/common/mfu_flist_copy.c#L1042 And notice there is a barrier https://github.com/hpc/mpifileutils/blob/0a4c530bca5dc2476418a3833a15ab812d636e78/src/common/mfu_flist_copy.c#L1123 And that function is called before creating the files https://github.com/hpc/mpifileutils/blob/0a4c530bca5dc2476418a3833a15ab812d636e78/src/common/mfu_flist_copy.c#L2490-L2495
As far as I remember (I'm not the primary author!) dcp doesn't directly handle coherency. It creates all directories upfront, barrier, then creates files, barrier, then copies file data
Does the job2 directory ever get created at all? In the example above, I see that not only did the file copies to job2 fail, but the chmod on job2 itself also failed. That chmod is a third phase (mkdirs, then file copies, then set metadata), so multiple barriers later.
[2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2' chmod() (errno=2 No such file or directory)
This is what I'm referring to WRT to the chmod failing. I see this in both the first and second examples you posted. So, I wonder if the mkdir() failed, but for some reason dcp didn't know it failed.
Can you instrument your test to see what each node sees under the destination root directory after dcp finishes?
So, I wonder if the mkdir() failed, but for some reason dcp didn't know it failed.
Good point. If mkdir fails, it looks like the copy continues anyway https://github.com/hpc/mpifileutils/blob/0a4c530bca5dc2476418a3833a15ab812d636e78/src/common/mfu_flist_copy.c#L2490-L2499
I was able to add a pause in our code after the dcp failure occurs to go take a look at the lustre filesystem on each of the nodes. Obviously this isn't in real time and any sort of caching issue would have time to catch up since I have to do this manually right now, but I was able to take a peek at the two nodes and the files:
node0:
[root@rabbit-node-1 ~]# ls -laR /mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0
/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0:
total 98
drwxr-xr-x 5 blake users 33280 May 30 12:44 .
drwx------ 2 blake users 33280 May 30 12:44 job
drwxr-xr-x 2 blake users 33280 May 30 12:44 job2
/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job:
total 65
drwx------ 2 blake users 33280 May 30 12:44 .
drwxr-xr-x 5 blake users 33280 May 30 12:44 ..
/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job2:
total 90
drwxr-xr-x 2 blake users 33280 May 30 12:44 .
drwxr-xr-x 5 blake users 33280 May 30 12:44 ..
-rw-r--r-- 1 blake users 12 May 30 12:44 data2.out
-rw-r--r-- 1 blake users 12 May 30 12:44 data.out
node1:
[root@rabbit-node-2 ~]# ls -laR /mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0
/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0:
total 294
drwxr-xr-x 5 blake users 33280 May 30 12:44 .
drwxr-xr-x 2772 root root 159744 May 30 12:43 ..
drwx------ 2 blake users 33280 May 30 12:44 job
drwxr-xr-x 2 blake users 33280 May 30 12:44 job2
/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job:
total 65
drwx------ 2 blake users 33280 May 30 12:44 .
drwxr-xr-x 5 blake users 33280 May 30 12:44 ..
/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job2:
total 90
drwxr-xr-x 2 blake users 33280 May 30 12:44 .
drwxr-xr-x 5 blake users 33280 May 30 12:44 ..
-rw-r--r-- 1 blake users 12 May 30 12:44 data2.out
-rw-r--r-- 1 blake users 12 May 30 12:44 data.out
You can see that the job/
directory does get created in the end result.
However, in this run, I'm seeing an actual error message for a failed mkdir(). Here is the output:
...
[2024-05-30T17:44:03] Debug level set to: debug
...
[2024-05-30T17:44:03] Walking /lus/global/blake/dm-system-test/src
[2024-05-30T17:44:03] Walked 9 items in 0.024 secs (373.053 items/sec) ...
[2024-05-30T17:44:03] Walked 9 items in 0.025 seconds (354.708 items/sec)
[2024-05-30T17:44:03] Copying to /mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0
[2024-05-30T17:44:03] Items: 9
[2024-05-30T17:44:03] Directories: 4
[2024-05-30T17:44:03] Files: 5
[2024-05-30T17:44:03] Links: 0
[2024-05-30T17:44:03] Data: 60.000 B (12.000 B per file)
[2024-05-30T17:44:03] Creating 4 directories
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0'
[2024-05-30T17:44:03] Original directory exists, skip the creation: `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0' (errno=17 File exists)
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job2'
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job'
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2'
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1010] ERROR: Create `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2' mkdir() failed (errno=2 No such file or directory)
[2024-05-30T17:44:03] Creating 5 files.
[2024-05-30T17:44:03] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data2.out' mknod() failed (errno=2 No such file or directory)
[2024-05-30T17:44:03] [15] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data.out' mknod() failed (errno=2 No such file or directory)
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2/data3.out' mknod() failed (errno=2 No such file or directory)
[2024-05-30T17:44:03] Copying data.
[2024-05-30T17:44:03] [2] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data2.out' (errno=2 No such file or directory)
[2024-05-30T17:44:03] [4] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data.out' (errno=2 No such file or directory)
[2024-05-30T17:44:03] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2/data3.out' (errno=2 No such file or di
rectory)
[2024-05-30T17:44:03] [15] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/data.out' to `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data.out'
[2024-05-30T17:44:03] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/data2.out' to `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data2.out'
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/job2/data3.out' to `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2/data3.out'
[2024-05-30T17:44:03] Copy data: 60.000 B (60 bytes)
[2024-05-30T17:44:03] Copy rate: 2.696 KiB/s (60 bytes in 0.022 seconds)
[2024-05-30T17:44:03] Syncing data to disk.
[2024-05-30T17:44:03] Sync completed in 0.037 seconds.
[2024-05-30T17:44:03] Fixing permissions.
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2/data3.out' chmod() (errno=2 No such file or directory)
[2024-05-30T17:44:03] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data2.out' chmod() (errno=2 No such file or directory)
[2024-05-30T17:44:03] [15] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data.out' chmod() (errno=2 No such file or directory)
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2' chmod() (errno=2 No such file or directory)
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job' chmod() (errno=2 No such file or directory)
[2024-05-30T17:44:03] Updated 9 items in 0.009 seconds (1052.615 items/sec)
[2024-05-30T17:44:03] Syncing directory updates to disk.
[2024-05-30T17:44:03] Sync completed in 0.001 seconds.
[2024-05-30T17:44:03] Started: May-30-2024,17:44:03
[2024-05-30T17:44:03] Completed: May-30-2024,17:44:03
[2024-05-30T17:44:03] Seconds: 0.121
[2024-05-30T17:44:03] Items: 5
[2024-05-30T17:44:03] Directories: 3
[2024-05-30T17:44:03] Files: 2
[2024-05-30T17:44:03] Links: 0
[2024-05-30T17:44:03] Data: 24.000 B (24 bytes)
[2024-05-30T17:44:03] Rate: 198.718 B/s (024 bytes in 0.121 seconds)
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/dcp/dcp.c:592] ERROR: One or more errors were detected while copying: MFU_ERR(-1101)
So even though job/
does show up in the end result, any sort of action in that directory fails. dcp
fails to make job/job2/
, job/data.out
, and job/data2.out
.
What version of Lustre is in use on the servers? There was a race in the server request processing code that resulted in their-created files/directories having permission 0000, and was fixed in https://jira.whamcloud.com/browse/LU-16056 and backported to 2.15.2 servers.
@adilger we're seeing this with 2.15.4_6.llnl which does include the fix for LU-16056. For what it's worth we didn't notice this issue with 2.15.3_5.llnl but haven't yet tried rolling back to the servers. We do have a small patch stack applied but we had most of those changes applied to both Lustre versions and nothing in there really jumps out at me. It's mostly kfilnd/lnet fixes.
@behlendorf is this reproducible enough that it could be bisected between 2.15.3 and 2.15.4?
@bdevcich can you modify your build of mpifileutils so dcp
does a stat() and reports ownership + permissions on those directories after the mkdir returns?
@bdevcich can you modify your build of mpifileutils so
dcp
does a stat() and reports ownership + permissions on those directories after the mkdir returns?
I'm back from vacation and I will start on this.
Here is the output with some added debugging lines from the stat()
:
[2024-06-05T15:00:57] Walking /lus/global/blake/dm-system-test/src
[2024-06-05T15:00:57] Walked 9 items in 0.013 secs (702.758 items/sec) ...
[2024-06-05T15:00:57] Walked 9 items in 0.014 seconds (660.454 items/sec)
[2024-06-05T15:00:57] Copying to /mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0
[2024-06-05T15:00:57] Items: 9
[2024-06-05T15:00:57] Directories: 4
[2024-06-05T15:00:57] Files: 5
[2024-06-05T15:00:57] Links: 0
[2024-06-05T15:00:57] Data: 60.000 B (12.000 B per file)
[2024-06-05T15:00:57] Creating 4 directories
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1016] Creating directory `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0'
[2024-06-05T15:00:57] Original directory exists, skip the creation: `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0' (errno=17 File exists)
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1040] TEST: Running stat on '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0'
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1044] TEST: statbuf: '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0'
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1045] mode: 40755
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:951] Permissions (octal): 755
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:958] Non-permission bits (octal): 40000
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1047] UID: 1060
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1048] GID: 100
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1049] numLinks: 3
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1016] Creating directory `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job2'
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1040] TEST: Running stat on '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job2'
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1044] TEST: statbuf: '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job2'
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1045] mode: 40700
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:951] Permissions (octal): 700
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:958] Non-permission bits (octal): 40000
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1047] UID: 1060
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1048] GID: 100
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1049] numLinks: 2
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1016] Creating directory `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job'
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1040] TEST: Running stat on '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job'
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1042] ERROR: TEST: could not stat '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job' (errno=2 No such file or directory)
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1016] Creating directory `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2'
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1028] ERROR: Create `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2' mkdir() failed (errno=2 No such file or directory)
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1040] TEST: Running stat on '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2'
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1042] ERROR: TEST: could not stat '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2' (errno=2 No such file or directory)
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1054] ERROR: TEST: mkdir failed, done
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1153] ERROR: TEST: Error creating directory, rc: -1, destpath: �3�V
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2537] ERROR: Error creating directories, rc: -1
[2024-06-05T15:00:57] Creating 5 files.
[2024-06-05T15:00:57] [2] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1296] ERROR: File `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data.out' mknod() failed (errno=2 No such file or directory)
[2024-06-05T15:00:57] [5] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1296] ERROR: File `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data2.out' mknod() failed (errno=2 No such file or directory)
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1296] ERROR: File `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2/data3.out' mknod() failed (errno=2 No such file or directory)
[2024-06-05T15:00:57] Copying data.
[2024-06-05T15:00:57] [4] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2124] ERROR: Failed to open output file `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data2.out' (errno=2 No such file or directory)
[2024-06-05T15:00:57] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2124] ERROR: Failed to open output file `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2/data3.out' (errno=2 No such file or directory)
[2024-06-05T15:00:57] [2] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2124] ERROR: Failed to open output file `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data.out' (errno=2 No such file or directory)
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2279] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/job2/data3.out' to `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2/data3.out'
[2024-06-05T15:00:57] [2] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2279] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/data.out' to `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data.out'
[2024-06-05T15:00:57] [5] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2279] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/data2.out' to `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data2.out'
[2024-06-05T15:00:57] Copy data: 60.000 B (60 bytes)
[2024-06-05T15:00:57] Copy rate: 3.742 KiB/s (60 bytes in 0.016 seconds)
[2024-06-05T15:00:57] Syncing data to disk.
[2024-06-05T15:00:57] Sync completed in 0.036 seconds.
[2024-06-05T15:00:57] Fixing permissions.
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2/data3.out' chmod() (errno=2 No such file or directory)
[2024-06-05T15:00:57] [5] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data2.out' chmod() (errno=2 No such file or directory)
[2024-06-05T15:00:57] [2] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data.out' chmod() (errno=2 No such file or directory)
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2' chmod() (errno=2 No such file or directory)
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job' chmod() (errno=2 No such file or directory)
[2024-06-05T15:00:57] Updated 9 items in 0.003 seconds (3391.414 items/sec)
[2024-06-05T15:00:57] Syncing directory updates to disk.
[2024-06-05T15:00:57] Sync completed in 0.001 seconds.
[2024-06-05T15:00:57] Started: Jun-05-2024,15:00:57
[2024-06-05T15:00:57] Completed: Jun-05-2024,15:00:57
[2024-06-05T15:00:57] Seconds: 0.092
[2024-06-05T15:00:57] Items: 5
[2024-06-05T15:00:57] Directories: 3
[2024-06-05T15:00:57] Files: 2
[2024-06-05T15:00:57] Links: 0
[2024-06-05T15:00:57] Data: 24.000 B (24 bytes)
[2024-06-05T15:00:57] Rate: 261.025 B/s (024 bytes in 0.092 seconds)
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/dcp/dcp.c:592] ERROR: One or more errors were detected while copying: MFU_ERR(-1101)
@bdevcich I see the stat() of the job directory returned ENOENT
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1016] Creating directory `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job'
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1040] TEST: Running stat on '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job'
[2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1042] ERROR: TEST: could not stat '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job' (errno=2 No such file or directory)
It's concerning that mkdir would fail. @behlendorf tells me you're able to reproduce this on elcap, so I'll try myself once the nodes are back up post firmware update.
What version of Lustre is in use on the servers? There was a race in the server request processing code that resulted in their-created files/directories having permission 0000, and was fixed in https://jira.whamcloud.com/browse/LU-16056 and backported to 2.15.2 servers.
On our internal system where I see this, we're using lustre-2.15.4_4.llnl-1.t4.x86_64
As I mentioned in the rabbit issue, if I put a 30s pause after the mount of ephemeral lustre, I cannot reproduce this issue.
I've removed the pause and did my best to capture the lustre logs from the two rabbit nodes when this occurs. I did a lctl dk
after the issue occurred and I did my best to try to do a lctl clear
after the successful runs to try to limit the size of the logs.
Coming back to this, I have not been able to reproduce this on systems with:
However, I can reliably reproduce this on HPE systems that are still running:
To me, this indicates a lustre issue. I think this can be closed.
Hello,
I have been seeing this issue when using
dcp
to recursive copy a directory from one lustre filesystem to another. This is being ran viampirun
and over multiple hosts (e.g.mpirun -H host1,host2
). I have not seen this when ran with a single host (or when the launcher is on the same host as the 1 worker/host).The error in the
dcp
output suggestsno such file or directory
, which seems like the parent directory (e.g./mnt/nnf/.../job2/
) is not being created first. I originally thought theOriginal directory exists, skip the creation
output might suggest that none of the directories are being created, but that output shows up the case where this issue does not surface.I don't have a way to reproduce this directly, but if I run our tests enough times this will be hit.
Each test is copying the same source directory to essentially the same location on an ephemeral lustre file system. Here is the source directory and its contents that
dcp
is attempting to copy:Here is the error output from
dcp
:The
system call failed during shared memory initialization
is something that I see when the copy is successful so I don't think it contributes to the problem here.Is there anything I can enable to trace this further?