Closed: ericch1 closed this issue 2 years ago
Thanks for flagging this. It does seem like I'm at fault here... Can you tell me more about the platform? I'm bothered that my integration testing did not catch this.
I am on SUSE Leap 15.3.
The MPI_File_open call is made on a file located on an encrypted partition of a local disk (no NFS here).
I have 428 failures out of 1555 tests, so some tests are able to open their files while others are not...
I am investigating two tests by hand (call them A and B):
A)
i. If I launch with mpiexec -n 1, the test is OK!
ii. I launched with mpiexec -n 2 valgrind ... but discovered only a few leaks, nothing harmful.
B) If I launch with mpiexec -n 1, the test is still faulty but goes further: it fails on an "output" file instead of an input file... and the error code is different (is that normal?): 1007217952 instead of 1006693664.
I passed MPI_INFO_NULL instead of the usual MPI_Info, but it changed nothing...
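MPI error codes are implementation-defined integers, so I would not compare the raw values directly; MPI_Error_class and MPI_Error_string give the portable class and the message. A minimal sketch (the helper is mine, not from our test suite):

#include <mpi.h>
#include <stdio.h>

/* Hypothetical helper: decode a raw MPI error code such as 1007217952. */
static void report_mpi_error(int err)
{
    char msg[MPI_MAX_ERROR_STRING];
    int len, errclass;
    MPI_Error_class(err, &errclass);      /* portable error class */
    MPI_Error_string(err, msg, &len);     /* human-readable message */
    fprintf(stderr, "class %d: %s\n", errclass, msg);
}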
Since I have compiled MPICH in debug mode, I can follow the execution trace; maybe you can point me to where I should place a breakpoint?
In my modifications I made a romio_statfs routine to hide the many different ways operating systems do stat. I would look at that routine -- the output value file_id should be a file system "magic value" like the ones enumerated in https://github.com/pmodels/mpich/blob/main/src/mpi/romio/adio/common/ad_fstype.c#L73 or UNKNOWN_SUPER_MAGIC, which tells ROMIO "I could not find anything specific, so I will treat this like a generic POSIX file system".
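A rough sketch of the idea on Linux (assuming statfs(2) is available; the real romio_statfs also covers the statvfs/fstatfs variants):

#include <sys/vfs.h>
#include <stdint.h>

/* Illustrative only: the "magic value" is just the f_type field reported by
 * the kernel for the file system holding 'path'. */
static int magic_of(const char *path, int64_t *file_id)
{
    struct statfs s;
    if (statfs(path, &s) != 0)
        return -1;            /* caller would fall back to UNKNOWN_SUPER_MAGIC */
    *file_id = s.f_type;      /* e.g. 0x6969 for NFS, 0x58465342 ("XFSB") for XFS */
    return 0;
}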
ADIO_FileSysType_fncall calls romio_statfs on either the file or the file's parent directory. Its role is to turn the file system magic value (or UNKNOWN_SUPER_MAGIC) into the appropriate ROMIO file system identifier (now that I think about it, there's no need for both file system magic values and ROMIO file system identifiers... inelegant, but not defective, I think?).
Once we have that identifier, we search the 'fstypes' table to map identifiers to a table of function pointers. The mapping is at https://github.com/pmodels/mpich/blob/main/src/mpi/romio/adio/common/ad_fstype.c#L160 and searching the map is done in e.g. https://github.com/pmodels/mpich/blob/main/src/mpi/romio/adio/common/ad_fstype.c#L642
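Schematically, the two-step lookup looks like this (illustrative names and values, not the actual ad_fstype.c code):

#include <stdint.h>

/* Step 1: magic value -> ROMIO file system identifier (illustrative). */
enum romio_fstype { FSTYPE_UFS, FSTYPE_NFS /* , ... */ };

/* Step 2: identifier -> table of I/O function pointers (illustrative). */
struct io_ops { int (*open)(const char *); /* ... read, write, etc. */ };
static struct io_ops ufs_ops = { 0 }, nfs_ops = { 0 };

static enum romio_fstype classify(int64_t magic)
{
    switch (magic) {
        case 0x6969: return FSTYPE_NFS;   /* NFS_SUPER_MAGIC */
        default:     return FSTYPE_UFS;   /* UNKNOWN_SUPER_MAGIC et al.: generic POSIX */
    }
}

static struct io_ops *ops_for(enum romio_fstype t)
{
    return (t == FSTYPE_NFS) ? &nfs_ops : &ufs_ops;
}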
Now that I type all that out... phew what a mess.
It sounds from your description (fails on an output file) that I need to look more closely at the "check the parent dir if the given file is not found" logic (https://github.com/pmodels/mpich/blob/main/src/mpi/romio/adio/common/ad_fstype.c#L401)
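That fallback is roughly the following (an illustrative sketch, not the actual ROMIO code):

#include <sys/vfs.h>
#include <errno.h>
#include <string.h>
#include <libgen.h>
#include <stdint.h>

/* If the file does not exist yet (a typical output file), classify the
 * directory that will contain it instead. */
static int magic_for_path(const char *path, int64_t *file_id)
{
    struct statfs s;
    if (statfs(path, &s) == 0) { *file_id = s.f_type; return 0; }
    if (errno != ENOENT) return -1;
    char buf[4096];
    strncpy(buf, path, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    if (statfs(dirname(buf), &s) != 0) return -1;   /* classify the parent dir */
    *file_id = s.f_type;
    return 0;
}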
Hmm, speaking of "parent dir": that reminds me that a long time ago we hit an issue with MPI_File_open and very long paths... since then we always call chdir into the file's directory before any call to MPI_File_open, and afterwards we chdir back to the previous working directory (the ticket was from 2014: https://github.com/pmodels/mpich/issues/2212)...
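Roughly what we do (a trimmed sketch; the helper name is invented and error handling is minimal):

#include <mpi.h>
#include <unistd.h>
#include <limits.h>
#include <libgen.h>
#include <string.h>

/* Work around very long paths: chdir into the file's directory, open by
 * basename, then restore the previous working directory. */
static int open_with_chdir(MPI_Comm comm, const char *path, int amode,
                           MPI_Info info, MPI_File *fh)
{
    char cwd[PATH_MAX], dirbuf[PATH_MAX], basebuf[PATH_MAX];
    if (!getcwd(cwd, sizeof cwd)) return MPI_ERR_OTHER;
    strncpy(dirbuf, path, sizeof dirbuf - 1);   dirbuf[sizeof dirbuf - 1] = '\0';
    strncpy(basebuf, path, sizeof basebuf - 1); basebuf[sizeof basebuf - 1] = '\0';
    if (chdir(dirname(dirbuf)) != 0) return MPI_ERR_OTHER;
    int err = MPI_File_open(comm, basename(basebuf), amode, info, fh);
    if (chdir(cwd) != 0) { /* best effort to restore the old cwd */ }
    return err;
}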
OK, I just tried a very simple example on my system and it fails right at the MPI_File_open call with the compiled mpich/master...
See the source code fo.c (attached as fo.txt): fo.txt
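For readers without the attachment, a minimal reproducer along these lines (reconstructed from the output below, not the actual fo.c):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Init(&argc, &argv);
    /* File error handlers default to MPI_ERRORS_RETURN, so the error code
     * comes back to us instead of aborting. */
    int err = MPI_File_open(MPI_COMM_WORLD, "temp",
                            MPI_MODE_CREATE | MPI_MODE_WRONLY,
                            MPI_INFO_NULL, &fh);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        printf("Unable to open file \"temp\"\n");
        MPI_Error_string(err, msg, &len);
        printf("MPI_Error_string: %s\n", msg);
    } else {
        printf("Success\n");
        MPI_File_close(&fh);
    }
    MPI_Finalize();
    return 0;
}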
If I source mpich/master compiled/configured as above:
source /opt/mpich-3.x_debug/mpilibs.sh
(20:59:49) [lorien]:tmp> which mpicc
/opt/mpich-3.x_debug/bin/mpicc
(20:59:55) [lorien]:tmp> mpicc -o fo fo.c
(21:00:03) [lorien]:tmp> ./fo
Unable to open file "temp"
MPI_Error_string: Other I/O error , error stack:
ADIO_RESOLVEFILETYPE(650): Specified filesystem is not available
With the older MPICH 3.3.2:
source /opt/mpich-3.3.2/mpilibs.sh
(21:01:32) [lorien]:tmp> which mpicc
/opt/mpich-3.3.2/bin/mpicc
(21:01:36) [lorien]:tmp> mpicc -o fo fo.c
(21:01:39) [lorien]:tmp> ./fo
Success
That's a wonderfully simple test case. Of course, it prints 'Success' for me and not an error.
Can you run stat -f on this directory? For example, on my laptop I see:
% stat -f .
File: "."
ID: fedc9aa3bd65bc57 Namelen: 255 Type: ext2/ext3
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 65793553 Free: 36657953 Available: 33298414
Inodes: Total: 16777216 Free: 15579348
or on my workstation I see:
% stat -f ${HOME}
File: "/home/robl"
ID: 0 Namelen: 255 Type: nfs
Block size: 32768 Fundamental block size: 32768
Blocks: Total: 1638400 Free: 1556782 Available: 1556782
Inodes: Total: 99727275 Free: 99634022
Both of which seem to handle your case just fine.
Here I have:
./fo
Unable to open file "temp"
MPI_Error_string: Other I/O error , error stack:
ADIO_RESOLVEFILETYPE(650): Specified filesystem is not available
(10:27:31) [lorien]:tmp> stat -f .
File: "."
ID: fe0000000000 Namelen: 255 Type: xfs
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 244070463 Free: 151390632 Available: 151390632
Inodes: Total: 488379392 Free: 486352124
It works on NFS:
./fo
Success
(10:30:44) [lorien]:~> stat -f .
File: "."
ID: 0 Namelen: 255 Type: nfs
Block size: 1048576 Fundamental block size: 1048576
Blocks: Total: 1877657 Free: 343916 Available: 248513
Inodes: Total: 122101760 Free: 86046440
Hah! Thank you, that must be the key. 'xfs' is a special case: a file system that might get XFS-specific optimizations (some work our SGI friends contributed... back when SGI was a thing) but that should normally be treated like a regular POSIX file system.
OK, I can patch this up. Thanks for the information!
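The gist of the fix (a sketch with illustrative names, not the actual diff in the PR below): recognize the XFS magic number and send it down the generic POSIX/UFS path instead of failing with "Specified filesystem is not available".

#include <stdint.h>

#ifndef XFS_SUPER_MAGIC
#define XFS_SUPER_MAGIC 0x58465342   /* "XFSB" */
#endif

enum romio_fstype { FSTYPE_UFS, FSTYPE_NFS /* , ... */ };

static enum romio_fstype classify_with_xfs(int64_t magic)
{
    switch (magic) {
        case 0x6969:          return FSTYPE_NFS;  /* NFS_SUPER_MAGIC */
        case XFS_SUPER_MAGIC: /* fall through: treat XFS like plain POSIX */
        default:              return FSTYPE_UFS;
    }
}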
Give https://github.com/pmodels/mpich/pull/5781 a shot. I'm running it through the integration tests now.
It is now working for the little test I shared with you. :)
I am now running our whole test suite and will come back in a few hours...
Thanks a lot!
@roblatham00 all of our tests are OK now with your fix, thanks a lot! :)
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.22.07h42m08s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.22.07h42m08s_config.system
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.22.07h42m08s_mpich_version.txt
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.22.07h42m08s_c.txt
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.22.07h42m08s_m.txt
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.22.07h42m08s_mi.txt
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.22.07h42m08s_mpl_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.22.07h42m08s_pm_hydra_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.22.07h42m08s_mpiexec_info.txt
Hi,
Our nightly tests were OK on Jan 7 with commit a92ce59da2d429664570779d1e693288bd40e40f, but on Jan 8 we compiled with commit ea9b0c7e61602eecd08b6ca9953077beb173485a and now we have a lot of errors like the one shown above ("ADIO_RESOLVEFILETYPE(650): Specified filesystem is not available"), issued by an MPI_File_open call.
Everything is fine with MPICH 3.2.1 and OpenMPI 3.x, 4.x, and 5.x.
I just saw commit f77d6f71b902410ba99e9c01dd003f5afba7f0c0 from @roblatham00, which may be the cause?
Thanks a lot!
Eric
Here are all the build logs from MPICH and PETSc:
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.08.05h36m02s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.08.05h36m02s_config.system
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.08.05h36m02s_mpich_version.txt
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.08.05h36m02s_c.txt
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.08.05h36m02s_m.txt
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.08.05h36m02s_mi.txt
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.08.05h36m02s_mpl_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.08.05h36m02s_pm_hydra_tools_topo_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.08.05h36m02s_mpiexec_info.txt
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.08.05h36m02s_configure.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.08.05h36m02s_make.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.08.05h36m02s_RDict.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.08.05h36m02s_make_test.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_mpich/2022.01.08.05h36m02s_make_streams.log