microsoft / Microsoft-MPI

Microsoft MPI
MIT License

MPI_Win_allocate_shared throws error when size==0 #41

Open csi-dweiner opened 4 years ago

csi-dweiner commented 4 years ago

Hello! I found a bug using MS-MPI 10.1. MPI_Win_allocate_shared fails whenever size==0. This is explicitly supported according to the documentation:

"The size argument may be different at each process and size = 0 is valid." (https://docs.microsoft.com/en-us/message-passing-interface/mpi-win-allocate-shared-function)

Allocating shm window: size=1 stride=1...OK.
Allocating shm window: size=0 stride=1...
job aborted:
[ranks] message

[0] fatal error
Fatal error in MPI_Win_allocate_shared: Other MPI error, error stack:
MPI_Win_allocate_shared(size=-1015819392, disp_unit=0, info=0x1, comm=0x1c000000, baseptr=0x00007FF844000000, win=0x00000097526FFD00) failed
CreateFileMapping failed, error 87

I think the issue is that under the hood, MS-MPI calls MPID_Win_create_[non]contig, which in turn calls CreateFileMappingW, whose documented behavior is:

"An attempt to map a file with a length of 0 (zero) fails with an error code of ERROR_FILE_INVALID. Applications should test for files with a length of 0 (zero) and reject those files." (https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-createfilemappingw)

Further complicating the investigation, the error message above has a bug of its own: all the parameters are printed in the wrong fields. size displays the value of baseptr, and every other parameter is shifted by one field (disp_unit displays size==0, info displays disp_unit==1, and so on). I found the code that causes this: baseptr should be moved to second-to-last in the parameter list here:

https://github.com/microsoft/Microsoft-MPI/blob/7ff6bdcdb1d5dc7b791e47457ee2686cd6b3d355/src/mpi/msmpi/api/mpi_win.cpp#L494
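To illustrate the kind of mismatch described above, here is a small, simplified sketch in C. It is not the MS-MPI source: `format_trace`, `trace_buggy`, and `trace_fixed` are hypothetical names, only three fields are modeled, and all values are passed as `long long` to keep the example well defined. The point is only that when the format string lists the fields in one order and the call site passes baseptr first, every label ends up next to the wrong value.

```c
#include <stdio.h>

/* Hypothetical stand-in for an error-trace formatter. The format string
 * fixes the field order: size, disp_unit, baseptr. */
static int format_trace(char *out, size_t cap,
                        long long first, long long second, long long third)
{
    return snprintf(out, cap, "size=%lld, disp_unit=%lld, baseptr=%lld",
                    first, second, third);
}

/* Call site with the suspected mistake: baseptr is passed first, so each
 * remaining argument lands one field later than its label. */
static int trace_buggy(char *out, size_t cap,
                       long long size, long long disp_unit, long long baseptr)
{
    return format_trace(out, cap, baseptr, size, disp_unit);
}

/* Call site with the arguments in the same order as the format string. */
static int trace_fixed(char *out, size_t cap,
                       long long size, long long disp_unit, long long baseptr)
{
    return format_trace(out, cap, size, disp_unit, baseptr);
}
```

Calling both with size=0, disp_unit=1, and a made-up baseptr value of 4096, `trace_buggy` produces `size=4096, disp_unit=0, baseptr=1`, while `trace_fixed` produces `size=0, disp_unit=1, baseptr=4096`. The buggy output has the same shape as the log above, where disp_unit shows 0 and info shows 0x1.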

Thank you!

--Dan Weiner, HPC Research Engineer, Convergent Science (convergecfd.com)

csi-dweiner commented 4 years ago

Update: It appears that once a shm window of nonzero size has been created, ranks allocating an "additional" 0 bytes do not cause a problem.
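
A possible interim workaround, based only on the observations in this thread (size=1 allocates, size=0 fails), is to never request zero bytes. The sketch below is a minimal illustration in C; `padded_window_bytes` is a hypothetical helper name, not an MS-MPI or MPI function, and whether padding actually avoids the failure in MS-MPI 10.1 is an assumption that has not been verified here.

```c
#include <stddef.h>

/* Hypothetical helper (not part of MS-MPI or the MPI standard).
 * Returns the number of bytes to request for this rank's segment of a
 * shared window: the requested size if it is nonzero, otherwise one
 * element, so that the request is never zero. */
static size_t padded_window_bytes(size_t requested_bytes, size_t elem_bytes)
{
    if (requested_bytes != 0) {
        return requested_bytes;
    }
    /* Fall back to a single byte if the element size is also zero. */
    return elem_bytes != 0 ? elem_bytes : 1;
}

/* Intended use, assuming <mpi.h> and a shared-memory communicator
 * `node_comm` created elsewhere (both outside this sketch):
 *
 *   MPI_Aint bytes =
 *       (MPI_Aint)padded_window_bytes(n * sizeof(double), sizeof(double));
 *   MPI_Win_allocate_shared(bytes, (int)sizeof(double), MPI_INFO_NULL,
 *                           node_comm, &base, &win);
 *
 * The caller keeps tracking the logical element count `n` separately,
 * because the padding element is not real data.
 */
```

With the default contiguous layout, a padded segment shifts the segments of higher ranks, so neighbors' base addresses are best obtained with MPI_Win_shared_query rather than computed from the logical sizes.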