open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

MPI_File_write* bug due to uncaught write(v) calls for iov_len > 2,147,479,552 bytes #2399

Open cniethammer opened 7 years ago

cniethammer commented 7 years ago

There is a bug with MPI_File_write* operations in Open MPI that truncates files or writes incorrect data to files. I traced the problem to the POSIX writev call used in ompi/mca/fbtl/posix/fbtl_posix_pwritev.c, which will not write more than 2,147,479,552 bytes to disk. From man 2 write NOTES:

On Linux, write() (and similar system calls) will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.)

We do not currently catch the case when the iov contains elements with iov_len > 2,147,479,552 bytes, which causes these issues with large I/O blocks and files.

My suggestions: either (1) check that each write completes with the correct number of written bytes and, if it does not, issue another write for the remainder (a sketch of this is shown below), or (2) modify the convertor used in mca_io_ompio_build_io_array/ompi_io_ompio_decode_datatype to build an iov whose elements are no larger than 2,147,479,552 bytes when using POSIX I/O on Linux systems.
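To illustrate suggestion (1), here is a minimal sketch of a loop that keeps calling pwritev until the whole request has been transferred. This is not the actual fbtl_posix code; the helper name full_pwritev is made up for this example:

```c
#define _DEFAULT_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <errno.h>

/* Hypothetical helper, not the OMPI fbtl code: loop until the whole iovec
 * has been written, since Linux caps a single write()/writev()/pwritev()
 * call at 0x7ffff000 (2,147,479,552) bytes and may return a short count. */
ssize_t full_pwritev(int fd, struct iovec *iov, int iovcnt, off_t offset)
{
    ssize_t total = 0;

    while (iovcnt > 0) {
        ssize_t n = pwritev(fd, iov, iovcnt, offset);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before any data was written */
            return -1;              /* real error */
        }
        total  += n;
        offset += n;

        /* Drop the iovec entries that were written completely ... */
        while (iovcnt > 0 && (size_t)n >= iov->iov_len) {
            n -= (ssize_t)iov->iov_len;
            iov++;
            iovcnt--;
        }
        /* ... and advance into the entry that was written partially. */
        if (iovcnt > 0) {
            iov->iov_base = (char *)iov->iov_base + n;
            iov->iov_len -= (size_t)n;
        }
    }
    return total;
}
```

The inner loop skips the entries that were written completely and advances into the partially written one, so a request larger than 0x7ffff000 bytes is simply completed by additional pwritev calls instead of being silently truncated.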

edgargabriel commented 7 years ago

Christoph,

thank you for your bug report and analysis. I think the second solution is already implemented, at least for blocking operations; it is just not active by default. Could you try to rerun your test case and set the mca parameter

mca_io_ompio_cycle_buffer_size

to something (e.g. 1 GB or similar; note that the value has to be given in bytes), and see whether this fixes the problem? If this solution works, we might just have to change the default value for this mca parameter.
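As a usage note, the parameter can be passed on the mpirun command line, e.g. in the form -mca mca_io_ompio_cycle_buffer_size 1073741824 for a 1 GiB cycle buffer; this is the invocation confirmed later in this thread (the application name after it is whatever you normally run).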

This would not solve the issue for non-blocking operations, however. We already have the logic for splitting the write sequences into multiple chunks for the aio_write function calls in the posix fbtl, just in a slightly different context. I will look into that part as well.
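As a rough sketch of that kind of splitting (the helper name cap_iov is made up for this example and this is not the OMPI fbtl code), one could rebuild the iovec so that no single element exceeds the Linux per-call limit; note that the limit applies to the total transferred per call, so short-write handling is still needed on top of this:

```c
#include <sys/uio.h>
#include <stdlib.h>
#include <stddef.h>

#define IOV_ELEM_CAP ((size_t)0x7ffff000)   /* 2,147,479,552 bytes */

/* Hypothetical helper: build a new iovec array (caller frees it) in which no
 * element is larger than IOV_ELEM_CAP; the element count is stored in *out_cnt. */
struct iovec *cap_iov(const struct iovec *in, int in_cnt, int *out_cnt)
{
    /* Count how many capped elements will be needed. */
    int cnt = 0;
    for (int i = 0; i < in_cnt; i++)
        cnt += (int)((in[i].iov_len + IOV_ELEM_CAP - 1) / IOV_ELEM_CAP);

    struct iovec *out = malloc((size_t)cnt * sizeof(*out));
    if (out == NULL)
        return NULL;

    int k = 0;
    for (int i = 0; i < in_cnt; i++) {
        char  *base      = in[i].iov_base;
        size_t remaining = in[i].iov_len;
        while (remaining > 0) {
            size_t len = remaining > IOV_ELEM_CAP ? IOV_ELEM_CAP : remaining;
            out[k].iov_base = base;
            out[k].iov_len  = len;
            base      += len;
            remaining -= len;
            k++;
        }
    }
    *out_cnt = k;
    return out;
}
```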

Also, as a side note, I am fairly confident that the problem should not arise for collective write operations, since they are broken down into cycles of e.g. 32 MB anyway.

Thanks

Edgar


cniethammer commented 7 years ago

Hello Edgar,

Thanks for pointing out this option. I can confirm that setting -mca mca_io_ompio_cycle_buffer_size 1073741824 results in writing the correct amount of data to disk. I would welcome setting a default which is safe for the user, e.g. the Linux write limit, even if that means accepting a possible performance loss, which is obviously better than losing data at this point.

Best Christoph

edgargabriel commented 7 years ago

Christoph, I am open to any reasonable value; can you make a suggestion? Thanks, Edgar

jsquyres commented 7 years ago

@edgargabriel @cniethammer If you guys want this in v2.0.2, please come to a decision ASAP (i.e., today/tomorrow). Otherwise we're going to have to push this to v2.0.3.

edgargabriel commented 7 years ago

This is the bug fix that was merged a couple of days back with the default value for the parameter. So it is done for now; I will work on the second solution that we discussed over the break.

jsquyres commented 7 years ago

Ah! Ok, my bad -- I should have read closer / remembered better. 😊

jsquyres commented 7 years ago

@edgargabriel @cniethammer It's been about 2 months since the last update -- any progress? The door for v2.1.0 is just about closed. I'm going to pre-emptively move this to v2.1.1; feel free to move to v3.0.0 if you think that is more realistic.

edgargabriel commented 7 years ago

Feel free to move it to 3.0; I will not get to it. The fix that we committed for the 2.0 series (and 2.1, but I will double-check that) should do the trick for now.

jsquyres commented 7 years ago

@edgargabriel Done; thanks.

hppritcha commented 7 years ago

@edgargabriel should we just move this to future or do you think you can get a fix in to 3.0.1?

edgargabriel commented 7 years ago

@hppritcha it is on my to-do list for this summer, but I don't think it will make it for 3.0.1; I hope to have it ready for 3.1 (or 4.0, whatever the next release is).