open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org
Other
2.16k stars 859 forks source link

romio meaning of fp_ind for location within a file view #4507

Open markalle opened 6 years ago

markalle commented 6 years ago

When romio flattens datatypes it sometimes has 0-byte chunks in the output. For example

    MPI_Aint lb, extent;
    MPI_Datatype newtype;
    lb = 4;
    extent = 12;
    MPI_Type_create_resized(MPI_INT, lb, extent, &newtype);
    MPI_Type_commit(&newtype);

    flattened offsets = { 4, 0, 16 }
    flattened byte counts = { 0, 4, 0 }
    Here the romio flattened representation is including two 0-byte chunks for LB and UB

In general the problem I'll describe below is the code that updates and interprets the fd->fp_ind location pointer isn't equipped to understand those 0-byte chunks in romio's flattened datatype representation. I'd like to remove those from the flattened datatype, but I don't have a complete enough understanding of romio to know if they're needed elsewhere.

The example code path I'll walk through here to demonstrated the bug is ad_write_str_naive.c in ADIOI_GEN_WriteStrided_naive() for the contig-in-buftype and not-contig-in-filetype path, but I think similar code exists quite a few places.

Romio maintains a "current file pointer" fd->fp_ind which tries to indicate the next location for data to be written. When using a view, a writer tries to translate this into a datatype index a chunk index and a first-write size (number of bytes of the current chunk left to write). In the code these are often n_filetypes, st_index, and fwr_size. As bytes are written the n_filetypes and st_index are iterated over for the incoming byte count. At the end the fd->fp_ind is updated.

First it's worth noting that the initial setting of fp_ind at the bottom of ad_set_view.c does skip over those 0-byte chunks, so the inital fp_ind value is 0 as it should be, eg the location data will actually go for the next write.

My testcase calls MPI_File_write(fh, buf, 2, MPI_INT, MPI_STATUS_IGNORE); which goes to ~line 130 in ad_write_str_naive.c where it comes up with n_filetypes=0, st_index=0, fwr_size=4.

Here it's already misidentified the starting point. That st_index should be 1. And the fact that it's got an initial fwr_size=4 means it actually writes data for the starting index even though it's on a 0-byte chunk of the datatype's flat representation.

But continuing further down around line 230 where it has a while loop with userbuf_off, that's where it writes the incoming 8 bytes while walking through the type/chunk pairs. At the bottom of that while loop where it updates off it keeps iterating to the next type/chunk as long as there are more bytes to write. But once the userbuf_off reaches bufsize, off is left at the "next" chunk without regard to whether it's real data or a 0-byte chunk. This is asymmetrical with file_set_view behavior and leads to ambiguity about the meaning of fp_ind making it impossible to correctly map an fp_ind back to a type/chunk (the off value updated there is what goes into fp_ind a little further down).

I made a partial fix where I encapsulated the concepts of updating an fp_ind value based on having accessed x number of bytes, and of translating an fp_ind into a type/chunk pair. It worked for the codepaths I updated, but overall I feel like there are a lot of paths that would need changed for that to be a complete fix. I'd rather take the path of removing those 0-byte chunks, then I believe the current fp_ind updating/interpreting code that's scattered throughout romio will probably all work as is. Opinions on whether that would break other uses of the flattened type?

To reproduce the issue I modified ad_nfs.c to use ADIOI_GEN_WriteStrided_naive:

Then the below testcase can demonstrate the problem: https://gist.github.com/markalle/1eb7cace99ba5497fe595037bec993d2 % mpicc -o x romio_types.c % mpirun -np 1 -mca io romio314 ./x

gpaulsen commented 6 years ago

Adding @nysal and @edgargabriel