
ROMIO: excessive number of calls to memcpy() #6985

Open wkliao opened 5 months ago

wkliao commented 5 months ago

A PnetCDF user reported poor performance of collective writes when using a noncontiguous write buffer. The root of the problem is a large number of calls to memcpy() in ADIOI_BUF_COPY in mpich/src/mpi/romio/adio/common/ad_write_coll.c.

A performance reproducer is available at https://github.com/wkliao/mpi-io-examples/blob/master/tests/pio_noncontig.c

This program makes a single call to MPI_File_write_at_all. The user buffer can be either contiguous (command-line option -g 0) or noncontiguous (the default). The noncontiguous case adds a gap of 16 bytes into the buffer. The file view consists of multiple subarray datatypes, appended one after another. A further description of the I/O pattern can be found at the beginning of the program file.
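For illustration, a gapped buffer like this can be described to MPI as a two-block hindexed datatype, roughly as in the sketch below. The block sizes are placeholders; see pio_noncontig.c for the actual layout.

    /* Sketch only: a user buffer with one 16-byte gap, described as an
     * MPI hindexed datatype with two contiguous blocks of bytes.
     * len1/len2 are placeholders; the reproducer builds the real layout. */
    int blocklens[2]   = { len1, len2 };
    MPI_Aint displs[2] = { 0, len1 + 16 /* skip the gap */ };
    MPI_Datatype buftype;
    MPI_Type_create_hindexed(2, blocklens, displs, MPI_BYTE, &buftype);
    MPI_Type_commit(&buftype);
    /* buftype can then describe the noncontiguous user buffer in
     * MPI_File_write_at_all(fh, offset, buf, 1, buftype, &status) */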

Running this program with 16 MPI processes on a Linux machine using the UFS ADIO driver gave run times of 33.07 and 8.27 seconds; the former is for the noncontiguous user buffer and the latter for the contiguous one. The user buffer on each process is 32 MB. The run commands used:

    mpiexec -n 16 ./pio_noncontig -k 256 -c 32768 -w
    mpiexec -n 16 ./pio_noncontig -k 256 -c 32768 -w -g 0

The following patch, if applied to MPICH, prints the number of calls to memcpy(): https://github.com/wkliao/mpi-io-examples/blob/master/tests/0001-print-number-of-calls-to-memcpy.patch
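The patch itself is linked above; conceptually it amounts to something like the following sketch (not the patch's actual contents), counting each buffer copy made in the collective write path and reporting the total.

    /* Sketch only, not the linked patch: count each call to memcpy()
     * made while filling send buffers, then report the total. */
    static long n_memcpy = 0;
    #define COUNTED_MEMCPY(d, s, n) (n_memcpy++, memcpy((d), (s), (n)))
    /* ... after the collective write completes ... */
    printf("rank %d: %ld calls to memcpy()\n", rank, n_memcpy);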

The numbers of memcpy calls from the above two runs are 2097153 and 0, respectively.

hzhou commented 5 months ago

I haven't looked at the code, but a factor of 4 (from 8.27 to 33.07 seconds) going from contiguous to noncontiguous seems normal to me, especially if the data consists of many small segments.

wkliao commented 5 months ago

"The noncontiguous case adds a gap of 16 bytes into the buffer" means the buffer has two contiguous segments: one of size 256 bytes and the other of size 256x16x8191 bytes, separated by a gap of 16 bytes.

The focus of this issue is the number of memcpy calls, as indicated in the issue title, which is 2097153 per process. In fact, ROMIO can be fixed to reduce that to 2 memcpy calls.

The test runs I provided were just to prove the point. The case is small and reproducible even on a single compute node, which makes debugging easier. When tested with fewer processes, say 8, the timing gap becomes bigger: 24.48 vs. 1.98 seconds. The actual runs reported by the PnetCDF user are at a much larger scale, with a total write amount > 20 GB; there the time difference was 198.5 vs. 14.9 seconds.

hzhou commented 5 months ago

Thank you for the details!

Hui

hzhou commented 5 months ago

Writing down my notes after looking at the code –

The buffer in memory is a "dense" noncontiguous datatype -- in the reproducer it's two segments -- but the filetype is fairly fragmented. In the aggregation code, we calculate contig_access_count, the number of segments resulting from the intersection between the memory buffer datatype and the file datatype. In the reproducer, this comes to 2097153 for each process. In ADIOI_Fill_send_buffer, each process memcpys the segments into a send buffer before sending them to the aggregators, and this results in 2097153 calls to memcpy, significantly hurting performance.
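Schematically, the pattern described above looks like the loop below. This is a sketch, not ROMIO's actual code; the seg_* arrays are made-up names for the flattened intersection.

    /* Sketch of the pattern described above, not ROMIO's actual code:
     * one memcpy per intersected segment, even when the user buffer
     * itself consists of only two contiguous pieces. */
    for (i = 0; i < contig_access_count; i++) {
        memcpy(send_buf + send_off, user_buf + seg_off[i], seg_len[i]);
        send_off += seg_len[i];
    }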

@wkliao Does the above describe the issue?

I am not familiar with the ROMIO code, so I could be way off -- why don't we use MPI_Pack to prepare the send buffer?
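For reference, packing a noncontiguous datatype into a contiguous send buffer with MPI_Pack would look roughly like this sketch, where buf and buftype are assumed to describe the user's noncontiguous data:

    /* Sketch: pack one instance of a noncontiguous buftype into a
     * contiguous staging buffer using the standard MPI_Pack API. */
    int pack_size, position = 0;
    MPI_Pack_size(1, buftype, comm, &pack_size);
    char *send_buf = malloc(pack_size);
    MPI_Pack(buf, 1, buftype, send_buf, pack_size, &position, comm);
    /* position now holds the number of bytes actually packed */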

Also, something I have been thinking about: if we supported a "partial datatype", e.g. MPIX_Type_create_partial(old_count, old_type, offset, size, &new_type), that might be useful. It would let middleware users such as ROMIO use MPI directly for pipeline-like operations without messing with flat_list or contig_segments.
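Usage of such an API might look like the sketch below. To be clear, MPIX_Type_create_partial is the proposal above, not an existing MPICH function; round and round_size are made-up names for one pipelined two-phase round.

    /* Hypothetical API (the proposal above, not an existing function):
     * select the byte range [offset, offset + size) out of
     * (old_count, old_type) as a standalone datatype, per pipeline round. */
    MPI_Datatype round_type;
    MPIX_Type_create_partial(old_count, old_type,
                             round * round_size, round_size, &round_type);
    MPI_Type_commit(&round_type);
    /* ... use round_type for this round's transfer ... */
    MPI_Type_free(&round_type);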

-- Hui

wkliao commented 5 months ago

Your understanding of the issue is correct.

> I am not familiar with the ROMIO code, so I could be way off -- why don't we use MPI_Pack to prepare the send buffer?

I think it is because of the memory footprint. In my test program, the additional memory space would be 32 MB. For a bigger problem size, the footprint is bigger.

I do not follow the idea of a "partial datatype". Will it help construct a datatype that is the intersection of two other datatypes (the user buffer type and the file view)?

roblatham00 commented 5 months ago

@wkliao Hui implemented a way to work on datatypes without flattening the whole thing first. We would still have to compute the intersection of the memory type and the file view, but I think his hope is that the datatype data structures might be less memory-intensive -- not as a solution to this issue, but an idea for a ROMIO enhancement that came to mind while looking at this code.

wkliao commented 5 months ago

Since the current implementation of collective I/O is done in multiple rounds of two-phase I/O, if such partial datatype flattening works, then I expect the memory footprint could be reduced significantly, which would be great.

FYI, I added code inside ROMIO to measure the memory footprint and ran pio_noncontig.c using the commands provided in my earlier comment. The high watermark is about 300 MB (the maximum among the 16 processes) for such a small test case. I think it mainly comes from flattening the fileview datatype.

As for this issue, my own solution is to check whether the part of the user buffer touched in each two-phase I/O round is contiguous. If it is, pass it directly to MPI_Issend and thus skip most of the memcpy calls.
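A minimal sketch of that idea follows; the helper names are made up, and the real fix has to live inside ROMIO's aggregation loop.

    /* Sketch of the proposed fix, with made-up names: if this round's
     * piece of the user buffer is a single contiguous run, send it in
     * place instead of memcpy-ing it into send_buf first. */
    if (segments_this_round == 1) {
        MPI_Issend(user_buf + seg_off[0], seg_len[0], MPI_BYTE,
                   aggregator_rank, tag, comm, &req);
    } else {
        /* fall back to the existing per-segment copies into send_buf */
    }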