jedwards4b opened 1 year ago
I am participating in an NVIDIA hackathon this week and attempting to use the vfd-gds code in my netCDF library. I have been able to modify the nccopy program so that it uses GDS as an output device, and I have been able to verify the correctness of the output. However, the gds_stats program and cufile.log seem to indicate that very few of the writes are actually using the GDS mechanism. To confirm this, I added printf statements at the top of H5FD__gds_write and at each of the calls to cuFileWrite therein. Specifically, I see 3052 calls to H5FD__gds_write and only 15 calls to cuFileWrite. Can you help me to understand this behavior and, if possible, improve upon it?
Hi @jedwards4b, I wonder how many of those writes are file metadata as opposed to raw data. It's important to note that the GDS VFD is pretty limited at the moment and will only perform cuFileWrites of device-allocated buffers, so basically just calls to H5Dwrite. It might be worth adding a printf to H5FD__gds_write to print out the H5FD_mem_t "type" parameter and check those values to see what type of data is actually coming into the write call; that may give us an idea why the device is being underutilized.
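For illustration, such a diagnostic might look like the sketch below. It assumes the standard VFD write-callback prototype, which may differ slightly from the actual H5FD__gds_write in vfd-gds. For reference, H5FD_MEM_DRAW (enum value 3) is raw dataset data; the other H5FD_mem_t values indicate file metadata of various kinds.

```c
/* Sketch of a diagnostic at the top of H5FD__gds_write in H5FDgds.c
 * (prototype assumed to follow the stock VFD write callback; adjust
 * to match the actual signature in vfd-gds). */
static herr_t
H5FD__gds_write(H5FD_t *_file, H5FD_mem_t type, hid_t dxpl_id,
                haddr_t addr, size_t size, const void *buf)
{
    printf("H5FD__gds_write: type=%d (%s) addr=%llu size=%zu\n", (int)type,
           (type == H5FD_MEM_DRAW) ? "raw data" : "metadata",
           (unsigned long long)addr, size);

    /* ... existing write logic follows ... */
}
```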
The data type is 3 (H5FD_MEM_DRAW) both for writes using cuFileWrite and for those that are not. I did a traceback for one of the writes that is not using cuFileWrite, and what I think I see is that it's copying back to the CPU in subroutine H5FD__gds_ctl, but I'm not sure why.

Here is a traceback, if it will help. The cudaMalloc is done in hdf5var.c at the bottom of this stack.
```
#0  H5FD__gds_ctl (_file=0x555555874ec0, op_code=7, flags=3, input=0x7ffffffeeb78, output=0x0)
    at /home/parallelio_development/src/vfd-gds/src/H5FDgds.c:1959
#1  0x00007ffff73747ec in H5FD_ctl (file=0x555555874ec0, op_code=7, flags=3, input=0x7ffffffeeb78, output=0x0)
    at /lustre/hack_teams/parallelio_development/src/hdf5/src/H5FD.c:2188
#2  0x00007ffff725831d in H5D__compact_iovv_memmanage_cb (dst_off=3584, src_off=28672, len=32, _udata=0x7ffffffeed30)
    at /lustre/hack_teams/parallelio_development/src/hdf5/src/H5Dcompact.c:295
#3  0x00007ffff784ec2c in H5VM_opvv (dst_max_nseq=1, dst_curr_seq=0x7ffffffeee60, dst_len_arr=0x5555563fc8c8,
    dst_off_arr=0x5555563fe8d8, src_max_nseq=1024, src_curr_seq=0x7ffffffeee58, src_len_arr=0x555555b89198,
    src_off_arr=0x555555b8b1d8, op=0x7ffff7258000 <H5D__compact_iovv_memmanage_cb(hsize_t, hsize_t, size_t, void*)>,
    op_data=0x7ffffffeed30) at /lustre/hack_teams/parallelio_development/src/hdf5/src/H5VM.c:1315
#4  0x00007ffff7258b7e in H5D__compact_writevv (io_info=0x7ffffffef398, dset_info=0x7ffffffef258, dset_max_nseq=1,
    dset_curr_seq=0x7ffffffeee60, dset_size_arr=0x5555563fc8c8, dset_offset_arr=0x5555563fe8d8, mem_max_nseq=1024,
    mem_curr_seq=0x7ffffffeee58, mem_size_arr=0x555555b89198, mem_offset_arr=0x555555b8b1d8)
    at /lustre/hack_teams/parallelio_development/src/hdf5/src/H5Dcompact.c:404
#5  0x00007ffff72b5fcf in H5D__select_io (io_info=0x7ffffffef398, dset_info=0x7ffffffef258, elmt_size=8)
    at /lustre/hack_teams/parallelio_development/src/hdf5/src/H5Dselect.c:227
#6  0x00007ffff72b7572 in H5D__select_write (io_info=0x7ffffffef398, dset_info=0x7ffffffef258)
    at /lustre/hack_teams/parallelio_development/src/hdf5/src/H5Dselect.c:492
#7  0x00007ffff7240314 in H5D__chunk_write (io_info=0x7ffffffef778, dset_info=0x7ffffffef920)
    at /lustre/hack_teams/parallelio_development/src/hdf5/src/H5Dchunk.c:3270
#8  0x00007ffff7292279 in H5D__write (count=1, dset_info=0x7ffffffef920)
    at /lustre/hack_teams/parallelio_development/src/hdf5/src/H5Dio.c:738
#9  0x00007ffff78387bd in H5VL__native_dataset_write (count=1, obj=0x7ffffffefc60, mem_type_id=0x7ffffffefd20,
    mem_space_id=0x7ffffffefd18, file_space_id=0x7ffffffefd10, dxpl_id=792633534417207359, buf=0x7ffffffefd00, req=0x0)
    at /lustre/hack_teams/parallelio_development/src/hdf5/src/H5VLnative_dataset.c:407
#10 0x00007ffff7803756 in H5VL__dataset_write (count=1, obj=0x7ffffffefc60, cls=0x555555652bc0, mem_type_id=0x7ffffffefd20,
    mem_space_id=0x7ffffffefd18, file_space_id=0x7ffffffefd10, dxpl_id=792633534417207359, buf=0x7ffffffefd00, req=0x0)
    at /lustre/hack_teams/parallelio_development/src/hdf5/src/H5VLcallback.c:2236
#11 0x00007ffff7803a4d in H5VL_dataset_write_direct (count=1, obj=0x7ffffffefc60, connector=0x55555573fb60,
    mem_type_id=0x7ffffffefd20, mem_space_id=0x7ffffffefd18, file_space_id=0x7ffffffefd10, dxpl_id=792633534417207359,
    buf=0x7ffffffefd00, req=0x0) at /lustre/hack_teams/parallelio_development/src/hdf5/src/H5VLcallback.c:2280
#12 0x00007ffff721c94c in H5D__write_api_common (count=1, dset_id=0x7ffffffefd28, mem_type_id=0x7ffffffefd20,
    mem_space_id=0x7ffffffefd18, file_space_id=0x7ffffffefd10, dxpl_id=792633534417207359, buf=0x7ffffffefd00, token_ptr=0x0,
    _vol_obj_ptr=0x0) at /lustre/hack_teams/parallelio_development/src/hdf5/src/H5D.c:1331
#13 0x00007ffff721cc6f in H5Dwrite (dset_id=360287970189639698, mem_type_id=216172782113784124, mem_space_id=288230376151711842,
    file_space_id=288230376151711843, dxpl_id=792633534417207359, buf=0x7fff9a000000)
    at /lustre/hack_teams/parallelio_development/src/hdf5/src/H5D.c:1388
#14 0x00007ffff7dd91ab in NC4_put_vars (ncid=131072, varid=0, startp=0x555555ba1f40, countp=0x555555ba1f60, stridep=0x0,
    data=0x7fffc5fff010, mem_nc_type=6) at hdf5var.c:1816
```
It seems that if the variable does not use the netCDF unlimited dimension, the write goes through cuFileWrite, but if the variable does use the unlimited dimension, the data is copied back to the CPU before writing.
Based on the trace here, I think it may be the case that the GDS VFD is copying the buffer back to the CPU because the library is attempting to make use of HDF5's chunk cache. In this case the I/O path is not yet well-adapted to the GDS VFD, so we currently incur the overhead of the copy back until further development can be done.
As an experiment, one thing you could try is to disable HDF5's chunk cache and the writing of fill values to the dataset, with a call to H5Pset_cache(fapl_id, 0, 0, 0, 0.0) on a File Access Property List used for opening the file and a call to H5Pset_fill_time(dcpl_id, H5D_FILL_TIME_NEVER) on a Dataset Creation Property List used for creating the datasets; the two calls are sketched below. I'm assuming that would probably involve some hacking up of netCDF, though. If you'd rather hack up HDF5, you could also try setting the cacheable variable at https://github.com/HDFGroup/hdf5/blob/develop/src/H5Dchunk.c#L3199 so that it's always false, regardless of what H5D__chunk_cacheable returns.
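For reference, a minimal sketch of those two property-list calls (fapl_id and dcpl_id are illustrative handles; netCDF creates these property lists internally, so in practice the calls would have to be patched into its HDF5 layer):

```c
#include "hdf5.h"

/* Illustrative sketch: turn off the raw-data chunk cache and skip
 * fill-value writes. */
hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_cache(fapl_id, 0, 0, 0, 0.0);            /* no chunk-cache slots or bytes */

hid_t dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_fill_time(dcpl_id, H5D_FILL_TIME_NEVER); /* never write fill values */
```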
I'm pretty sure that the fill was already off. I added the set_cache call, but it doesn't seem to have changed the behavior. I'll try asking the netCDF folks if they might know what's going on. I think that for this hackathon I can convert the unlimited dimension to a fixed dimension and avoid the problem for now.
Yes, reformatting the file and removing the unlimited dimension allows us to use cuFileWrite for nearly all of the input variables. However, and not surprisingly, if I add data compression it copies back to the CPU and does not use cuFileWrite. We are working on a new HDF5 filter that will do data compression on the GPU. My question now is: will HDF5 call the data compression filter with the data on the GPU, or will it copy back to the CPU before it calls the filter?
Unfortunately, in the case of data compression the library currently forces I/O through the chunk cache path, so I'd expect that it's going to copy the data back to the CPU before calling the filter. I sort of have an idea of how to hack around that, but it'd take some investigation if you want the library to gracefully pass your device-allocated buffer through to the filter without touching it.
If you can suggest a hack, we would appreciate it. We are planning a simple filter that just prints the size of the data buffer, and we can also have it indicate whether it's operating on a device buffer or not. I'll write back and let you know when we have completed that step. Thank you so much for your help.
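A debug filter along those lines might look roughly like the sketch below. Everything here is illustrative rather than the actual filter under development: the filter ID (512) is an arbitrary value from HDF5's application-defined range, and cudaPointerGetAttributes is one way to test whether the incoming buffer lives on the device.

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include "hdf5.h"

#define DEBUG_FILTER_ID 512 /* arbitrary ID from the application-defined range */

/* Pass-through filter: print the buffer size and whether it is device memory. */
static size_t
debug_filter(unsigned int flags, size_t cd_nelmts, const unsigned int cd_values[],
             size_t nbytes, size_t *buf_size, void **buf)
{
    struct cudaPointerAttributes attrs;
    int on_device = (cudaPointerGetAttributes(&attrs, *buf) == cudaSuccess &&
                     attrs.type == cudaMemoryTypeDevice); /* .type is CUDA 10+ */

    (void)cd_nelmts; (void)cd_values; (void)buf_size; /* unused in this sketch */
    cudaGetLastError(); /* clear any error raised by probing a host pointer */

    printf("debug_filter: %s, nbytes=%zu, device buffer: %s\n",
           (flags & H5Z_FLAG_REVERSE) ? "read path" : "write path",
           nbytes, on_device ? "yes" : "no");

    return nbytes; /* pass the data through unchanged */
}

static const H5Z_class2_t DEBUG_FILTER_CLASS[1] = {{
    H5Z_CLASS_T_VERS,              /* H5Z_class_t version */
    (H5Z_filter_t)DEBUG_FILTER_ID, /* filter ID */
    1, 1,                          /* encoder and decoder present */
    "debug filter",                /* filter name */
    NULL, NULL,                    /* can_apply / set_local callbacks */
    debug_filter,                  /* the filter function itself */
}};

/* Register with H5Zregister(DEBUG_FILTER_CLASS), then attach to datasets via
 * H5Pset_filter(dcpl_id, DEBUG_FILTER_ID, H5Z_FLAG_MANDATORY, 0, NULL). */
```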
We were able to complete that step today, and indeed the data is back in CPU memory by the time it gets to the filter. On another front: when we converted the data file to not have an unlimited dimension, we were able to see that all of the calls were going to cuFileWrite. But the performance still wasn't very good, so we had some cuFile engineers from NVIDIA do some analysis with us, and we saw that although we were calling the cuFileWrite interface, the writes were mostly being diverted to POSIX, apparently because they were not aligned on 4K boundaries. Our write buffers are 1 MB. If a write isn't aligned on 4K, wouldn't the first <4K use POSIX and the rest use GDS, so that even if each write isn't aligned, most of it would still use GDS? Apparently that isn't what's happening.
I'm afraid I'm not knowledgeable enough about the internals of cuFileWrite and similar to state anything definitive (I didn't really write most of the VFD), but I imagine the misalignment is occurring across the whole buffer. For I/O, the VFD tries to create some threads (just 1 by default), split up the data buffer among the threads, and then do I/O according to a fixed block size (8 MiB by default, it appears). When the I/O size is less than the block size, it just does a cuFileWrite of everything, so I imagine you're getting writes of whole 1 MiB buffers that are fully misaligned, but that's just speculation on my part.

You could experiment with changing the block size by setting the "H5_GDS_VFD_IO_BLOCK_SIZE" environment variable, but I think the better thing to try would be to use https://portal.hdfgroup.org/display/HDF5/H5P_SET_ALIGNMENT on a File Access Property List for the file, as in H5Pset_alignment(fapl_id, chunk_size, 4096). It may waste a little space in the file, but that should make sure that any object created in the file of size chunk_size or larger will be aligned on a 4K boundary, which I think should help you out a bit. For the block size, note that for some reason the block_size parameter of the H5Pset_fapl_gds() call appears to be something different and doesn't influence the fixed block size for I/O, so if you experiment with that, it will have to be set via the environment variable. Though, I'm hoping that simply setting the file alignment will allow all your writes to be on 4K boundaries and hopefully go to the device rather than POSIX.
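A minimal sketch of the alignment setting (the 1 MiB threshold here is illustrative and stands in for whatever chunk size the datasets actually use):

```c
#include "hdf5.h"

/* Illustrative: any object of at least `threshold` bytes allocated in
 * the file will start on a 4 KiB boundary. */
hsize_t threshold = 1024 * 1024; /* e.g. match the 1 MiB write buffers */
hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_alignment(fapl_id, threshold, 4096);
```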
As for the hack I mentioned earlier, the trick is that we can't avoid the chunk cache path in the library due to the way data filters are currently handled. However, we might be able to hack up the source code so that, rather than the library allocating and working with chunk data buffers via plain malloc/memcpy/memset (or H5MM_malloc/memcpy/HDmemset, as it were), it asks the VFD to allocate device buffers and perform the memcpy/memset/etc. operations on its behalf. Most of this occurs in H5D__chunk_lock when trying to lock a chunk into the chunk cache. For example: https://github.com/HDFGroup/hdf5/blob/develop/src/H5Dchunk.c#L4459-L4464, https://github.com/HDFGroup/hdf5/blob/develop/src/H5Dchunk.c#L4479, https://github.com/HDFGroup/hdf5/blob/develop/src/H5Dchunk.c#L4501-L4511, and https://github.com/HDFGroup/hdf5/blob/develop/src/H5Dchunk.c#L4524-L4556. Those calls to H5D__chunk_mem_alloc() need to be made more intelligent, so that the library knows whether it can allocate and work with buffers using standard C calls, or whether it needs to ask the VFD to do it.
If those buffers were to get allocated on the device, then the following I/O call at https://github.com/HDFGroup/hdf5/blob/develop/src/H5Dchunk.c#L3270 should correctly ask the GDS VFD to do cudaMemcpy operations to copy data from the application buffer into a temporary device-allocated buffer for the chunk. Then, when the chunk is eventually evicted from the cache (or written immediately if it wasn't cacheable), the call to H5D__chunk_flush_entry() should be passing a device-allocated buffer to your filter. Note that there are a couple of edge cases to be dealt with there as well, such as https://github.com/HDFGroup/hdf5/blob/develop/src/H5Dchunk.c#L3921-L3928. In short, it's a bit roundabout and would need a non-trivial amount of hacking, but it should be possible. We could certainly try hacking around until a particular use case works the way you'd expect, but the general solution is far more involved. A rough sketch of the allocation-side idea is below.
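To make the shape of that hack concrete, here is a purely hypothetical sketch. The ctl op code and the wrapper function below do not exist in HDF5 or vfd-gds; they only illustrate routing chunk-buffer allocation through the VFD so that a GDS-aware driver could hand back device memory:

```c
/* Hypothetical sketch only -- the op code and wrapper are invented. */
#define H5FD_GDS_CTL_MEM_ALLOC 0x0F01 /* invented ctl op code */

static void *
chunk_mem_alloc_vfd_aware(H5FD_t *lf, size_t size, const H5O_pline_t *pline)
{
    void *buf = NULL;

    /* Ask the driver first; a GDS-aware VFD could cudaMalloc() here and
     * remember that this chunk buffer lives on the device. */
    if (H5FD_ctl(lf, H5FD_GDS_CTL_MEM_ALLOC, H5FD_CTL_FAIL_IF_UNKNOWN_FLAG,
                 &size, &buf) < 0 || NULL == buf)
        buf = H5D__chunk_mem_alloc(size, pline); /* fall back to host malloc */

    return buf;
}
```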