lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

`PARTFILE` support for near-null vectors (and eigenvectors) #1398

Closed weinbe2 closed 1 year ago

weinbe2 commented 1 year ago

This PR exposes the ability to save near-null vectors (and eigenvectors) in QIO's PARTFILE format, which writes one file per MPI rank. The primary purpose is to speed up saving (and loading) near-null vectors during MG when tuning the algorithm, but it can also be used (very effectively) in production runs so long as you can assume the process decomposition will not change between runs.

A PARTFILE workflow has already been documented on the QUDA wiki: files are stored to per-node local scratch disks and copied to the network drive after the run, and the process is run in reverse on later runs.
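For illustration, the staging step of such a workflow might look like the sketch below. This is not taken from the wiki: the function name `stage_partfiles`, the directory arguments, and the `<basename>.volNNNN` per-rank naming are all assumptions, so adjust the pattern to whatever your files are actually called.

```python
import shutil
from pathlib import Path


def stage_partfiles(src_dir, dst_dir, basename):
    """Copy every per-rank piece of `basename` from src_dir to dst_dir.

    Assumption: each piece is named `<basename>.volNNNN`. Change the
    glob pattern below if your per-rank files are named differently.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for piece in sorted(src_dir.glob(basename + ".vol*")):
        # copy2 preserves timestamps, which helps when checking stale copies
        shutil.copy2(piece, dst_dir / piece.name)
        copied.append(piece.name)
    return copied
```

After a run this would be called once per node with the local scratch directory as `src_dir` and the network drive as `dst_dir`; before a later run the two arguments are swapped. Since each node only holds the pieces written by its own ranks, the copy has to run on every node.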

This is threaded through the test executables via the flags --mg-save-partfile and --eig-save-partfile, as well as through the MILC MG interface.

Of note: there is no need for an analogous "loading" flag because QIO will automatically look for singlefile, then partfile, versions of a file on load. There is also no functional reason why this can't be added for gauge fields as well; there is just far less of a use case (and much more risk of confusion).
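To make the load order concrete, here is a toy model of "singlefile first, then partfile". It is not QIO's implementation: `resolve_vector_file` is a hypothetical helper, and the `.volNNNN` suffix for the per-rank piece is an assumption.

```python
from pathlib import Path


def resolve_vector_file(directory, basename, rank):
    """Return (kind, path) for the file a given rank would read.

    Models the lookup order described above: a singlefile wins if it
    exists; otherwise fall back to this rank's partfile piece.
    Assumption: the piece is named `<basename>.volNNNN`.
    """
    directory = Path(directory)
    single = directory / basename
    if single.is_file():
        return "singlefile", single
    part = directory / f"{basename}.vol{rank:04d}"
    if part.is_file():
        return "partfile", part
    raise FileNotFoundError(f"no singlefile or partfile for {basename!r}")
```

One consequence of this ordering is that a stale singlefile left in the same directory would shadow newer partfile pieces, so it is worth keeping the two in separate directories.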

This has been verified to give a speedup for 144^3x288 HISQ MG workflows on Selene, where saving 64 fine-level near-null vectors drops from ~144 seconds to ~6 seconds. While I don't have the allocation to perform fresh timings on other machines, historically I have seen the analogous save take up to an hour on Summit; it's expected this would be much faster with the on-node SSDs.
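As a rough back-of-the-envelope check on those numbers, the sketch below computes the speedup and an estimate of the data volume. The timings are the approximate ones quoted above; the 3 complex components per site and the 4-byte real precision are assumptions for illustration, not something stated in this PR.

```python
def speedup(before_s, after_s):
    """Ratio of the old save time to the new save time."""
    return before_s / after_s


def vector_bytes(dims, n_complex_per_site, bytes_per_real):
    """Size in bytes of one vector on a lattice with the given dimensions."""
    sites = 1
    for d in dims:
        sites *= d
    return sites * n_complex_per_site * 2 * bytes_per_real


dims = (144, 144, 144, 288)
# Assumptions: 3 complex components per site, single precision on disk.
one_vec = vector_bytes(dims, n_complex_per_site=3, bytes_per_real=4)
total = 64 * one_vec

print(f"speedup: {speedup(144, 6):.0f}x")
print(f"total data: {total / 1e12:.2f} TB")
print(f"aggregate rate: {total / 144 / 1e9:.1f} -> {total / 6 / 1e9:.1f} GB/s")
```

Under these assumptions the save moves on the order of a terabyte in aggregate, which is why the choice between one shared file and one file per rank matters so much.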

weinbe2 commented 1 year ago

This is a great addition. A couple of things:

  • io_test needs to be extended to test the PARTFILE saving. Of course this will only be non-trivial when running on multiple processes, but that's fine.
  • Should the QudaBoolean addition in the interface instead just be bool? We already implicitly require C99 support, so I see no reason not to just use bool. While in the long term we'll want to remove the QudaBoolean for legacy interface options, perhaps now we draw a line in the sand and just use bool for new additions?

Re: io_test, I had it in the to-do checklist in my PR already :) But it's also now done in https://github.com/lattice/quda/pull/1398/commits/fd467c00c1899fe3a4c5e810d11b008d3dae212b

As for just using bool... I agree we should remove QudaBoolean in the future, but on the off chance this causes a problem for some external code, I don't want this to be the PR that triggers it. I think we should change the convention in one go and deal with the consequences then.

maddyscientist commented 1 year ago

Fair enough regarding bool / QudaBoolean. Line in sand can be drawn another day.