openPMD / openPMD-api

:floppy_disk: C++ & Python API for Scientific I/O
https://openpmd-api.readthedocs.io
GNU Lesser General Public License v3.0

Add rank table for locality-aware streaming #1505

Closed: franzpoeschel closed this 4 months ago

franzpoeschel commented 1 year ago

This is the first logical part of #824, which I am now splitting into two separate PRs:

  1. Writing and reading a rank table that a reader code can use to determine which MPI rank was running on which compute node.
  2. Chunk distribution algorithms based on that information.

The idea is that a writer code can either explicitly use:

series.setMpiRanksMetaInfo(/* myRankInfo = */ "host123_numa_domain_2");

... or alternatively initialize the Series with the JSON/TOML parameter rank_table:

Series write(..., R"(rank_table = "hostname")");

(The second option is useful as it requires no rewriting of existing code.)
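For orientation, a minimal MPI-parallel writer sketch combining option 1 with a plain POSIX gethostname() call might look like this (illustration only, error handling omitted):

#include <openPMD/openPMD.hpp>

#include <mpi.h>
#include <unistd.h> // gethostname (POSIX)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    {
        // Per-rank info string, here simply the POSIX hostname
        char host[256] = {0};
        gethostname(host, sizeof(host) - 1);

        openPMD::Series series(
            "ranktable.bp", openPMD::Access::CREATE, MPI_COMM_WORLD);
        // Option 1 from above: set the per-rank entry explicitly
        series.setMpiRanksMetaInfo(host);

        // ... define and store datasets as usual ...
    } // Series is flushed and closed before MPI_Finalize
    MPI_Finalize();
}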

A 2D char dataset is then created, encoding the per-rank information line by line, e.g.:

$ bpls ranktable.bp/ -de '/rankTable'
  char     /rankTable                                {4, 8}
    (0,0)    h a l 8 9 9
    (0,6)    9  h a l 8
    (1,4)    9 9 9  h a
    (2,2)    l 8 9 9 9 
    (3,0)    h a l 8 9 9
    (3,6)    9  

A reader code can then access this information via:

>>> import openpmd_api as io
>>> s = io.Series("ranktable.bp", io.Access.read_only)
>>> s.get_mpi_ranks_meta_info(collective=False)
{0: 'hal8999', 1: 'hal8999', 2: 'hal8999', 3: 'hal8999'}

It can then compare that information against the unsigned int WrittenChunkInfo::sourceID returned by availableChunks():

struct WrittenChunkInfo : ChunkInfo
{
    unsigned int sourceID = 0; //!< ID of the data source containing the chunk
    ...
};
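For illustration, a reader could correlate each chunk's sourceID with the rank table roughly as follows (the C++ getter name mpiRanksMetaInfo() is assumed here as the counterpart of the Python get_mpi_ranks_meta_info() above; the mesh/component names are placeholders):

#include <openPMD/openPMD.hpp>

#include <iostream>

int main()
{
    openPMD::Series series("ranktable.bp", openPMD::Access::READ_ONLY);

    // Assumed C++ counterpart of get_mpi_ranks_meta_info():
    // maps writing MPI rank -> per-rank info string (e.g. hostname)
    auto rankMeta = series.mpiRanksMetaInfo(/* collective = */ false);

    // Placeholder dataset; use whatever record component you actually read
    auto component = series.iterations.begin()->second.meshes["E"]["x"];
    for (auto const &chunk : component.availableChunks())
    {
        std::cout << "chunk written by rank " << chunk.sourceID << " on host "
                  << rankMeta.at(chunk.sourceID) << '\n';
    }
}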
franzpoeschel commented 11 months ago

Regarding the ws2_32 lib for gethostname: there is a more portable way to achieve this with standard MPI calls, namely MPI_Get_processor_name in chapter 9 of the MPI standard:

https://www.mpi-forum.org/docs/mpi-4.0/mpi40-report.pdf

Note that some implementations append the CPU id to the name as well and others do not.
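For reference, a minimal sketch of what such an MPI-based variant could look like (only valid between MPI_Init and MPI_Finalize):

#include <mpi.h>

#include <string>

// Returns the MPI processor name of the calling rank
std::string mpiProcessorName()
{
    char name[MPI_MAX_PROCESSOR_NAME];
    int length = 0;
    MPI_Get_processor_name(name, &length);
    return std::string(name, length);
}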

Using the MPI call is not ideal either, since MPI is not always available and the call suffers from exactly the same trouble:

*** The MPI_Get_processor_name() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.

In practice this is a lesser problem, since most applications that use MPI also have it initialized, but I think the better solution is to make the chosen implementation explicit. I was intending for this call to have multiple implementations anyway:

// include/openPMD/ChunkInfo.hpp
namespace host_info
{
    enum class Method
    {
        HOSTNAME
    };

    std::string byMethod(Method);

#if openPMD_HAVE_MPI
    chunk_assignment::RankMeta byMethodCollective(MPI_Comm, Method);
#endif

    std::string hostname();
} // namespace host_info
// src/Series.cpp
                if (viaJson.value == "hostname")
                {
                    return host_info::hostname();
                }
                else
                {
                    throw error::WrongAPIUsage(
                        "[Series] Wrong value for JSON option 'rank_table': '" +
                        viaJson.value + "'.");
                }

The thought behind this is: in some setups you want writer and reader to match by hostname, in others by NUMA node, in yet others by CPU id; you might even want to group multiple hosts into one group. Since the chunk distribution algorithms in #824 use this rank table as the basis for distributing chunks, this keeps that choice flexible for the user.

Hence my suggestion: split this into POSIX_HOSTNAME, WINSOCKS_HOSTNAME and MPI_HOSTNAME. Users would then need to inquire which options are available on their systems. The enum class Method would always contain all options without any #ifdefs, and we could add a call host_info::method_available(Method) -> bool.

This way, users would need to state explicitly that they want to use the Winsocks implementation, with all the implications that has.
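A rough sketch of what this could look like in ChunkInfo.hpp (not a final API):

// include/openPMD/ChunkInfo.hpp (sketch)
namespace host_info
{
    // Always lists all options, independent of platform support
    enum class Method
    {
        POSIX_HOSTNAME,
        WINSOCKS_HOSTNAME,
        MPI_HOSTNAME
    };

    // Query at runtime whether a method is compiled in and usable
    bool method_available(Method);

    // Dispatch to the selected implementation
    std::string byMethod(Method);

#if openPMD_HAVE_MPI
    chunk_assignment::RankMeta byMethodCollective(MPI_Comm, Method);
#endif
} // namespace host_info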