open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org
Other
2.17k stars 861 forks source link

MPI_Datatype definitions cannot be used in constexpr statements #10017

Open bangerth opened 2 years ago

bangerth commented 2 years ago

I was trying to write a little generic C++ routine that translates the data type of a variable into the corresponding MPI_Datatype value. Something of this sort (the specific details don't actually matter, but you get the idea):

    namespace internal
    {
        inline MPI_Datatype
        mpi_type_id(const char *)
        {
          return MPI_CHAR;
        }

        ... similar for the other supported data types ...

    }   // namespace internal

    /**
     * A template variable that translates from the data type given as
     * template argument to the corresponding
     * `MPI_Datatype` to be used for MPI communication.
     *
     * As an example, the value of `mpi_type_id<int>` is `MPI_INT`. A
     * common way to use this variable is when sending an object `obj`
     * via MPI functions to another process, and using
     * `mpi_type_id<decltype(obj)>` to infer the correct MPI type to
     * use for the communication.
     */
    template <typename T>
    const MPI_Datatype mpi_type_id = internal::mpi_type_id(
      static_cast<std::remove_cv_t<
        std::remove_reference_t<std::remove_all_extents_t<T>>> *>(nullptr));

The problem is that at least some compilers don't allow making the variable (and the query functions) constexpr because, as I found out to my surprise, the definitions of things such as MPI_CHAR are not in fact constexpressions. I get compiler errors like

error: expression must have a constant value
/usr/lib/x86_64-linux-gnu/openmpi/include/mpi.h(925): note: type "ompi_predefined_datatype_t" too large for constant-expression evaluation

or

/Users/runner/work/dealii/dealii/include/deal.II/base/mpi.h:1488:28: error: constexpr variable 'mpi_type_id<unsigned long>' must be initialized by a constant expression
    constexpr MPI_Datatype mpi_type_id = internal::MPIDataTypes::mpi_type_id(
                           ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/runner/work/dealii/dealii/include/deal.II/base/mpi.h:1894:28: note: in instantiation of variable template specialization 'dealii::Utilities::MPI::mpi_type_id<unsigned long>' requested here
                           mpi_type_id<decltype(buffer_size)>,
                           ^
/Users/runner/work/dealii/dealii/include/deal.II/base/mpi.h:1419:18: note: cast from 'void *' is not allowed in a constant expression
          return MPI_UNSIGNED_LONG;
                 ^

Upon digging around, here's what I could find: We have

#define MPI_UNSIGNED OMPI_PREDEFINED_GLOBAL(MPI_Datatype, ompi_mpi_unsigned)

and then

#define OMPI_PREDEFINED_GLOBAL(type, global) (static_cast<type> (static_cast<void *> (&(global))))

and

extern struct ompi_predefined_datatype_t ompi_mpi_unsigned;

and

typedef struct ompi_datatype_t *MPI_Datatype;

and

struct ompi_predefined_datatype_t {
    struct ompi_datatype_t dt;
    char padding[PREDEFINED_DATATYPE_PAD - sizeof(ompi_datatype_t)];
};

So, if I see this right, then this all together yields

struct ompi_datatype_t {...};
typedef struct ompi_datatype_t *MPI_Datatype; // MPI_Datatype is a pointer

struct ompi_predefined_datatype_t {
    struct ompi_datatype_t dt;
    char padding[PREDEFINED_DATATYPE_PAD - sizeof(ompi_datatype_t)];
};

extern struct ompi_predefined_datatype_t ompi_mpi_unsigned;

#define MPI_UNSIGNED (static_cast<MPI_Datatype> (static_cast<void *> (&ompi_mpi_unsigned))

So at least that second error I show above is caused by the void* cast. I really can't quite make heads and tails of the first compiler's error message.

Obviously, none of my desire to be able to do things as constexpr is backed by the MPI standard, which has nothing to say about the issue. But as a matter of quality of implementation, it would nonetheless be nice if one could do this. I believe that that wouldn't actually be all that hard: The two casts here really just extract a pointer to the dt sub-object of the ompi_predefined_datatype_t structure, so the following alternative definition of MPI_UNSIGNED should in principle suffice:

#define MPI_UNSIGNED ompi_mpi_unsigned.dt

I suspect that I'm missing major considerations here, though :-)

devreal commented 2 years ago

That is indeed a problem I ran into in the past. The MPI standard gives the implementations the freedom to implement MPI handles in whichever way they want. Since C++ is not officially supported by MPI, this isn't even a quality of implementation question.

Your suggestion won't work either because ompi_mpi_unsigned is an external symbol that is not a compile-time constant. For Open MPI to support constexpr object handles it would have to completely change its ABI (probably not gonna happen unless some higher entity forces it to).

One possible solution would be wrap MPI types in your types that are constexpr and delay access to the underlying MPI type until execution time (probably needs some mapping function that can be inlined, not sure). That might not work with derived types, in case you need them.

devreal commented 2 years ago

You might want to add your wish to the list in https://github.com/mpi-forum/mpi-issues/issues/288 (if it's not already there). The MPI Forum is discussing a future C++ interface and I agree that support for constexpr would be nice (although I don't like the repercussions it might have on implementations).

bangerth commented 2 years ago

@devreal Actually, I think that this should still work. ompi_mpi_unsigned is an external variable whose contents are unknown in the header file and that consequently can't be accessed in a constexpr variable. But MPI_UNSIGNED only refers to the address of the object, and that is (at least as far as I know) a constexpr.

Indeed, this works:

struct S { int id; };

extern struct S internal_MPI_UNSIGNED;
extern struct S internal_MPI_INT;

constexpr S* MPI_UNSIGNED = &internal_MPI_UNSIGNED;
constexpr S* MPI_INT      = &internal_MPI_INT;

constexpr S* get_type_id ()
{
  return MPI_UNSIGNED;
}

int main()
{
  return (get_type_id() == MPI_INT);
}

// ---------- usually put into a different file --------

struct S internal_MPI_UNSIGNED = {13};
struct S internal_MPI_INT      = {42};
bangerth commented 2 years ago

I think that the root cause is really just the static_cast<void*> where you're doing type-punning because you want the address of the first member of the struct.

devreal commented 2 years ago

That's surprising. I stand corrected :) The cast is needed because of the padding put into predefined types in order to keep the ABI stable. Exposing the implementation of ompi_predefined_datatype_t (and necessarily ompi_datatype_t) defeats the purpose. I cannot quite make sense of the explanation given in https://github.com/open-mpi/ompi/blob/master/ompi/communicator/communicator.h#L349 as to why the padding is necessary in the first place, will need to understand that first.

bangerth commented 2 years ago

I think the cast isn't actually for the padding, but because you define MPI_UNSIGNED as a pointer to a member of a structure of a different type. You happen to know that the member is at offset zero, so their addresses are the same, but have different types, and the double cast is simply a way to do type punning of the pointer.

What I don't understand is why MPI_UNSIGNED isn't just defined as &ompi_mpi_unsigned.dt. That has the right type already, and because the ompi_predefined_datatype_t is declared in the header file, everything you need is actually available in the place. It's the same address, and doesn't need the type punning. That is, something of the form

#define MPI_UNSIGNED  ompi_mpi_unsigned.dt.
ggouaillardet commented 2 years ago

@bangerth my 0.02US$

shouldn't it be instead (note the &) ?

#define MPI_UNSIGNED &ompi_mpi_unsigned.dt

Anyway, internally, ompi_datatype_t and ompi_predefined_datatype_t are fully defined.

However, from an end user point of view (e.g. #include <mpi.h>), only (pointers to an ) unspecified struct are used, so the dt field is not exposed to the end users.

bangerth commented 2 years ago

Yes, it should have been the address.

devreal commented 2 years ago

After talking to @bosilca on why we have separate types for predefined MPI handles, here is my take-away: the internal data structures of Open MPI (ompi_datatype_t being among them) may change when Open MPI is compiled with different options (debug probably the most likely culprit). ompi_mpi_unsigned is defined as an external symbol so the linker records its size at link time and the loader compares that size to the actual size of the object in the loaded shared library. If you link against one installation of Open MPI and then run with another installation (one that has debugging enabled) the loader will complain because the size changed and accessing ompi_mpi_unsigned directly could result in weirdness. The linker doesn't know that the application will never actually access ompi_mpi_unsigned directly and is only interested in the pointer to it so it warns if the size of the object changes. The solution in Open MPI is to provide a padded wrapper type that is ABI stable (ompi_predefined_datatype_t).

We cannot expose the definition of ompi_predefined_datatype_t in the public header file since that would require us to also expose ompi_datatype_t (as @ggouaillardet mentioned). That would get us back to square 1.

This is a specific problem with predefined MPI handles and the fact that handles in Open MPI are pointers to objects. I don't have a good solution at hand but I agree that it would be nice to eventually have constexpr handles.

jeffhammond commented 1 year ago

@bangerth MPI handles are not specified to be compile-time constants, only link-time constants, so you cannot use constexpr on them, regardless of how they are implemented in Open-MPI, without making non-standard assumptions.

2.5.4 Named Constants

All named constants, with the exceptions noted below for Fortran, can be used in initialization expressions or assignments, but not necessarily in array declarations or as labels in C switch or Fortran select/case statements. This implies named constants to be link-time but not necessarily compile-time constants.