Clarify alignment requirements

jdinan commented 7 years ago

Alignment requirements on window buffers and RMA operations are not currently clear.

My current understanding is that window buffers have no alignment requirement, but that the effective target address (base + offset*disp_unit) for RMA operations must be naturally aligned for the datatype used in the operation.
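As a rough illustration of the understanding described above, the check could be sketched in C as follows. The helper names `effective_target` and `is_aligned` are hypothetical and not part of MPI; the sketch only shows the arithmetic `base + offset*disp_unit` and a power-of-two alignment test.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: effective target address of an RMA operation,
   computed as base + offset * disp_unit (all quantities in bytes). */
static uintptr_t effective_target(uintptr_t base, size_t offset, size_t disp_unit)
{
    return base + (uintptr_t)offset * disp_unit;
}

/* True if addr is a multiple of align; align must be a power of two. */
static bool is_aligned(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}
```

With a 16-byte-aligned base and `disp_unit = sizeof(double)`, every offset yields an address aligned for `double`; with `disp_unit = 1`, an odd offset does not.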

I can't seem to locate text for this semantic. This issue is a placeholder to track down the semantic and add a clarification somewhere in the RMA chapter.

jeffhammond commented 7 years ago

We should make sure to fix MPI_Alloc_mem while we are doing this. I advocate for info keys as the mechanism to request alignment in both MPI_Alloc_mem and the MPI_Win_allocate(_shared) cases.

devreal commented 5 years ago

What is the status of this issue? The missing alignment guarantee is somewhat of an annoyance for us and we support the idea of @jeffhammond to request an alignment using info keys (in case this hasn't been discussed and down-voted already).

devreal commented 5 years ago

Since I had some time to spare over the weekend I thought I'd give this a try. I'm not sure how to properly format such a proposal so I am simply marking my proposed changes in the text:

8.2 Memory Allocation

In some systems, message-passing and remote-memory-access (RMA) operations run faster when accessing specially allocated memory (e.g., memory that is shared by the other processes in the communicating group on an SMP). MPI provides a mechanism for allocating and freeing such special memory. The use of such memory for message-passing or RMA is not mandatory, and this memory can be used without restrictions as any other dynamically allocated memory. However, implementations may restrict the use of some RMA functionality as defined in Section 11.5.3. The memory allocated through this functionality should be suitably aligned for any predefined MPI datatype.

[...]

The info argument can be used to provide directives that control the desired location of the allocated memory. Such a directive does not affect the semantics of the call. ~Valid~ These info values are implementation-dependent. The info key memalign can be used to control the alignment of the allocated memory in powers of two. A null directive value of info = MPI_INFO_NULL is always valid.

11.2.3 Window That Allocates Shared Memory

The allocated memory is contiguous across process ranks unless the info key alloc_shared_noncontig is specified. Contiguous across process ranks means that the first address in the memory segment of process i is consecutive with the last address in the memory segment of process i − 1. This may enable the user to calculate remote address offsets with local information only. For contiguous memory, the memalign info key applies only to process i=0.

A.1.5 Info Keys

Add the memalign key to the list of info keys.

Note that there are no changes required to the chapter 11.2.2 (Window That Allocates Memory) since it is included through the references of Section 8.2.

The implementation of the memalign info key for MPI_Alloc_mem should be straightforward as all platforms provide functionality along the lines of posix_memalign. Window allocation functions can either use posix_memalign or, if the window is backed by shared memory, allocate additional space used to ensure alignment on each process.
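A minimal sketch of the two strategies mentioned above, assuming a POSIX system. The function names are hypothetical; `aligned_start` stands in for carving an aligned start out of an over-allocated shared segment, and `alloc_aligned` wraps `posix_memalign` for the private-memory case.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round addr up to the next multiple of align (a power of two). */
static uintptr_t align_up(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + (align - 1)) & ~(uintptr_t)(align - 1);
}

/* Hypothetical segment carving: the caller reserved at least
   size + align - 1 bytes starting at raw; return the aligned start. */
static void *aligned_start(void *raw, size_t align)
{
    return (void *)align_up((uintptr_t)raw, align);
}

/* Private-memory variant. posix_memalign requires align to be a power
   of two and a multiple of sizeof(void *). Returns NULL on failure. */
static void *alloc_aligned(size_t size, size_t align)
{
    void *p = NULL;
    if (posix_memalign(&p, align, size) != 0)
        return NULL;
    return p;
}
```

The over-allocation variant wastes at most `align - 1` bytes per process, which is the "additional space" the comment refers to.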

The description of the default alignment matches the description taken from malloc and is intended to require MPI implementations to make sure that any type supported by MPI can be accessed using load/store operations on the memory allocated through MPI. Currently this requires a 16B alignment (for long double), i.e., the alignment provided by malloc.

With these changes the user will a) be able to rely on proper memory alignment (which is not the case currently), and b) be able to control memory alignment in windows, which can be beneficial for vectorized computation on that memory. Doing the latter manually would require additional management of offset tables by the user.

jeffhammond commented 5 years ago

@devreal Do you have access to https://github.com/mpi-forum/mpi-standard? If so, then you can fork that and create a pull request for your proposed changes. If you do not have access or have other reasons for not wanting to create such a pull request, then somebody else should be able to do it for you (at a cost of latency).

devreal commented 5 years ago

@jeffhammond Thanks for the quick reply. I have access to the MPI-Forum github repo and started writing into a fork. Should I create an issue at https://github.com/mpi-forum/mpi-issues/issues/ first that can be linked in the PR (and changelog) or is a PR sufficient? So far this topic only seems to have come up in the RMA WG.

hjelmn commented 5 years ago

@devreal I think this is a solid idea but we may need to think about the naming of the info key as this should be an assertion not a hint.

hjelmn commented 5 years ago

See comm info keys in the draft standard. Ex: mpi_assert_no_any_tag.

devreal commented 5 years ago

@hjelmn Interesting, I was not aware of these new keys. I'm not sure I fully understand the difference between assertions (as used for communicators) and hints (as used in the RMA chapter) though. I would assume that a key like no_locks is an assertion by the programmer (that no locks will be used).

The assertions for communicators all sound similar to no_locks while I think of the memalign (mem_align?) key to be closer to the cb_block_size hint (stating a request to the implementation rather than an assertion made by the programmer).

hjelmn commented 5 years ago

In this case I see these as assertions to the implementation. If it can not fulfill the request it should just return an error (or abort depending on the error handler). I think cb_block_size to me is more of a hint that can be used to tweak the implementation. I could be wrong though.

The idea of info assertions is fairly new. We will need to iterate on this to figure out what the correct naming for the info key is. We have some time as this missed the two-week deadline for the Dec meeting so will be discussed in March.

devreal commented 5 years ago

Ahh I see, so the meaning of assertion is different than what I thought. Thanks for the clarification. There is no need to rush then and some discussion during one of the next phone calls is probably a good idea. I won't file the PR before we have settled on a good name then.

pavanbalaji commented 5 years ago

Actually, I don't think this is an assert. @devreal's original interpretation was what my understanding of assert hints is too. So no_locks would be an assert (i.e., the user is promising not to do something). cb_block_size would be a hint (i.e., the user is simply requesting the implementation to do something, but not promising anything).

hjelmn commented 5 years ago

@pavanbalaji In this case the user wants to enforce a specific alignment. If it can't be done it should produce an error. Shouldn't we have a special naming for these since it really is not a hint?

jeffhammond commented 5 years ago

I want both a hint and an assert version of this, by the way.

hjelmn commented 5 years ago

@jeffhammond That's certainly possible. Maybe mpi_mem_alignment and mpi_mem_require_alignment?

jeffhammond commented 5 years ago

I want to follow the existing naming convention:

mpi_assert_alignment
mpi_alignment

hjelmn commented 5 years ago

@jeffhammond Sounds reasonable to me.

pavanbalaji commented 5 years ago

@hjelmn and @jeffhammond

I don't think asserts are defined that way. Let me clarify a few things:

We don't require the MPI implementation to throw errors in any case. For example, if an MPI implementation does not understand your key, it can simply ignore it, which means there's not going to be an error associated with it. We can add a recommendation to the MPI implementation that if it is not able to honor this alignment, it should throw a noncatastrophic error. That's just a recommendation. No action is required by the MPI implementation.
Asserts are only promises from the user. "Give me an allocation with this alignment" is not a promise, it's a request.

With that explanation, I don't think the two info keys are any different.

devreal commented 5 years ago

As @pavanbalaji has pointed out, it is not possible to have the two different info keys with the current state of affairs so we are left with the best-effort user request. It is left to the user to check the provided alignment if more than natural alignment is required and error out or even retry with a smaller alignment if feasible. However, I am not sure why implementations would impose an upper limit on the supported alignment (which would be the only use-case I see for having two keys).
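Since the request is best-effort, the user-side check described above could look roughly like the following C sketch. `provided_alignment` is a hypothetical helper, not an MPI function; it reports the largest power of two (up to a limit) that divides the address returned by the allocation.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Largest power of two, capped at limit, that divides the address of p.
   limit is expected to be a power of two. */
static size_t provided_alignment(const void *p, size_t limit)
{
    uintptr_t addr = (uintptr_t)p;
    size_t align = 1;
    assert(limit != 0 && (limit & (limit - 1)) == 0);
    while (align < limit && (addr & align) == 0)
        align <<= 1;
    return align;
}
```

After an allocation with the alignment request, an application would compare `provided_alignment(ptr, requested)` against `requested` and either raise an error or fall back to code that tolerates the smaller alignment.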

What is the way forward from here? Can we find a consensus on a name? @hjelmn proposed mpi_alignment, which is fine with me. It is a bit less specific than [mpi_]memalign or [mpi_]memory_alignment (which seems unnecessarily long). Is there a chance that the term alignment will be used in the same or different context in MPI that might lead to a collision?

devreal commented 4 years ago

@jdinan The window buffer alignment requirements will be in MPI 4. If I understand your initial post correctly, you are also looking at whether the target address of RMA operations should be naturally aligned. AFAICS, there is no language in the standard mandating alignment of either the origin or the target address of RMA operations. There is some language in an AtoU stating that "the alignment of the communication buffers may also impact performance" (page 419 of MPI 3.1) but that of course is not normative.

Should a mandate for natural alignment of origin and target addresses be added as part of the cleanup 4.1 release? It seems like a non-backwards-compatible change though. I guess implementations can take a slow path if either the origin address or the target address is misaligned (allocating a temporary aligned origin buffer or avoiding RDMA operations for misaligned target addresses).

jdinan commented 4 years ago

AMOs in particular may not work if they aren't naturally aligned. I think the risk of breaking existing applications with this clarification is relatively low.

devreal commented 4 years ago

Could this be treated as an erratum? (I'm not sure about what can be treated as an erratum and what not) Otherwise this will have to be moved to MPI 4.1...

jdinan commented 4 years ago

I would consider this to be an erratum to MPI 3.0, but I'm not sure whether others will feel the same.

devreal commented 4 years ago

I can bring this up at the phone call on Wednesday. Is there a consensus within the WG to add language that requires a) origin buffers and/or b) target memory addresses to be naturally aligned? As mentioned above, implementations could fall back to a slow path to deal with misaligned origin/target addresses. So this change would mainly remove some complexity from implementations.

jdinan commented 4 years ago

I don't think the RMA WG has convened in a while. Perhaps @jeffhammond @pavanbalaji @wgropp @rsth can share their thoughts on requiring origin/target buffers to be naturally aligned for the datatype. This should be a requirement for accumulate operations, and I would consider that part to be an erratum. If buffers in accumulate operations are allowed to cross cache lines, then many implementations will not be able to use hardware (including processor) atomics. For put/get it seems more like a performance advice to users (which we sort of already have in window allocation).

wgropp commented 4 years ago

I'm fine with requiring aligned data, as long as the behavior with unaligned data is undefined, not erroneous. That allows implementations to permit operations on unaligned data if they can, but standard-conforming, portable programs need to ensure alignment.

pavanbalaji commented 4 years ago

> I don't think the RMA WG has convened in a while. Perhaps @jeffhammond @pavanbalaji @wgropp @rsth can share their thoughts on requiring origin/target buffers to be naturally aligned for the datatype. This should be a requirement for accumulate operations, and I would consider that part to be an erratum. If buffers in accumulate operations are allowed to cross cache lines, then many implementations will not be able to use hardware (including processor) atomics. For put/get it seems more like a performance advice to users (which we sort of already have in window allocation).

@jdinan

I think the alignment requirements being proposed are a bit overspecified. Most (all?) networks only require the part of the data that is atomic to be aligned. For example, only the target buffer in an accumulate operation.

For MPI_Win_allocate, the base address can of course be naturally aligned by the MPI library. Is the concern that the user might give arbitrary offsets that break the natural alignment?

I do see the concern for MPI_Win_create, where buffers are completely specified by the user.

jdinan commented 4 years ago

@pavanbalaji Some processors require alignment for all operands. Of course, the implementation can check for this and copy operands to aligned temporary buffers. But, this feels to me like we would be taking on overhead to support an uncommon usage model. As Bill suggested, we can make unaligned usage undefined, which puts the burden of portability on the application rather than the MPI implementation.

pavanbalaji commented 4 years ago

> @pavanbalaji Some processors require alignment for all operands. Of course, the implementation can check for this and copy operands to aligned temporary buffers. But, this feels to me like we would be taking on overhead to support an uncommon usage model. As Bill suggested, we can make unaligned usage undefined, which puts the burden of portability on the application rather than the MPI implementation.

True, for processor atomics (I was thinking of network atomics in my previous message).

I want to point out that this would be a backward incompatible change. That does not mean that we cannot do it. It just means that any decision about this should consider that fact. We could argue that "we always meant that; we just didn't clearly specify it", but that argument is somewhat thin in this context.

If we want to make this change, why is this only for RMA operations? This would certainly be true for reduction collectives too, which have similar semantics (e.g., two processes on shared memory could reduce into the same buffer). But even more broadly, some MPI implementations (e.g., MPICH) "assume" natural alignment while packing/unpacking noncontiguous datatypes in some cases. For instance, in some cases, we use assignment operations instead of memcpy for performance reasons. This assumption would break on platforms that require strict alignment, such as Sparc, that throws a SIGBUS error when the data is not naturally aligned; but would be fine on platforms such as x86, where the architecture is more forgiving.
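The difference between the two copy strategies mentioned above can be sketched in C as follows. This is an illustration only, not MPICH code, and the function names are hypothetical. The `memcpy` form is well-defined for any address, while the assignment form relies on the pointer being suitably aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load a 32-bit value from an arbitrary, possibly misaligned address.
   memcpy is well-defined regardless of the alignment of src. */
static uint32_t load_u32_any(const void *src)
{
    uint32_t v;
    memcpy(&v, src, sizeof v);
    return v;
}

/* Load by direct dereference. This is only valid if src is suitably
   aligned for uint32_t; otherwise the behavior is undefined in C and
   may trap on strict-alignment platforms. */
static uint32_t load_u32_aligned(const uint32_t *src)
{
    assert(((uintptr_t)src % _Alignof(uint32_t)) == 0);
    return *src;
}
```

Compilers typically turn the `memcpy` form into a single load on architectures that tolerate unaligned access, so the cost of the safe variant depends on the platform.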

So, perhaps this is required for many other MPI operations?

devreal commented 4 years ago

I think it is reasonable to allow MPI implementations to expect memory specified by the user (either directly through pointers or indirectly through RMA target offsets) to be naturally aligned for the provided datatype argument. At least in C (and I'm sure in Fortran as well), directly loading or storing an object from a misaligned address is undefined behavior so by extension the user should never pass a misaligned address to MPI. We would simply pass on the rules the currently supported base languages impose.

Given that this would affect a large part of the standard it might be hard to convince the Forum to accept it at the last minute though. Maybe this will have to be punted to 4.1...

jdinan commented 4 years ago

@devreal Good points on the base language requirements. This is an important issue and I would propose it for 4.0.

devreal commented 4 years ago

A quick summary from the discussion today:

1) It's likely not an erratum for 4.0. It doesn't fix something that is "wrong" in 3.1 but clarifies something that has been unspecified since the beginning of MPI. That means it has to be a full proposal for 4.1.
2) The argument has been raised that MPI implementations should deal with unaligned memory by default to cope with such edge cases. However, a counter-argument was that it may negatively impact performance for all users. A potential info key was brought up to allow users to guarantee proper alignment of buffers but that would mean extra work for both users and implementors to get back to the status quo.
3) There was significant concern over whether this would break existing applications. For example, there may be cases where applications try to send 2 consecutive 4B integers as a single 8B value (for whatever reason), for which the outcome would be undefined after this change if they happen to be 4B aligned.
4) As a consequence: What does it mean to specify an MPI datatype T for a memory location provided to an MPI procedure? Is it a) here is an object of type T; or b) here is some memory, please interpret the bit pattern as an object of type T. The second option would mean that unaligned cases are correct and would put the burden on the MPI implementation to correctly handle them.
5) The term alignment should be carefully specified to not be overly restrictive (i.e., not in terms of the size of the datatype but in terms of allowing load/store operations with the specified type in case some weird architecture allows for 1B-alignment of all types).
6) An AtoU should be added stating that programs conforming to the base language requirements also fulfill MPI's alignment requirements.
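The concern in item 3 can be made concrete with a small C sketch. The `struct record` layout is a hypothetical example: two consecutive 4-byte integers that follow a 4-byte tag are 4-byte aligned, but not 8-byte aligned, so treating them as one 8-byte element would fall outside an alignment rule based on the 8-byte type.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message layout: a 4-byte tag followed by two 4-byte values. */
struct record {
    int32_t tag;
    int32_t pair[2];
};

/* True if the address of p is a multiple of align (align must be nonzero). */
static bool is_multiple_of(const void *p, size_t align)
{
    assert(align != 0);
    return ((uintptr_t)p % align) == 0;
}
```

If the record itself starts on an 8-byte boundary, `pair` sits at offset 4, so `is_multiple_of(r.pair, 4)` holds while `is_multiple_of(r.pair, 8)` does not.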

I have some preliminary text that I cobbled together today, which was the basis for the discussion:

2.5.7 Alignment Requirements
For choice or address arguments, if an accompanying MPI datatype describing the data
to be accessed is provided then the value provided for the choice or address argument
should be aligned to at least the alignment required for load/store accesses using a datatype
corresponding to the MPI datatype. Any choice argument for which no MPI datatype is
provided should be aligned for load/store accesses using the expected datatype described
in the corresponding procedure definition.

One argument was that there is no corresponding datatype for derived datatypes in MPI. The definition thus should be recursive starting from the predefined datatypes. I will try to come up with something along those lines.

jdinan commented 4 years ago

You could say something like "... then the address corresponding to each basic datatype element in the provided MPI datatype must be naturally aligned for that basic datatype. For example, on an architecture with byte-addressable memory, a naturally aligned address for an object with basic datatype D must be an integer multiple of the number of bytes in D."

jdinan commented 4 years ago

I don't understand item 5. The x86 is one example of a "weird" architecture that allows unaligned load/store. It is still undefined behavior in the C language specification, even if it works on that architecture. We are proposing to do the same in MPI, make it undefined in MPI but allow implementations to make it work.

devreal commented 4 years ago

Here is another shot at it:

The effective memory address of any object of a type corresponding to a predefined MPI
datatype (with the exception of pair-types defined in Section 5.9.4) determined by a
user-provided MPI datatype and relative or absolute address (including choice arguments)
provided to an MPI procedure must be naturally aligned, i.e., the effective address must be a
multiple of the type’s size in memory. Any choice argument for which no MPI datatype
is provided must be naturally aligned for the expected datatype described in the
corresponding procedure definition. The outcome of operations resulting in the use of effective
addresses that are not naturally aligned is undefined.

Advice to users. Portable applications conforming to either the ISO C or Fortran standard
implicitly adhere to these alignment requirements, e.g., by avoiding unsafe pointer
arithmetic that would result in addresses that are not naturally aligned. (End
of advice to users.)

I'm a bit worried that the first sentence is too complex but right now I cannot think of a way to break it up without losing essential information.

pavanbalaji commented 4 years ago

> the effective address must be a multiple of the type’s size in memory

I don't think this is correct. On some platforms, long double is 12 bytes, but has a 16-byte alignment requirement. Alignment is not always related to the type size.

devreal commented 4 years ago

You're right, forgot about that (32bit x86 is one such case IIRC). Maybe we shouldn't use the type size after all but require proper alignment for load/store? Or simply to the requirements of the base language?

jeffhammond commented 4 years ago

Nobody will go along with it, but I'd be fine with a minimum RMA alignment of 128 bytes, which is the PowerPC cache line size, because all reasonable use cases are fine with that.

jdinan commented 4 years ago

I think we are rediscovering why people say "naturally aligned" without trying to define it. :neckbeard:

devreal commented 4 years ago

@jeffhammond I'm afraid you're right on that :D it would definitely break any backwards compatibility and be a waste on the ubiquitous embedded systems...

@jdinan I believe "type size rounded up to the next power-of-two" should be OK.

pavanbalaji commented 4 years ago

> @jdinan I believe "type size rounded up to the next power-of-two" should be OK.

@devreal I don't think so. Following my same example as earlier, some platforms require only an 8-byte alignment for long double. Furthermore, this can be different from one language to another. I agree with @jdinan that it's unnecessary to define what natural alignment is, and would likely result in erroneous specification. We should simply state that users should follow the alignment requirements specified by the language whose bindings they are using MPI with.
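For the C bindings, "the alignment requirements specified by the language" can be queried directly rather than derived from type sizes. A minimal sketch, assuming a C11 compiler (the `REPORT` macro and `report_alignments` are illustrative names); the printed values are implementation-defined and differ between platforms.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>

/* max_align_t is at least as strictly aligned as every scalar type. */
static_assert(alignof(max_align_t) >= alignof(long double),
              "max_align_t covers long double");

/* Print size and alignment as reported by this C implementation. */
#define REPORT(T) \
    printf("%-12s size=%2zu align=%2zu\n", #T, sizeof(T), alignof(T))

static void report_alignments(void)
{
    REPORT(int);
    REPORT(double);
    REPORT(long double);
    REPORT(max_align_t);
}
```

Comparing the output across compilers and ABIs shows why the thread moved away from defining alignment in terms of type size.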

devreal commented 4 years ago

> We should simply state that users should follow the alignment requirements specified by the language whose bindings they are using MPI with.

That's fine with me. We would need to limit to the officially supported language bindings though, right? Otherwise a hypothetical language with 4B alignment for any type would leave implementations written in C stranded...

pavanbalaji commented 4 years ago

> > We should simply state that users should follow the alignment requirements specified by the language whose bindings they are using MPI with.

> That's fine with me. We would need to limit to the officially supported language bindings though, right? Otherwise a hypothetical language with 4B alignment for any type would leave implementations written in C stranded...

It would be the responsibility of each language bindings to do the right thing for that language. For example, if Fortran had different alignment requirements than C, then it would need to add additional code to either make sure the buffer is "C aligned" or copy data to match the alignment as needed before calling the C versions of those functions. The users would not have to worry about any of this. They would simply use the alignment requirements of the language they are using.

jeffhammond commented 4 years ago

@devreal Requiring greater alignment does not break backwards compatibility of MPI libraries, because no code can be broken by that. It's true that code written to assume greater alignment won't work with older MPI libraries, but so also will my MPI-3 RMA code not work with LAMMPI.

jeffhammond commented 4 years ago

RMA needs to support the natural alignment of MPI_LONG_DOUBLE_COMPLEX, which can be as much as 32B. This is only a factor of 2 away from an x86 cache line, but I don't want to be parochial, hence my request for 128B.

jdinan commented 4 years ago

@jeffhammond Are you suggesting that every MPI buffer be aligned on a 128B boundary? How will you send/recv/rma elements in an int (4B) or double (8B) array?

jeffhammond commented 4 years ago

I am suggesting that every buffer returned by MPI_Win_allocate(_shared) be aligned to 128B.

I have made no statement on the alignment requirements for any communication operations. I am certainly not saying that MPI_Put requires a 128B-aligned input.

Obviously, more than 32B is overkill for correctness but cache-line alignment has a very positive impact on performance. If I allocate two 8B windows and MPI allocates them consecutively in memory, and I then bang on them with MPI_Fetch_and_op, the performance will be crap in almost all multiprocessing environments based on cache-coherent processors (I get that these aren't your thing anymore :-P).
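The padding idea behind this argument can be sketched without MPI. The following C example assumes a 128-byte line size (the figure used in this thread, not a value queried from hardware) and a hypothetical helper `alloc_line_padded`; it shows how two small allocations can be kept on separate lines so that atomic updates to one do not contend with the other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ASSUMED_LINE 128 /* assumption taken from the thread, not detected */

static_assert((ASSUMED_LINE & (ASSUMED_LINE - 1)) == 0,
              "line size must be a power of two");

/* Allocate size bytes rounded up to whole lines, starting on a line
   boundary. aligned_alloc (C11) requires the size to be a multiple of
   the alignment. Returns NULL on failure. */
static void *alloc_line_padded(size_t size)
{
    size_t padded = (size + ASSUMED_LINE - 1) / ASSUMED_LINE * ASSUMED_LINE;
    if (padded == 0)
        padded = ASSUMED_LINE;
    return aligned_alloc(ASSUMED_LINE, padded);
}
```

The cost is memory: an 8-byte counter occupies a full 128 bytes, which is the trade-off raised against making this a mandatory minimum.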

jdinan commented 4 years ago

I think we had been talking here about the alignment requirements of buffers passed to MPI routines.

It does look like we also failed to specify alignment requirements for MPI_Win_allocate(_shared). I think the right thing to do here is to copy malloc -- "The allocated memory is aligned such that it can be used for any +predefined MPI+ data type." -- and allow (but not require) implementations to further pad alignments for the reasons you mentioned.

jeffhammond commented 4 years ago

That works for me. I confess to not reading the entire thread in spite of losing my eidetic memory for MPI Forum discussions.

devreal commented 4 years ago

> It does look like we also failed to specify alignment requirements for MPI_Win_allocate(_shared). I think the right thing to do here is to copy malloc -- "The allocated memory is aligned such that it can be used for any +predefined MPI+ data type." -- and allow (but not require) implementations to further pad alignments for the reasons you mentioned.

This has been added to MPI 4 as part of https://github.com/mpi-forum/mpi-issues/issues/121. The language is "at least the alignment required for load/store accesses of any datatype corresponding to a predefined MPI datatype." That is specified for MPI_Alloc_mem and referenced for MPI_Win_allocate(_shared).

devreal commented 4 years ago

Looking at the definitions in the C/C++ standard drafts, the alignment requirements are indeed implementation-defined. So yes, natural alignment is off the table. Here is a definition in terms of the language implementation used to call the MPI procedure:

The effective memory address of any object of fundamental type in local memory provided to an MPI procedure for which the address is determined by a user-provided MPI datatype and absolute address must meet the alignment requirements of the corresponding type in the implementation of the language from which the MPI procedure is called. Any choice argument for which no MPI datatype is provided must meet the alignment requirement of the expected datatype described in the corresponding procedure definition. The outcome of operations resulting in the use of effective addresses that do not meet the alignment requirements of the language implementation is undefined.

Advice to users. Portable applications conforming to either the ISO C or Fortran standard implicitly adhere to these alignment requirements, e.g., by avoiding unsafe pointer conversion. (End of advice to users.)

This is now explicitly limited to objects in local memory because the situation for RMA seems a bit trickier. Consider a heterogeneous system where the target has different alignment requirements from those of the origin (e.g., different architecture, different language bindings). At the target, the application may store objects in window memory with less strict alignment. It would be impossible for the origin to provide a target offset for which the operation is guaranteed to be well-defined according to the rules above. The RMA chapter needs an additional sentence such as the following:

The effective target address of any object of fundamental type for which the address is determined through a user-provided MPI datatype and target displacement must meet the requirement of the corresponding type in the implementation of the language through which the window was created at the target.

Here is another caveat: it might be that third-party bindings for languages not officially supported by the MPI implementation that have less strict alignment requirements may not be able to use MPI RMA because they cannot meet these requirements (e.g., the binding for language X can use temporary buffers to pass data aligned for use with the C language implementation to MPI_Send or MPI_Put but it may not be able to control the placement of objects in local window memory through language-level constructs). That may be a hypothetical though and such bindings could use MPI_BYTE for put/get and emulate accumulates using locks.

mpiwg-rma / rma-issues

Clarify alignment requirements #3