Counting semaphore variant of lock API

nspark commented 2 years ago

The current locking API provides a fair-queued lock across the PEs, allowing one PE to hold the lock at a time. There is user interest in supporting a variant that supports N PEs to hold the "lock" at a time, as with a counting semaphore. Similar fair-queuing, FIFO behavior is strongly desirable.

One significant use-case is for parallel file I/O. With large PE counts, user applications can often see better performance for parallel writers if only some of the PEs write at a time. Such an API would provide a fair mechanism for PEs to enqueue as writers (as with shmem_*_lock), while also supporting multiple, concurrent writers (a new capability).

nspark commented 2 years ago

One can provide such a lock using the existing lock routines and AMOs, roughly as follows:

static long lock = 0;
static long active_counter = 0;
const int root_pe = shmem_n_pes() / 2;

// Acquire the counting semaphore
// -- (Wait to) acquire the lock
shmem_set_lock(&lock);
// -- Bump the active counter
long num_active = shmem_atomic_fetch_inc(&active_counter, root_pe);
// -- If (one) too many are active, wait-loop
unsigned int dt_us = 10;
while (num_active >= max_active) {
  usleep(dt_us);
  dt_us = (dt_us > 25000u ? 50000u : (2 * dt_us));
  num_active = shmem_atomic_fetch(&active_counter, root_pe);
}
// -- Release the lock
shmem_clear_lock(&lock);

do {
  something_useful();
} while (!done);

// Release the counting semaphore
shmem_atomic_add(&active_counter, -1, root_pe);

There is some potential for optimization here, but hopefully this gives a good sketch of how this can be done. The shmem_*_lock API provides the FIFO queueing for the semaphore's counter, while the counter itself allows more than one PE to access the guarded section.

jdinan commented 2 years ago

Would the lock type of the proposed API also be long or something else? IIUC, the above implementation would require more space than we have in a single long.

tonycurtis commented 2 years ago

On Feb 13, 2022, at 11:46 AM, James Dinan @.***> wrote:

Would the lock type of the proposed API also be long or something else? IIUC, the above implementation would require more space than we have in a single long.

Could we do this by using a long* lock variable as a hash key to a private (bigger) structure?

Tony

jdinan commented 2 years ago

I think we would need a point where we could symmetrically allocate memory for the lock. Locks are global objects that aren't tied to a team, so this could be tricky. We do have extra space in the MCS lock algorithm that we used in SOS. We assume that sizeof(long) == 2*sizeof(int). Both ints are used on PE 0 and there is one int free on every other process. So you could place the counter on another PE (leaving single PE jobs as an exercise to the reader). That gives you a 32-bit counter if we assume LP64 ABI. Hopefully this is enough space for the counter; we do need a way to describe the range of the counter to the user, especially if it could be less than the maxval of the function argument used to pass it.

However, if we're going to introduce new API, I would really prefer to fix this design issue with the legacy lock API.

nspark commented 2 years ago

However, if we're going to introduce new API, I would really prefer to fix this design issue with the legacy lock API.

Do you mean you'd rather the lock routines use an opaque type?

nspark commented 2 years ago

Some working group comments:

Generally favor opaque type; prevents abuse of assumptions on types/ABIs
Good to have a static initializer
Maybe collapse "traditional" locks together with "counting" locks
Consider support for locks from multiple threads in a PE

openshmem-org / specification

Counting semaphore variant of lock API #485