
Polymorphic interfaces for resource creation and usage #377


gparmer commented 6 years ago

This is a work in progress, and mainly meant to start a thought process.

The low-level APIs of the system should be polymorphic so that they can be implemented in one of N ways, selected by the context (see #353). Thus, we need a set of APIs that work in each of the contexts. This issue is meant to start the discussion about which resources we're talking about, which contexts they should span, and the shape of those contexts' APIs.

The resources include at least:

Implementations

We'd like to support at least a few different implementations of each of these including:

This is a challenge as the API must be equally applicable to scenarios that include multiple components as well as library-based solutions. Correspondingly, the main challenges involve namespacing, and polymorphic representation.

Threads

Straightforward thread manipulation functions:

typedef word_t thd_t;
#define THD_NULL 0
thd_t thd_create(fn_t f, void *data);
int thd_exit(void);
typedef word_t thdid_t;
thdid_t thd_id(void);
int thd_param_set(thd_t t, thd_param_t p); // use existing functions to configure sched params
int thd_periodic_wait(void);  // relies on a periodicity being set via `thd_param_set`
int thd_sleep(microsecond_t how_long);
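
To make the intended usage concrete, here's a minimal sketch of a periodic task built on this API. It assumes a periodicity has already been attached via thd_param_set, and that fn_t takes a single void * argument (both are assumptions at this point):

void periodic_task(void *data)
{
    while (1) {
        /* do one activation's worth of work... */
        thd_periodic_wait(); /* block until the start of the next period */
    }
}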

A subtle problem with this API is that it both creates the thread, and executes it in a single function. This might be OK in many cases, but in others, it is not appropriate. For example, perhaps we don't want to run the thread until its parameters are set. Perhaps we want to create a clone of a thread for a checkpoint, and don't want to run it yet (see the component API below).

thd_t thd_create_wait(fn_t f, void *data);
// used in conjunction with thd_wakeup below
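
A hedged sketch of how this might be used: create the thread without running it, configure it, then start it. The construction of the thd_param_t, and the assumption that a thread's blockpoint id can be derived from its thd_t, are open questions rather than settled API:

void start_worker(fn_t worker_fn, thd_param_t param)
{
    thd_t t = thd_create_wait(worker_fn, NULL);

    thd_param_set(t, param);      /* configure scheduling before first execution */
    thd_wakeup((blkpt_id_t)t, 0); /* assumed: a thread's blockpoint id == its thd_t */
}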

Synchronization

A lower-level API for coordinating synchronization between threads is blockpoints. These simply enable threads to block waiting on a specific blockpoint, and to wake up threads that have previously blocked (either one, or all).

typedef word_t blkpt_id_t;
typedef word_t blkpt_epoch_t;
int thd_block(blkpt_id_t id, blkpt_epoch_t e, thdid_t dep);
int thd_wakeup(blkpt_id_t id, blkpt_epoch_t e);

I believe this API enables at least three separate libraries for locking and synchronization. First, a blockpoint abstraction that goes with the low-level support and abstracts away the epoch.

// blockpoints: an abstraction for sleeping and waking
struct blkpt {
    blkpt_id_t id;
    blkpt_epoch_t epoch_contention; // track the epoch, and a bit for contention
};
int blkpt_init(struct blkpt *blkpt); // always embedded in other abstractions
void blkpt_wakeup(struct blkpt *blkpt);
void blkpt_block(struct blkpt *blkpt);
int blkpt_teardown(struct blkpt *blkpt);
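
A minimal sketch of how the blockpoint library might wrap the low-level calls. It assumes the epoch lives in the high bits of epoch_contention with a contention flag in the low bit, and that thd_block returns without blocking if the epoch has since advanced; real code would need atomic updates:

#define BLKPT_CONTENDED 1

void blkpt_block(struct blkpt *blkpt)
{
    /* checkpoint the epoch before blocking; if a wakeup races with us
     * and bumps the epoch, thd_block fails harmlessly (no lost wakeup) */
    blkpt_epoch_t e = blkpt->epoch_contention >> 1;

    blkpt->epoch_contention |= BLKPT_CONTENDED; /* someone may now be blocked */
    thd_block(blkpt->id, e, 0);
}

void blkpt_wakeup(struct blkpt *blkpt)
{
    blkpt_epoch_t e = blkpt->epoch_contention >> 1;

    if (!(blkpt->epoch_contention & BLKPT_CONTENDED)) return;
    blkpt->epoch_contention = (e + 1) << 1; /* advance epoch, clear contention */
    thd_wakeup(blkpt->id, e);
}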

Of course, locks. We'll likely need to rethink these and make additions in light of multicore.

struct lock {
    thdid_t holder;  // lock holder thread id
    struct blkpt blockpoint;
};
struct lock *lock_create();
int lock_delete(struct lock *l);
int lock_init(struct lock *l); // don't allocate memory
void lock_teardown(struct lock *l); // don't deallocate memory

void lock_take(struct lock *l);
void lock_release(struct lock *l);
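
As a sketch of how the pieces compose, here is a single-core lock_take/lock_release over the low-level blockpoint calls. The CAS needed for multicore is elided, and passing the holder as the dep argument (for, e.g., priority inheritance) is an assumption about thd_block's semantics:

void lock_take(struct lock *l)
{
    while (1) {
        /* checkpoint the epoch *before* testing the lock, so a release
         * racing with us advances the epoch and aborts our block */
        blkpt_epoch_t e = l->blockpoint.epoch_contention >> 1;

        if (l->holder == 0) { /* would be a CAS on multicore */
            l->holder = thd_id();
            return;
        }
        l->blockpoint.epoch_contention |= 1; /* mark contention */
        thd_block(l->blockpoint.id, e, l->holder);
    }
}

void lock_release(struct lock *l)
{
    l->holder = 0;
    blkpt_wakeup(&l->blockpoint);
}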

Condition variables, though not that desirable, are common. For backwards compatibility (should be updated to mimic POSIX):

struct cv;
struct cv *cv_create();
void cv_delete(struct cv *cv);
int cv_init(struct cv *cv);
void cv_teardown(struct cv *cv);

int cv_wait(struct cv *cv, struct lock *l);
int cv_signal(struct cv *cv);
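
A short sketch to pin down the intended semantics (cv_wait atomically releases the lock and blocks, re-taking it before returning; waiters must re-check their condition since wakeups can be spurious). All names here are illustrative:

struct lock qlock; /* initialized elsewhere with lock_init */
struct cv   qcv;   /* initialized elsewhere with cv_init */
int         qlen;

void consume(void)
{
    lock_take(&qlock);
    while (qlen == 0) cv_wait(&qcv, &qlock); /* loop: spurious wakeups */
    qlen--;
    lock_release(&qlock);
}

void produce(void)
{
    lock_take(&qlock);
    qlen++;
    cv_signal(&qcv);
    lock_release(&qlock);
}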

Channels

We want a channel interface to enable controlled asynchronous communication between different components. @phanikishoreg has been using this in Chaos to communicate between subsystems. Each channel has two major aspects:

  1. A shared memory region used as a ring buffer to pass data.
  2. A rcv/asnd pair in each component.

Ideally, they would be uni-directional, and not enable information to pass in the opposite direction. However, strictly uni-directional information flow isn't possible with shared memory due to cache timing channels. Thus, the interface must allow either shared memory, or invocations to another component that mediates data-transfer.

Further, it should allow channels to be created and used within a single component, as well as across multiple components.

typedef word_t chan_t;
chan_t chan_create(unsigned int objsz, unsigned int nobjs, fn_t f, void *data);
chan_t chan_init_snder(unsigned int *objsz);
chan_t chan_init_rcver(fn_t f, void *data, unsigned int *objsz);
int chan_close(chan_t c);

typedef enum { CHAN_NONBLOCK = 1 } chan_flags_t;
int chan_snd(chan_t c, void *obj, chan_flags_t flags);
int chan_rcv(chan_t c, void *obj, chan_flags_t flags);
void *chan_peek(chan_t c, int offset);
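
To make one possible reading of this API concrete: chan_create allocates the channel, while a receiver (perhaps in another component) attaches with chan_init_rcver. The message type, the unused f/data arguments, and the return conventions are all illustrative assumptions:

struct msg { int type; int payload; };

void rcv_side(void)
{
    unsigned int objsz;
    chan_t c = chan_init_rcver(NULL, NULL, &objsz); /* assumed: f/data optional */
    struct msg m;

    while (chan_rcv(c, &m, (chan_flags_t)0) == 0) { /* flags = 0: block when empty */
        /* process m... */
    }
}

void snd_side(void)
{
    struct msg m = { .type = 1, .payload = 42 };
    chan_t c = chan_create(sizeof(struct msg), 16, NULL, NULL);

    if (chan_snd(c, &m, CHAN_NONBLOCK) < 0) {
        /* ring full: retry later, or pass flags = 0 to block */
    }
}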

In addition to communicating binary data, channels can be used to pass system objects, such as other channels, between component end-points. The following example shows an expanded API that isn't very inspired. Additional variants might be used to pass other resources (components, memory, threads, etc...).

// light typing around the channels; are we sending data, or channels?
chan_t chan_chancreate(unsigned int nchan, fn_t f, void *data);
int chan_chansnd(chan_t c, chan_t to_snd, chan_flags_t flags);
int chan_chanrcv(chan_t c, chan_t *to_rcv, chan_flags_t flags);

We'll likely need modifiers on channels to allow synchronous communication with amortization of interaction costs (like call and reply_and_wait in L4), and the ability to decouple the channel's shared memory from the event notification.

Some notes:

Some problems and further work TODO:

Event Notification

The ability to get a notification when any of the channels of interest becomes available (non-full, so we can snd to it, or non-empty, with data to rcv from it). This API centers around evt_wait.

typedef word_t evt_t;
evt_t evt_create(unsigned int max_chan);
int evt_add(evt_t g, chan_t c);
void evt_remove(evt_t g, chan_t c);
void evt_delete(evt_t g);

chan_t evt_wait(evt_t g);
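
A hedged event-loop sketch over this API, reusing the illustrative struct msg from the channel example: wait for any channel in the group to have data, then drain it without blocking so no events are missed:

void evt_loop(chan_t *chans, unsigned int nchans)
{
    evt_t g = evt_create(nchans);
    struct msg m;
    unsigned int i;

    for (i = 0; i < nchans; i++) evt_add(g, chans[i]);
    while (1) {
        chan_t c = evt_wait(g); /* block until some channel has an event */

        /* drain non-blocking; a new event fires if more data arrives */
        while (chan_rcv(c, &m, CHAN_NONBLOCK) == 0) { /* process m... */ }
    }
}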

Concerns and TODO

The main concern about this design is a lack of coherency across abstractions.

Memory Allocation and Sharing

I'm not sure what the API should be here. On the one hand, we have the cbuf API that allows very efficient dynamic memory sharing, and on the other, we have eos which has a static region of shared memory, tracked efficiently with the equivalent of DMA ring buffers. For the former, see the cbuf API.

EOS-like channels. The API for this can either be a slightly modified version of the channel API, or hidden completely behind it, if we add some protection information (who should be able to read which memory).

However, we need to have an API for decoupling the async properties of the default implementation from the shared memory (channel-like) properties. This leads me to believe we want channels that provide synchronous and bi-directional communication, likely with call and reply_and_wait amortizing calls.

General shared memory. Independent of communication, we want the ability to map pages. This might be in response to page-faults, to create components, or simply to alias text segments. The core question is one of access control: which components should be allowed to map their pages into other components? At the lowest level, we, of course, have the resource-table-based nested access control. We want the API to apply in that scenario, but we'd also like it to be able to be backed by invocations to the capmgr.

Device memory. How can we integrate device memory into this system? How can the system management abstractions be portable across x86 and arm?

Notes on the co-design of channels, event notification, memory management, and scheduling

Channels are the API where the entire world of resource management comes to play. They can span from self-contained, simple and fast (spsc, async-only, pre-defined size message passing) to complex, bringing in most system mechanisms and policies (zero-copy, possibly-synchronous, with event notification, without redundant events). Given that we're taking the cost of scheduling to zero, we need to very intentionally plan all of this to maintain end-to-end performance.

This section is here only to start a discussion about this.

I believe that all channels should share a number of properties:

A few different channel implementations that we'd like to accommodate:

Note that some of these are quite composable. The rings in the last can be implemented using the first.

Components

Of course, the management of components is foremost in the system as the isolation properties are a significant portion of our system's value. We'll likely want to expand on this, but to start:

typedef word_t comp_t;
typedef enum { COMP_FLAG_FREEZE = 1 } comp_flags_t;

// API for creating the memory and capability namespaces for a component
comp_t comp_create(elfobj_t obj, comp_flags_t flags, /* dependencies...*/);
comp_t comp_copy(comp_t, comp_flags_t flags);
comp_t comp_restore(comp_t target, comp_t chkpt); // `target` was previously `comp_copy(chkpt)`ed
int comp_delete(comp_t c);

// API for thread execution in the component.
thd_t comp_boot(comp_t c);
thd_t comp_boot_chan(comp_t c, chan_t to_rcv); // draft...many options here
thd_t comp_thd_create(comp_t c, thdclosure_index_t idx); // used only for hierarchical thread mgmt, perhaps shouldn't list it here.
// copy the register state from a thread to another where both components are related.
int comp_thd_restore(comp_t dest, thd_t clone_to, comp_t src, thd_t clone_from);

This should be able to support fork semantics (paired with a thread function to modify thread register state), component creation as we currently define it, and checkpoint-restore as in EOS. It should have implementation-specific options for library-based efficient implementations, and cross-library, process-manager implementations.
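
A hedged sketch of the EOS-style checkpoint-restore flow over this API (dependencies and error handling elided; the pairing with comp_thd_restore for register state is as described above):

void chkpt_example(elfobj_t obj)
{
    comp_t live  = comp_create(obj, 0 /* no flags; deps elided */);
    comp_t chkpt = comp_copy(live, COMP_FLAG_FREEZE); /* frozen snapshot, never runs */
    thd_t  t     = comp_boot(live);

    /* ...live executes, then faults or must be rolled back... */

    comp_restore(live, chkpt); /* roll memory back to the snapshot */
    /* register state would be rolled back with comp_thd_restore */
}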

phanikishoreg commented 5 years ago

Thank you for creating this issue; the information included is great!

I have a few things to talk about, and I'm going to read this again and again and think about whether we need anything else in the abstractions or API.

  1. Threads: I'm confused about the thd_t and thdid_t here. They're both word_t, and it looks like thd_t is only used for param_set. Where would the state data for a thread be kept? I'd think that we'd need some sort of struct thd here to hold the thdid_t, etc. Perhaps I'm missing something; is this abstraction only for resource management, and not involved in scheduling?

I'd think that this API is for both resource management and as a scheduling interface, so I'd modify the API slightly:

thd_t thd_curr(void);
thdid_t thd_id(thd_t);

int thd_yield(void);
  2. Synchronization

Similar confusion here, blkpt_id_t and blkpt_t. I can see the difference in usage and API, but I'm still not clear how these abstractions interplay.

  3. Channels

My main problem with this API is that there is no "shared" key or namespace identifying the two communicating parties, like we have with channelkey_t in the current API. I think we'd need some sort of shared key that is statically laid out so that either party knows whom they're talking to. Without that, these are more like dynamic channels, which is fine if there is a coordinator that tells each party which channel to use. I think dynamic channels and static channels have different use-cases. For example, I'd think that dynamic channels could be used for inter-VM communication, whereas static channels are more useful for peer-to-peer communication where both parties know each other.


I think this is a great start for a discussion, and I'm willing to meet any day to talk about these. We could perhaps set up a meeting for Tuesday (tomorrow?) or sometime this week to discuss this, maybe use the Wednesday (5-6) timeslot and talk?

This is a good discussion to have especially because I think this could form a good foundation to the RTOS layer we were talking about recently.

hungry-foolish commented 5 years ago

The main concern about this design is a lack of coherency across abstractions.

We have two synchronization primitives: blockpoints and rcv capabilities. Can these be unified?

The event notification mechanisms are focused on channels. Can they accommodate locks?

Do we require a pairing between channels and threads? If not, how can rcvs be integrated? If so, then how do we allow events to span multiple channels? Does this require intermediate threads and multiplexing?

How does this all integrate with IPI rate-limiting and polling?

What is the synchronization hierarchy? Should locks be implemented with channels? A few options ("->" is a dependency, and "(a, b)" means that "a" and "b" are parallel -- not dependent on each other):

  1. (scheduler tokens, sched events) -> atomic instructions -> sched locks -> block points -> (locks, channels)
  2. (scheduler tokens, sched events) -> atomic instructions -> (sched locks, block points -> channels -> locks)
  3. (scheduler tokens, sched events) -> atomic instructions -> (sched locks, block points -> (channels, locks))

Extra comments:

  1. Consider implementing thd_create as a combination of thd_create_wait and thd_wakeup: thd_create { thd_create_wait; thd_wakeup; }. Would this be better?

  2. Regarding channels: we now have endpoints, so maybe we can consider implementing the channel and event group with endpoints (rather than implementing them from scratch).

  3. For pairing between threads & endpoints, maybe we can consider some sort of thread mailboxes.

  4. I also have the same confusion as Phani. What are these types for?

  5. What is thd_periodic_wait for?

  6. Decoupling memory allocation from object creation might be a better design in this API. This would simplify the API.

  7. These chan_chancreate functions are confusing... If users want them, let them implement them themselves. The channels can pass anything, and that's it.

  8. There might be real-time issues regarding event notifications.

This is a great issue to discuss, I think, and we can meet to talk about it tomorrow (Tuesday).

gparmer commented 5 years ago

I added some thoughts above about the types of channels we need to support. The hardest part of this is going to be coordination between shared memory allocation and channels.

Inspired by Rust's channel API, I think it would be good to separate the sender's data-structure from the receiver's. I also now see it as inevitable that we'll have to distinguish between channels that are guaranteed-local to a component, and those that are not. Something like:

typedef word_t chan_snd_t;
typedef word_t chan_rcv_t;
typedef word_t chan_t; // polymorphic representation of either of the previous

typedef enum {
    CHAN_FLAG_SND     = 1,
    CHAN_FLAG_RCV     = 2,
    CHAN_FLAG_CHAN    = 4,      // channel to send channels
    CHAN_FLAG_LOCAL   = 8,      // channel only accessed locally within this component
    CHAN_FLAG_MP      = 16      // possibly multiple producers
} chan_flags_t;

int chan_create(unsigned int objsz, unsigned int nobjs, chan_flags_t flags, chan_snd_t *snd, chan_rcv_t *rcv);
chan_snd_t chan_snd_dup(chan_snd_t snd);  // multisender channels
void chan_rcv_close(chan_rcv_t c);
void chan_snd_close(chan_snd_t c);

int chan_snd(chan_snd_t snd, void *obj);
int chan_rcv(chan_rcv_t rcv, void *obj);
// non-blocking variants
int chan_snd_nb(chan_snd_t snd, void *obj);
int chan_rcv_nb(chan_rcv_t rcv, void *obj);

typedef word_t chanchan_snd_t;
typedef word_t chanchan_rcv_t;

// flags determine if send or receive channels are sent through this channel for channels
int chanchan_create(unsigned int nchan, chan_flags_t flags, chanchan_snd_t *snd, chanchan_rcv_t *rcv);
int chanchan_snd(chanchan_snd_t c, chan_t to_snd);
int chanchan_rcv(chanchan_rcv_t c, chan_t *to_rcv);
void chanchan_snd_close(chan_snd_t snd);
void chanchan_rcv_close(chan_rcv_t rcv);

// A version of reinterpret_cast...does type checking and only casts if the proper flags are set.
int chan_to_snd(chan_t c, chan_snd_t *snd); // cast the received channel to a snd (if the types match)
int chan_to_rcv(chan_t c, chan_rcv_t *rcv);
int chan_to_chan_snd(chan_t c, chanchan_snd_t *snd);
int chan_to_chan_rcv(chan_t c, chanchan_rcv_t *rcv);
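
A hedged usage sketch of the split API: create a multi-producer channel, dup a second sender endpoint, and hand it to a peer through a channel-of-channels; the peer casts the polymorphic chan_t back with the checked casts above. Return conventions (0 on success) and the message type are assumed:

void split_chan_example(void)
{
    chan_snd_t snd, snd2, peer_snd;
    chan_rcv_t rcv;
    chanchan_snd_t csnd;
    chanchan_rcv_t crcv;
    chan_t received;
    struct msg m = { 0 }; /* illustrative fixed-size message type, as earlier */

    chan_create(sizeof(struct msg), 16, CHAN_FLAG_MP, &snd, &rcv);
    snd2 = chan_snd_dup(snd); /* second producer endpoint */

    /* a channel that carries send endpoints */
    chanchan_create(4, CHAN_FLAG_SND, &csnd, &crcv);
    chanchan_snd(csnd, (chan_t)snd2);

    /* the peer's side: receive a chan_t, then checked-cast it back */
    chanchan_rcv(crcv, &received);
    if (chan_to_snd(received, &peer_snd) == 0) chan_snd(peer_snd, &m);
}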

Instead of the chanchan namespace, I like chan2 (read as chan^2 = chanchan) more. Thoughts?