sustrik / libdill

Structured concurrency in C
MIT License

[Discussion] Channel priorities and choices #51

raedwulf opened this issue 7 years ago (status: Open)

raedwulf commented 7 years ago

Two options here.

Reproduced from libdill.h for discussion:

#define CHSEND 1
#define CHRECV 2

struct chclause {
    int op;
    int ch;
    void *val;
    size_t len;
    char reserved[64];
};

DILL_EXPORT int channel(size_t itemsz, size_t bufsz);
DILL_EXPORT int chsend(int ch, const void *val, size_t len, int64_t deadline);
DILL_EXPORT int chrecv(int ch, void *val, size_t len, int64_t deadline);
DILL_EXPORT int chdone(int ch);
DILL_EXPORT int choose(struct chclause *clauses, int nclauses,
    int64_t deadline);

Proposed API for message priorities (separate functions):

DILL_EXPORT int pchannel(size_t itemsz, size_t bufsz, int priorities);
DILL_EXPORT int pchsend(int ch, int prio, const void *val, size_t len, int64_t deadline);
DILL_EXPORT int pchrecv(int ch, int *prio, void *val, size_t len, int64_t deadline);
DILL_EXPORT int pchdone(int ch);
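
For illustration, a rough usage sketch of this proposal (hypothetical, since none of these functions exist yet; it also assumes priority 0 is the highest level):

int ch = pchannel(sizeof(int), 0, 3);       /* 3 priority levels */
int val = 42;
pchsend(ch, 0, &val, sizeof(val), -1);      /* send at priority 0 */
int prio;
pchrecv(ch, &prio, &val, sizeof(val), -1);  /* prio comes back as 0 */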

Second proposed API: replace the existing functions. This requires additional parameters, so it breaks the existing API.

#define CHSEND 1
#define CHRECV 2

struct chclause {
    int op;
    int ch;
    int prio; /* CHSEND reads prio from clauses and CHRECV writes to prio in clauses. */
    void *val;
    size_t len;
    char reserved[64];
};

DILL_EXPORT int channel(size_t itemsz, size_t bufsz, int priorities);
DILL_EXPORT int chsend(int ch, int prio, const void *val, size_t len, int64_t deadline);
DILL_EXPORT int chrecv(int ch, int *prio, void *val, size_t len, int64_t deadline);
DILL_EXPORT int chdone(int ch);
DILL_EXPORT int choose(struct chclause *clauses, int nclauses,
    int64_t deadline);
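
Again a hypothetical usage sketch, with prio as an input on send and an output on receive:

int ch = channel(sizeof(int), 0, 3);       /* 3 priority levels */
int val = 42, prio;
chsend(ch, 1, &val, sizeof(val), -1);      /* send at priority 1 */
chrecv(ch, &prio, &val, sizeof(val), -1);  /* prio comes back as 1 */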

Any thoughts?

sustrik commented 7 years ago

An alternative is to have a prioritized choose(), i.e. if there are multiple active clauses, always choose the one that comes first in the array. To make a prioritized queue with three priority levels, create three channels and use prioritized choose to receive messages from them.
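
A rough sketch of that idea, assuming choose() is changed to pick the first ready clause in array order (msg_t is a placeholder type):

int high = channel(sizeof(msg_t), 0);
int medium = channel(sizeof(msg_t), 0);
int low = channel(sizeof(msg_t), 0);
msg_t msg;
struct chclause cls[] = {
    {CHRECV, high, &msg, sizeof(msg)},   /* checked first */
    {CHRECV, medium, &msg, sizeof(msg)},
    {CHRECV, low, &msg, sizeof(msg)}     /* checked last */
};
int rc = choose(cls, 3, -1);  /* index of the clause that fired */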

raedwulf commented 7 years ago

I stepped back a little bit and did some quick reading on CSP (sadly I'm not much of an expert in this area despite being in an office with 3 people who knew CSP!!!).

Anyways, from my understanding, CSP has deterministic and non-deterministic choices. I believe what we're trying to achieve is a deterministic choice, whereas the current semantics of choose are non-deterministic.

So the two approaches suggested were:

  1. Priorities: determinism specified by priorities. This effectively labels the messages with an event (priority), which gives the process the ability to prefer messages over that event (priority). Equal priorities are resolved deterministically on a first-come-first-served basis, but from a CSP perspective this is non-deterministic. The relationship between channels and events is M channels : N events.
  2. Deterministic Choose: Locally specify the preference of events; determinism is resolved by preferring a channel over other channels. The relationship between channels and events is 1 : 1.

So effectively the prioritized choose explicitly defines the choice locally, whereas priority numbers define the choice globally.

They both provide similar capability, but after some thought experiments, I believe each would resolve into a different overall application design.

Priorities

This lends itself to a one-channel-per-coroutine model. A coroutine has a specific channel that contains all the types of messages that the coroutine wants to handle. Design-wise, the thinking should be that the coroutine's specification determines the events/messages/resources it receives and sends over its single channel. This completely removes the need for the choose function.

Deterministic Choose

This lends itself to the multiple-channels-per-coroutine model. A coroutine uses a number of channels that each carry one specific type of message. choose allows priorities to be specified over those channels at runtime.

Quick Compare

Both are equally viable. Using a deterministic choose does give the advantage that channels do not need to be multiplexed into a single channel, making load-balancing simpler to encode. Priorities reduce the API footprint by removing choose, and the functions are individually more efficient (only 2-3 ns slower than the non-priority versions of chsend and chrecv).

Performance

If performance is a concern, it is hard to say whether extracting a message from a channel and multiplexing it for a coroutine will be cheaper or more expensive than using a deterministic choose. As an educated guess it'll be roughly the same, but I suspect I can make the deterministic choose faster overall. One of the major limiting factors of the existing choose is the reserved field, which gets initialised to zero; this is costly when the clauses are declared on the stack. It would be much better to allocate that memory on the stack within the choose function itself and not initialise it.

sustrik commented 7 years ago

Some thoughts in no specific order:

  1. In general, I prefer the one-coroutine-one-input-channel model. The reason is that multiple input channels lead to heavy use of choose, and from there it's only a step towards users doing scheduling by hand. Instead, we should promote leaving scheduling to libdill.
  2. Semantics of choose() were copied from libmill, which in turn doggedly follows the semantics of golang. While golang may have its own reasons for doing stuff that way, they don't necessarily translate to libdill. It's thus fully reasonable to consider moving towards more deterministic semantics for choose.
  3. When thinking about the semantics of channels it's important to take into account that there are two different kinds of scheduling involved. One is using choose() to deal with multiple channels. The other is multiplexing multiple readers/writers onto a single channel. At the moment, the first is non-deterministic while the latter is fully deterministic.

sustrik commented 7 years ago

It would be nice to see some use cases. Most prioritized queue solutions I've seen were dealing with cancellation events (e.g. OOB in TCP). However, we already have a mechanism to cancel coroutines. What are the other use cases?

raedwulf commented 7 years ago

Answering your third point above: I agree with those thoughts.

Having multiple chrecv calls on the same channel from multiple coroutines launched with go effectively covers the non-deterministic use case. In some ways, this even resembles CSP more than the existing choose, as it deals with non-deterministic choices over processes as opposed to events (which messages in channels seem to be the analogue of).
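
For illustration (consumer is a placeholder coroutine):

int ch = channel(sizeof(int), 0);
go(consumer(ch));  /* which consumer receives a given message is */
go(consumer(ch));  /* decided by the scheduler, not the programmer */
go(consumer(ch));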

Hence, we're left with the deterministic choice --- and I had a moment of clarity.

The main use case for deterministic choice is to alter the behaviour of the coroutine (and thus it is OOB). Cancelling/termination is one such alteration of behaviour. I don't deal with other types of events, because those can be trivially multiplexed or alternatively given to another coroutine.

As a side-effect of allowing cancelling/termination, you can achieve any deterministic choice by terminating an existing coroutine and recreating it with the desired behaviour. So in effect, you do not actually need any other form of OOB message.

So what we're actually looking at is whether there are types of OOB messages which occur frequently enough that recreating the process (and all of its logic state) is prohibitively expensive or simply difficult. A quick numbers game, single-threaded, in my fastest configuration on my laptop (my desktop generally gets half these values): chsend and chrecv are 20 ns each, a roundtrip is 45 ns, and recreating a coroutine after termination from chdone is 100 ns when the logic state is negligible.

New benchmarks here: https://github.com/raedwulf/libdill/tree/choose-opt

Some contrived use-cases come to mind:

In terms of order of magnitude, destroying/recreating a coroutine could work, as time constraints are usually much larger than this overhead. I think the main overhead is in both the amount of code and the consideration of how to encode the state so that it can be moved between coroutines. Non-deterministic choose could be used as well, but it is more likely to break soft-realtime constraints.

I think the whole point of priorities is that they deal with real-time constraints. It is difficult to make libdill hard real-time (being cooperatively scheduled), but I think adding priorities gives some credibility to being soft real-time, which I guess is an interesting use case on its own (for games and interactive simulations anyway).

raedwulf commented 7 years ago

An instance where it would be inconvenient to recreate the state of a coroutine is when you have a separate library or function that controls its own state, and that state has been created within the coroutine as opposed to outside of it. Arguably you could create the state outside, or even preserve it across the coroutine hand-off, but that is still probably a lot harder than simply receiving a message that alters the behaviour of the coroutine from within.

I think it boils down to a form of "locality". I create some object in a library, so when the coroutine finishes, I destroy that object. With the cancellation approach, this is no longer true -- though it is not a technical issue, it is more of a design headache.

sustrik commented 7 years ago

What about this approach?

coroutine void worker(int input, int oob) {
    while(1) {
        msg_t msg;
        /* Block until a regular message arrives (deadline -1 = no deadline). */
        chrecv(input, &msg, sizeof(msg), -1);
        process_input(msg);
        /* Poll the OOB channel without blocking (deadline 0). */
        int rc = chrecv(oob, &msg, sizeof(msg), 0);
        if(rc == 0) process_oob(msg);
    }
}

raedwulf commented 7 years ago

Yes, I think that would work, provided that processing does not take so long that an OOB message intended to affect msg arrives too late to alter the output for the receiver.

It's tricky to give any real examples, as I doubt many existing programs use libmill/libdill for soft-realtime purposes yet, so whether there is a use case where it is unacceptable to be delayed by a single message is an open question. Having said that:

coroutine void worker(int input, int oob) {
    while(1) {
        msg_t msg, oobmsg;
        /* Block for the next data message. */
        chrecv(input, &msg, sizeof(msg), -1);
        /* Check for a pending OOB message before processing the data. */
        int rc = chrecv(oob, &oobmsg, sizeof(oobmsg), 0);
        if(rc == 0) process_oob(oobmsg);
        process(msg);
    }
}

That would work too, right?

This relies on the assumption that OOB messages are not particularly useful on their own without the in-band data, so you'd only need to process OOB messages after receiving a message but before processing it.

I think we're onto something :)

sustrik commented 7 years ago

Yes, that's clever.

Conceptually, while the above may look like a cheap trick to avoid using choose(), one should visualize coroutine as a linear sequence of steps (the goal is not to do scheduling by hand, after all). From that perspective it looks pretty reasonable: Get a message, change settings if needed, process message, repeat.

raedwulf commented 7 years ago

This, in effect, emulates a dual-priority system (this is with little analysis but a good gut feeling):

coroutine void worker(int input, int oob) {
    while(1) {
        msg_t msg, oobmsg;
        /* We block on data first; assume that OOB is only necessary before processing. */
        chrecv(input, &msg, sizeof(msg), -1);
        /* Drain all pending OOB messages; this emulates a higher priority level. */
        while(1) {
            int rc = chrecv(oob, &oobmsg, sizeof(oobmsg), 0);
            if(rc != 0) break;
            process_oob(oobmsg);
        }
        process(msg);
    }
}

Whether more than two priorities have use cases that we haven't considered is an open question. I suspect that having only one data band/channel per coroutine (i.e. the typical use case), or multiple homogeneous data bands/channels (i.e. multiplexers), means that we don't need more than one OOB channel.

raedwulf commented 7 years ago

@paulofaria what are your thoughts on this, as you guys are using libmill in practice?

sustrik commented 7 years ago

There's a patch to make choose deterministic in the deterministic-choose branch, in case you want to review it.

The reasoning is as follows: In golang, a clause is chosen at random to ensure fairness among channels and prevent any of them from being starved. While that makes sense if all the channels are of the same type (i.e. they are semantically equivalent), libdill's design is based on the assumption that all semantically equivalent messages are passed via a single channel (possibly with multiple senders). Therefore, if choose is used at all, the channels in question are likely to differ semantically. And if that's the case, prioritization based on type is preferable to picking one type at random.

raedwulf commented 7 years ago

That looks good - I optimised choose slightly by using VLAs as opposed to the reserved field in chclause. This optimises the use case of having few channels and allocating the clauses on the stack. https://github.com/raedwulf/libdill/tree/choose-opt

That branch is based on the non-deterministic choose but should be fairly trivial to adapt, as reserved is now only used in one location - you don't even need to cast.

sustrik commented 7 years ago

Good idea. Done. If you want your micro-benchmarks included in the mainline, send me a pull request.

sustrik commented 7 years ago

While we are dismantling the golang legacy: What's the use case for buffered channels?

Quick web search yields this, but none of the use cases sounds really convincing: http://stackoverflow.com/questions/15113410/when-to-use-a-buffered-channel

raedwulf commented 7 years ago

I can imagine that if an event source chsends in batches while a receiver consumes batches as well, buffers would be useful as an optimisation. However, due to the way libdill works by yielding at the end of every call, and the fact that the scheduler is round-robin, this use case is inefficient.

I would say in this case (and possibly other cases) that the sender and receiver should be manually responsible for buffering on their own as opposed to being automatically buffered. And to deal with batching, each item can be a variable-sized batch... perhaps this would be an argument for variable message sizes up to some maximum message size. Without buffering, this becomes trivially possible.

Buffered channels could be useful if there is some use case where data must be buffered quickly, otherwise it disappears or some other internal buffer overflows. I cannot think of any examples of this happening, however (apart from a keyboard buffer, which is obviously not a libdill use case... yet). With soft real-time constraints probably good enough now, I think the user can implement buffering if they need to.

raedwulf commented 7 years ago

Are there any cases where a buffer would mitigate deadlocks between different channels?

paulofaria commented 7 years ago

I think buffered channels can mask deadlocks, but not prevent them.

raedwulf commented 7 years ago

Yes, that is true - hence the word mitigate :P I think there would be cases where it would solve a deadlock that depends on a number of yields to resolve. This is probably bad practice, as it is not easy to analyse such behaviour. Removing buffering would make such behaviour fail faster :)
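
A contrived sketch of that failure mode: two peers that each send before receiving. With unbuffered channels the deadlock shows up on the first exchange; with one-item buffers both sends would succeed and the problem would stay hidden until the buffers fill.

coroutine void peer(int tx, int rx) {
    int val = 1, got;
    chsend(tx, &val, sizeof(val), -1);  /* blocks until a receiver arrives */
    chrecv(rx, &got, sizeof(got), -1);
}

int a = channel(sizeof(int), 0);  /* bufsz 0: the pair below deadlocks */
int b = channel(sizeof(int), 0);  /* bufsz 1 would mask the deadlock */
go(peer(a, b));
go(peer(b, a));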

The more I think about it, what we should have is large item buffers where a single item can be of any size up to that buffer size. For example:
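
A hypothetical sketch, assuming a channel() variant where itemsz becomes a maximum and senders may pass any len <= itemsz (records is a placeholder array):

int ch = channel(64, 0);                    /* 64-byte max item size */
char block[35];
for(int i = 0; i != 7; i++)
    memcpy(block + i * 5, &records[i], 5);  /* pack 7 items of 5 bytes */
chsend(ch, block, sizeof(block), -1);       /* one complete 35-byte item */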

These are not partial reads (or streaming behaviour), as the item block is complete; the user is actually encoding 7 items of 5 bytes each. I think this would be beneficial for letting the user perform buffering (which would actually work, as opposed to the current behaviour).

sustrik commented 7 years ago

With unbuffered channels, data is copied directly from the sender's buffer to the receiver's buffer. Therefore, the channel needs no storage and no fixed item size. Just saying.

raedwulf commented 7 years ago

Yeah that would work.

I think I'm describing partial reads/writes again (this time in the context of channels), but with the added guarantee that the sender's buffers/message sizes never exceed those of the receiver and that received data is complete.

Ultimately, I don't think what I'm suggesting is necessary - the time spent sending channel messages is unlikely to be the bottleneck in the application so an increase in throughput is unnecessary - unless you use a channel to send/receive large amounts of data. I think latency is more important for channels.

raedwulf commented 7 years ago

Ah okay. Github was not updating the comments for me earlier - yes that makes sense :)

sustrik commented 7 years ago

One more thought experiment with channels.

Consider a single message processor reading messages from a single channel written to by many clients. One would expect the scheduler to be fair. In other words, a client's experience should not be disproportionately affected by the presence of other clients. In yet other words, if you send a message once a second and then a new client arrives that is able to produce a million messages a second, your performance should drop to 1/2, not to 1/1000000.

With unbuffered channels this is trivially the case. The channel is effectively doing round-robin among the clients.

With a 1000-item buffered channel the situation is different: given that the other client sends so many messages, each time you want to send your message you'll encounter a full buffer. Thanks to round-robin you'll get a fair chance to write to it, but there are still 999 items ahead of you in the line! That's bad news for your latency.

The takeaway is that queues introduce some non-trivial behaviour that may not be appropriate for a low-level synchronization primitive like a channel. Maybe it would be better to have unbuffered channels and build queues on top of them?

Other thoughts:

  1. Buffered channel can be used as a semaphore. It probably shouldn't be. If you want a semaphore, use a semaphore.
  2. As already said, buffering often makes deadlocks less frequent (as opposed to eliminating them). That makes reproducing and debugging them super-hard. Let's not go that way.
  3. Channel's item size is a poor man's type system. It's annoying that we cannot have channel<int> in C, but that's the nature of the language. We could go in the opposite direction and make channels transfer arbitrary chunks of data. If we do so, the real challenge is how to deal with buffer overflows (i.e. when the chunk sent is larger than the receiver's buffer).

raedwulf commented 7 years ago

After thinking about this a bit, I too think making channels unbuffered is the way to go.

My thoughts seem to align with 1 and 2. For 3, I'm thinking that a channel could allow data up to some channel max item size. The item length would be sent as part of the buffer, with len <= ch->itemsz. The semantics would be that items must be complete as far as libdill is aware - this is for cases where it is unnecessary to copy the max item size each time and read it...

Buffer overflows would effectively mean that there is an item waiting in the receiver and the sender is blocked waiting on that. This may require a sort of yield-spin (this is what buffering avoids, I guess).

Backtracking slightly, maybe we should forget for the moment that channels can transmit data at all (and forget about arbitrary sizes) and think about channels as a means of transmitting control signals. Data can always be transmitted through different means if we use channels for synchronisation.

Buffered control signals have your stated issues but do reduce the effect of the busy-yield problem (when the receiver is full but you need to synchronise). I'm on my phone at the moment, but it should be possible for the receiver to wake up the sender directly after receiving the message.

EDIT: Maybe I am getting confused in my own thoughts, as there's some ambiguous terminology, which means we're crossing wires quite a lot, so I'll list the types of channel use we have in the next comment.

raedwulf commented 7 years ago

Use cases:

Implementation

Questions:

  1. Do we want channels to be a message queue?
  2. Can message queues be implemented with channels just being a synchronisation mechanism?
  3. Would a combination of blocking on system file descriptors and channels cause a deadlock that would otherwise be solved having a buffered channel?
  4. Leading on from 3. Are there design patterns involving small cycles of resource contention which a buffer larger than the cycle would solve? Is there a way to solve it without using buffered channels? I'm unfamiliar with cooperative synchronisation semantics, so I don't know if this is an existing paradigm or even possible. Perhaps if someone can produce an example or counter-example of this, it would be very useful!

Edit: Go channels seem to face more problems than we do, because Go runs on multiple kernel threads, so thread-safety becomes more difficult for it than it is for us: http://www.jtolds.com/writing/2016/03/go-channels-are-bad-and-you-should-feel-bad/

His solution with mutexes is already implicitly provided by libdill, by the fact that everything is effectively mutexed between yield points in a single-threaded, cooperatively scheduled coroutine.
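
For illustration, anything between two yield points runs atomically with respect to other coroutines:

static int counter = 0;

coroutine void worker(void) {
    counter++;  /* safe: no other coroutine can run before the next yield */
    yield();    /* other coroutines may run from here on */
}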

To add to 3 and 4, I'm not advocating the use of buffered channels, but the fact that they exist in libmill, libdill and Go, as opposed to those libraries using unbuffered channels to begin with, means that we've been using one set of semantics all the time. Removing that set of semantics may cause deadlock problems to crop up, so how can we solve them when the channel is unbuffered (if buffered channels were the solution to begin with)?

raedwulf commented 7 years ago

One thought about implementing your own buffer on top of unbuffered channels: try a chsend with a zero deadline and detect failure, then store the item into a local buffer and continue your loop.
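
A rough sketch of that idea (buf_push and backlog are hypothetical helpers for the local buffer):

int rc = chsend(ch, &item, sizeof(item), 0);  /* deadline 0: fail fast */
if(rc != 0 && errno == ETIMEDOUT)
    buf_push(&backlog, &item, sizeof(item));  /* stash locally, retry later */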

raedwulf commented 7 years ago

Following a discussion with @paulofaria earlier about an example of buffered channels (using Venice, which in turn uses libmill):

https://github.com/Zewo/Venice/blob/master/README.md#10---iterating-over-channels

We considered what it would mean in the context of an unbuffered channel, in particular the statement: "This example also showed that it’s possible to close a non-empty channel but still have the remaining values be received."

My reply was...

Designing with an unbuffered channel depends on whether (in the context of the sender/receiver):

  1. blocking is acceptable - if the statements after the send/close need to be executed immediately, the channel needs to be unblocked
  2. there is another receiver already present to consume those sent items

Two solutions that come to mind with unbuffered channels:

  1. implement buffering by doing a fast-fail send and firing off a separate coroutine to handle negotiation with the receiver
  2. design it such that the receiver is always present before that code is executed

Coroutines are cheap, very cheap (in the context of libdill)... they are as quick to launch as sending a message. They differ by only a few percent, or in the worst case are still within the same order of magnitude. So another way of thinking about it is that the buffer (the copying) can be described as a lightweight process (essentially the memcpy). The only difference is that the buffer stores the results somewhere intermediate and a second process (the receiver) starts reading them afterwards.

Edit: to add a bit of context for non-libdill users, a single libdill context is single-threaded, so we don't have the concerns that arise when channels are used across parallel execution (real) threads.

paulofaria commented 7 years ago

I'd love to hear what @robpike has to say about this. I have a feeling we can do away with buffered channels, as I think there's nothing we can't do without them. I just wonder about the complexity of a solution without buffered channels. Something like the example in the link:

https://github.com/Zewo/Venice/blob/master/README.md#10---iterating-over-channels

How would we do it without buffered channels?

paulofaria commented 7 years ago

Sorry for repeating the link, haha. I posted before refreshing the page. @raedwulf was faster.

sustrik commented 7 years ago

So another way of thinking about it is that the buffer (the copying) can be described as a lightweight process (essentially the memcpy). The only difference is that the buffer stores the results somewhere intermediate and a second process (the receiver) starts reading them afterwards.

I like that. What it means for the end user is that instead of having a single channel object that can be either buffered or unbuffered, they would have two objects: a channel (unbuffered) and a queue (buffered, implemented as a coroutine):

int ch = channel(sizeof(int));
int q = queue(sizeof(int), 1000);

One benefit of this model is that queues could be somewhat more heavyweight. People may want stuff like auto-growing queues, queue-size reporting, etc.

Another benefit: with a fixed-size channel object, it is possible to allocate it on the stack in scenarios where performance is paramount:

channel_storage_t s;
int ch = channel_stack(&s);

Interesting tangential observation: the implementation of queue is an actual use case for the choose() statement.
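
A rough sketch of such a queue coroutine on top of the current API, staging items in a local ring buffer and using choose() to wait on whichever side can make progress (error handling omitted; the real API might return a queue handle instead):

coroutine void queue(int in, int out, size_t itemsz, size_t cap) {
    char *buf = malloc(itemsz * cap);
    size_t head = 0, count = 0;
    while(1) {
        struct chclause cls[2];
        int n = 0;
        if(count < cap) {  /* room left: offer to receive a new item */
            struct chclause c = {CHRECV, in,
                buf + ((head + count) % cap) * itemsz, itemsz};
            cls[n++] = c;
        }
        if(count > 0) {    /* items pending: offer to send the oldest */
            struct chclause c = {CHSEND, out, buf + head * itemsz, itemsz};
            cls[n++] = c;
        }
        int rc = choose(cls, n, -1);  /* index of the clause that fired */
        if(rc < 0) break;
        if(cls[rc].op == CHRECV) count++;
        else { head = (head + 1) % cap; count--; }
    }
    free(buf);
}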

As for the buffers passed via channels, I am inclined to keep the fixed item size (a.k.a. the poor man's type system). The reason is that otherwise we get a usability problem. Each call to chrecv would have to look like this:

ssize_t sz = chrecv(ch, &val, sizeof(val));  /* hypothetical variable-size chrecv */
if(sz < 0) do_error_handling();
if(sz != sizeof(val)) do_error_handling();

Having one line to check the result is enough for a lot of people to hate C. Having two checks makes it even worse.

That being said, if the user wants to pass an arbitrarily sized buffer, they can just pass a pointer.
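
For example, ownership of an arbitrarily sized buffer can be transferred by sending the pointer itself as the fixed-size item:

int ch = channel(sizeof(char *), 0);

/* sender: */
char *data = malloc(1024);
chsend(ch, &data, sizeof(data), -1);  /* the item is the pointer itself */

/* receiver: */
char *received;
chrecv(ch, &received, sizeof(received), -1);
free(received);  /* ownership moved along with the pointer */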

sustrik commented 7 years ago

  4. Are there design patterns involving small cycles of resource contention which a buffer larger than the cycle would solve? Is there a way to solve it without using buffered channels? I'm unfamiliar with cooperative synchronisation semantics, so I don't know if this is an existing paradigm or even possible. Perhaps if someone can produce an example or counter-example of this, it would be very useful!

That's the semaphore use case. You can devise synchronization mechanisms where the invariant is, for example, "there are at most 2 messages in the queue". These are pretty rare though.

raedwulf commented 7 years ago

That being said, if the user wants to pass arbitrarily sized buffer, they can just pass a pointer.

Yes, this would be the building block for queue.

Having one line to check the result is enough for a lot of people to hate C. Having two checks makes it even worse.

I find that this makes developing error-free code so much easier (you have local information about which function's error you are handling)! Exception handling tempts one to be sloppy in handling specific errors because catch-alls are so easy to implement.

raedwulf commented 7 years ago

Are you planning to add queue within libdill, or will this be user-implemented functionality? If it is going to be added to libdill I'll keep this issue open; if not, I guess it can be closed.