raedwulf opened this issue 7 years ago
An alternative is to have a prioritized choose(), i.e. if there are multiple active clauses, always choose the one that comes first in the array. To make a prioritized queue with three priority levels, create three channels and use the prioritized choose to receive messages from them.
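For illustration, a minimal sketch of how that could look with the chclause/choose() API, assuming the proposed first-ready-clause-wins semantics (msg_t, process() and the three channel handles are placeholders, not part of libdill):

#include <libdill.h>

typedef struct { int id; } msg_t;          /* placeholder message type */
void process(msg_t msg);                   /* placeholder handler */

void receive_prioritized(int high, int mid, int low) {
    msg_t msg;
    struct chclause cls[] = {
        {.op = CHRECV, .ch = high, .val = &msg, .len = sizeof(msg)},  /* highest priority */
        {.op = CHRECV, .ch = mid,  .val = &msg, .len = sizeof(msg)},
        {.op = CHRECV, .ch = low,  .val = &msg, .len = sizeof(msg)}}; /* lowest priority */
    /* With prioritized semantics, the first clause in the array that is ready wins. */
    int idx = choose(cls, 3, -1);
    if(idx >= 0) process(msg);
}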
I stepped back a little bit and did some quick reading on CSP (sadly I'm not much of an expert in this area, despite being in an office with three people who knew CSP!).
Anyway, from my understanding, CSP has deterministic and non-deterministic choices. I believe what we're trying to achieve is a deterministic choice, whereas the current semantics for choose are non-deterministic.
So the two approaches suggested were a prioritized choose and explicit priority numbers. Effectively, the prioritized choose defines the choice locally, whereas priority numbers define the choice globally.
They both provide similar capability, but after some thought experiments, I believe they would lead to different overall application designs.
The priority-number approach lends itself to a one-channel-per-coroutine model. A coroutine has a single channel that carries all the types of messages the coroutine wants to handle. Design-wise, the coroutine's specification is that the events/messages/resources it receives and sends are determined and specified for its single channel. This completely removes the need for the choose function.
The prioritized choose lends itself to a multiple-channel-per-coroutine model. A coroutine uses a number of channels, each carrying one specific type of message, and choose allows priorities to be specified over those channels at runtime.
Both are equally viable. Using a deterministic choose does give the advantage that channels do not need to be multiplexed into a single channel, making load-balancing simpler to encode.
Priority numbers reduce the API footprint (there is no choose), and the functions are individually more efficient (only 2-3 ns slower than the non-priority versions of chsend and chrecv).
If performance is a concern, it is hard to say whether extracting a message from a channel and multiplexing it for a coroutine will be cheaper or more expensive than using a deterministic choose. As an educated guess it will be roughly the same, but I suspect I can make the deterministic choose faster overall. One of the major limiting factors of the existing choose is the reserved field, which gets initialised to zero; that is costly when the clauses are declared on the stack. It would be much better to allocate that memory on the stack within the choose function and not initialise it.
Some thoughts in no specific order:
It would be nice to see some use cases. Most of the prioritized-queue solutions I've seen were dealing with cancellation events (e.g. OOB in TCP). However, we already have a mechanism to cancel coroutines. What are the other use cases?
Answering your third point above: I agree with those thoughts.
Having multiple chrecv calls on the same channel from multiple go'd coroutines effectively covers the non-deterministic use case. In some ways, this even resembles CSP more than the existing choose does, as it deals with non-deterministic choices over processes as opposed to events (which messages in channels seem to be the analogue of).
Hence, we're left with the deterministic choice --- and I had a moment of clarity.
The main use case for deterministic choice is to alter the behaviour of the coroutine (and thus it is OOB). Cancelling/termination is one such alteration of behaviour. I won't deal with other types of events here, because those can be trivially multiplexed or alternatively given to another coroutine.
As a side-effect of allowing cancelling/termination, you can achieve any deterministic choice by terminating an existing coroutine and recreating it with the desired behaviour. So, in effect, you do not actually need any other form of OOB message.
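As a rough sketch of that terminate-and-recreate idea (worker, switch_mode and the mode values are hypothetical, not part of libdill):

#include <libdill.h>

coroutine void worker(int input, int mode);   /* hypothetical worker parameterised by behaviour */

/* Change a worker's behaviour by cancelling it and relaunching it with a new mode. */
void switch_mode(int *h, int input, int new_mode) {
    hclose(*h);                        /* blocking calls inside the old worker fail with ECANCELED */
    *h = go(worker(input, new_mode));  /* relaunch; logic state is rebuilt from scratch */
}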
So what we're actually looking at is whether there are types of OOB messages which occur frequently enough that recreating the process (and all of its logic state) is prohibitively expensive or simply difficult.
Quick numbers game, single-threaded (in my fastest configuration on my laptop; the desktop generally gives half these values): a chsend and a chrecv are 20 ns each and a roundtrip is 45 ns. Recreating a coroutine after termination from chdone is 100 ns when the logic state is negligible.
New benchmarks here: https://github.com/raedwulf/libdill/tree/choose-opt
Some contrived use-cases come to mind, given that the existing choose is non-deterministic. In terms of order of magnitude, destroying/recreating a coroutine could work, as time constraints are usually much higher than this. I think the main overhead is in both the amount of code and the consideration of how to encode the state so that it can be moved between coroutines. A non-deterministic choose could be used as well, but it is more likely to break soft-realtime constraints.
I think the whole point of priorities is to deal with real-time constraints. It is difficult to make libdill hard real-time (being cooperatively scheduled), but I think adding priorities gives some credibility to being soft real-time, which I guess is an interesting use case on its own (for games and interactive simulations, anyway).
An instance where it would be inconvenient to recreate the state of a coroutine is when you have a separate library or function that controls its own state and it has been created within the coroutine, as opposed to outside of it. Arguably you could create the state outside, or even preserve it across the coroutine hand-off, but that is still probably a lot harder than simply receiving a message that alters the behaviour of the coroutine from within.
I think it boils down to a form of "locality". I create some object in a library, so when the coroutine finishes, I destroy that object. With the cancellation approach, this is no longer true. It is not a technical issue, more a design headache.
What about this approach?
coroutine void worker(int input, int oob) {
    while(1) {
        msg_t msg;
        /* Block until a regular message arrives. */
        chrecv(input, &msg, sizeof(msg), -1);
        process_input(msg);
        /* Poll the OOB channel without blocking (zero deadline). */
        int rc = chrecv(oob, &msg, sizeof(msg), 0);
        if(rc == 0) process_oob(msg);
    }
}
Yes, I think that would work, provided that processing does not take so long that an OOB message which was intended to affect msg (and alter the output for the receiver) arrives too late.
It's tricky to give any real examples, as I doubt many existing programs use libmill/libdill for soft-realtime purposes yet, so whether there is a use case where it is unacceptable to be delayed by a single message is an open question. Having said that:
coroutine void worker(int input, int oob) {
    while(1) {
        msg_t msg, oobmsg;
        /* Block for the next in-band message. */
        chrecv(input, &msg, sizeof(msg), -1);
        /* Check for an OOB message before processing it (non-blocking). */
        int rc = chrecv(oob, &oobmsg, sizeof(oobmsg), 0);
        if(rc == 0) process_oob(oobmsg);
        process(msg);
    }
}
That would work too, right?
This relies on the assumption that OOB messages are not particularly useful on their own without the in-band data, so you would only need to process OOB messages after receiving a message but before processing it.
I think we're onto something :)
Yes, that's clever.
Conceptually, while the above may look like a cheap trick to avoid using choose(), one should visualize a coroutine as a linear sequence of steps (the goal is not to do scheduling by hand, after all). From that perspective it looks pretty reasonable: get a message, change settings if needed, process the message, repeat.
This, in effect, emulates a dual-priority system (with little analysis but a good gut feeling):
coroutine void worker(int input, int oob) {
    while(1) {
        msg_t msg, oobmsg;
        /* We block on data first; assume that OOB is only needed before processing. */
        chrecv(input, &msg, sizeof(msg), -1);
        /* Drain all pending OOB messages; this emulates a higher-priority band. */
        while(1) {
            int rc = chrecv(oob, &oobmsg, sizeof(oobmsg), 0);
            if(rc != 0) break;
            process_oob(oobmsg);
        }
        process(msg);
    }
}
Whether more than two priorities have use-cases we haven't considered is an open question. I suspect that having only one data band/channel per coroutine (the typical use case), or multiple homogeneous data bands/channels (i.e. multiplexers), means we don't need more than one OOB channel.
@paulofaria, what are your thoughts on this, as you guys are using libmill in practice?
There's a patch to make choose deterministic in the deterministic-choose branch, in case you want to review it.
The reasoning is as follows: in Go, the clause is chosen at random to ensure fairness among channels and prevent any of them from being starved. While that makes sense if all the channels are of the same type (i.e. they are semantically equivalent), libdill's design is based on the assumption that all semantically equivalent messages are passed via a single channel (possibly with multiple senders). Therefore, if choose is used at all, the channels in question are likely to differ semantically. And if that's the case, prioritization based on type is preferable to picking one type at random.
That looks good. I optimised choose slightly by using VLAs instead of the reserved field in chclause. It optimises the use case of having few channels and allocating the clauses on the stack.
https://github.com/raedwulf/libdill/tree/choose-opt
That branch is based on the non-deterministic choose, but it should be fairly trivial to adapt, as reserved is now only used in one location; you don't even need to cast.
Good idea. Done. If you want your micro-benchmarks included in the mainline, send me a pull request.
While we are dismantling the golang legacy: What's the use case for buffered channels?
Quick web search yields this, but none of the use cases sounds really convincing: http://stackoverflow.com/questions/15113410/when-to-use-a-buffered-channel
I can imagine that if an event source chsends in batches while a receiver consumes in batches as well, buffers would be useful as an optimisation. However, due to the way libdill works (yielding at the end of every call) and the fact that the scheduler is round-robin, this use case is inefficient.
I would say that in this case (and possibly others) the sender and receiver should be manually responsible for their own buffering, as opposed to the channel buffering automatically. And to deal with batching, each item can be a variable-sized batch; perhaps this is an argument for variable message sizes up to some maximum message size. Without buffering, this becomes trivially possible.
Buffered channels could be useful if there is some use-case where data must be buffered quickly or it disappears, or where some other internal buffer overflows. I cannot think of any examples of this happening, however (apart from a keyboard buffer, which is obviously not a libdill use case... yet). With soft real-time constraints probably good enough now, I think the user can implement buffering if they need to.
Are there any cases where a buffer would mitigate deadlocks between different channels?
I think buffered channels can mask deadlocks, but not prevent them.
Yes, that is true, hence the word mitigate :P I think there would be cases where it would solve a deadlock that depends on a number of yields to resolve. This is probably bad practice, as it is not easy to analyse such behaviour; removing buffering would make such behaviour fail faster :)
The more I think about it, what we should have is large item buffers where a single item can be of any size up to that buffer. For example:
- The channel has a 512-byte item buffer.
- chsend sends a 35-byte item.
- chrecv receives 35 bytes and returns that length to the user.
These are not partial reads (or streaming behaviour), as the item block is complete; the user is actually encoding 7 items of 5 bytes each. I think this would let the user perform their own buffering (which would actually work, as opposed to the current behaviour).
With unbuffered channels, data is copied directly from the sender's buffer to the receiver's buffer. Therefore, the channel needs no storage and no fixed item size. Just saying.
Yeah that would work.
I think I'm describing partial reads/writes again (this time in the context of channels), but with the added guarantee that the sender's buffers/message sizes never exceed those of the receiver and that received data is complete.
Ultimately, I don't think what I'm suggesting is necessary; the time spent sending channel messages is unlikely to be the bottleneck in an application, so an increase in throughput is unnecessary, unless you use a channel to send/receive large amounts of data. I think latency is more important for channels.
Ah okay. Github was not updating the comments for me earlier - yes that makes sense :)
One more thought experiment with channels.
Consider a single message processor reading messages from a single channel written to by many clients. One would expect the scheduler to be fair. In other words, a client's experience should not be disproportionately affected by the presence of other clients. In yet other words, if you send a message once a second and then a new client arrives able to produce a million messages a second, your performance should drop to 1/2, not to 1/1000000.
With unbuffered channels this is trivially the case. The channel is effectively doing round-robin among the clients.
With a 1000-item buffered channel the situation is different: given that the other client sends so many messages, each time you want to send your message you'll encounter a full buffer. Thanks to round-robin you'll get a fair chance to write to it, but there are still 999 items ahead of you in the line! That's bad news for your latency.
The takeaway is that queues introduce some non-trivial behaviour that may not be appropriate for a low-level synchronization primitive like a channel. Maybe it would be better to have unbuffered channels and build queues on top of them?
Other thoughts:
- We can't have typed channels like channel<int> in C, but that's the nature of the language.
- We can go in the opposite direction and make the channel transfer arbitrary chunks of data. If we do so, the real challenge is how to deal with buffer overflows (i.e. when the chunk sent is larger than the receiver's buffer).
After thinking about this a bit too, I think making channels unbuffered is the way to go.
My thoughts seem to align with 1 and 2. For 3, I'm thinking that the channel can allow data up to some channel max item size. So the item length is sent as part of the buffer and len <= ch->itemsz.
The semantics would be that items must be complete as far as libdill is aware; it is for cases where it is unnecessary to copy the max item size each time and read it back.
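As a rough sketch of the usage this is aiming at (this is the proposed behaviour, not the current libdill API; MAX_ITEM, make_batch, and the length-returning chrecv are all hypothetical):

/* Hypothetical sketch of the proposed semantics, not the current API. */
#define MAX_ITEM 512

void example(int ch) {  /* ch: channel created with a maximum item size of MAX_ITEM */
    char out[MAX_ITEM];
    size_t len = make_batch(out, sizeof(out));      /* e.g. 7 sub-items of 5 bytes = 35 bytes */
    chsend(ch, out, len, -1);                       /* only len bytes are copied, len <= itemsz */

    char in[MAX_ITEM];
    ssize_t rlen = chrecv(ch, in, sizeof(in), -1);  /* would return 35: one complete item, not a partial read */
    if(rlen == (ssize_t)len) { /* complete item received */ }
}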
A buffer overflow will effectively mean that there is an item waiting in the receiver and the sender is blocked waiting on that. This may require a sort of yield spin (this is what buffering avoids, I guess).
Backtracking slightly: maybe we should, for the moment, forget that channels can transmit data at all (and forget about arbitrary sizes) and think about channels as a means of transmitting control signals. Data can always be transmitted through different means if we use channels for synchronisation.
Buffered control signals have the issues you stated, but they do reduce the effect of the busy-yield problem (when the receiver is full but you need to synchronise). I'm on my phone at the moment, but it should be possible for the receiver to wake up the sender directly after receiving the message.
EDIT: Maybe I am getting confused in my own thoughts, as there's some ambiguous terminology which means we're crossing wires quite a lot, so I'll list the types of channel use we have in the next comment.
Use cases:
Implementation
Questions:
Edit: Go channels seem to face more problems than we do, because Go runs goroutines on multiple kernel threads, so thread-safety becomes more difficult than it is for us: http://www.jtolds.com/writing/2016/03/go-channels-are-bad-and-you-should-feel-bad/
His solution with mutexes is already implicitly provided by libdill, by the fact that everything is effectively mutexed between yield points in a single-threaded, cooperatively scheduled coroutine.
To add to 3 and 4: I'm not advocating the use of buffered channels, but the fact that they exist in libmill, libdill and Go, as opposed to those libraries using unbuffered channels to begin with, means that we've been using one set of semantics all along. Removing that set of semantics may cause deadlock problems to crop up, so how do we solve those when the channel is unbuffered (if buffered channels were the solution to begin with)?
One thought about implementing your own buffer on top of unbuffered channels would be to try a chsend and detect failure, then store the message into a local buffer and continue your loop.
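A minimal sketch of that idea, assuming chsend with a zero deadline for the non-blocking attempt (msg_t, produce_next and the fixed-size pending[] buffer are placeholders):

#include <libdill.h>
#include <string.h>

typedef struct { int id; } msg_t;      /* placeholder message type */
extern msg_t produce_next(void);       /* hypothetical message source */

coroutine void producer(int ch) {
    msg_t pending[64];                 /* local buffer layered on top of the unbuffered channel */
    size_t n = 0;
    while(1) {
        /* Flush previously buffered messages first, to preserve ordering. */
        while(n > 0 && chsend(ch, &pending[0], sizeof(msg_t), 0) == 0)
            memmove(&pending[0], &pending[1], --n * sizeof(msg_t));
        msg_t m = produce_next();
        /* Non-blocking send attempt (zero deadline); if the receiver isn't ready, keep it locally. */
        if(n > 0 || chsend(ch, &m, sizeof(m), 0) != 0) {
            if(n < 64) pending[n++] = m;   /* overflow policy (dropping here) is a placeholder */
        }
        yield();                       /* give the receiver a chance to run */
    }
}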
From a discussion with @paulofaria earlier about an example of buffered channels (using Venice, which in turn uses libmill):
https://github.com/Zewo/Venice/blob/master/README.md#10---iterating-over-channels
We considered what it would mean in the context of an unbuffered channel, in particular the statement: "This example also showed that it’s possible to close a non-empty channel but still have the remaining values be received."
My reply was...
Designing with an unbuffered channel depends on whether (in the context of the sender/receiver):
Two solutions that come to mind with unbuffered channels:
Coroutines are cheap, very cheap (in the context of libdill); they are as quick to launch as sending a message. They differ by only a few percent, or in the worst case they're still in the same order of magnitude. So another way of thinking about it is that the buffer (the copying) can be described as a lightweight process (essentially the memcpy). The only difference is that the buffer stores the results somewhere intermediate and a second process (the receiver) starts reading them afterwards.
Edit: to add a bit of context for non-libdill users, a single libdill context is single-threaded, so we don't have to worry about channels being used across parallel (real) threads.
I'd love to hear what @robpike has to say about this. I have a feeling we can do away with buffered channels, as I think there's nothing we can't do without them. I just wonder about the complexity of a solution without buffered channels. Something like the example in the link:
https://github.com/Zewo/Venice/blob/master/README.md#10---iterating-over-channels
How would we do it without buffered channels?
Sorry for repeating the link, haha. I posted before refreshing the page. @raedwulf was faster.
So another way of thinking about it is that the buffer (copying) can be described as a lightweight process (the memcpy essentially). The only difference is that the buffer stores the results somewhere intermediate and a second process (the receiver) starts reading it afterwards.
I like that. What it means for the end user is that, instead of having a single channel object that can be either buffered or unbuffered, they would have two objects: a channel (unbuffered) and a queue (buffered, implemented as a coroutine):
int ch = channel(sizeof(int));       /* unbuffered channel */
int q = queue(sizeof(int), 1000);    /* queue with room for 1000 items */
One benefit of this model would be that queues could be somewhat more heavy-weight. People may want stuff like auto-growing queues, queue size reporting etc.
Another benefit: with a fixed-size channel object it is possible to allocate it on the stack in scenarios where performance is paramount:
channel_storage_t s;            /* channel storage on the stack */
int ch = channel_stack(&s);     /* create the channel inside that storage */
An interesting tangential observation: the implementation of the queue is an actual use case for the choose() statement.
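A rough sketch of what such a queue coroutine could look like, built from two unbuffered channels and the existing choose()/chclause API (queue_worker, its parameters and the ring buffer are illustrative, not a proposed libdill API):

#include <libdill.h>
#include <stdlib.h>

/* A buffered "queue": a coroutine shuffling items between two unbuffered channels. */
coroutine void queue_worker(int in, int out, size_t itemsz, size_t cap) {
    char *buf = malloc(itemsz * cap);          /* ring buffer storage */
    size_t head = 0, count = 0;
    while(1) {
        struct chclause cls[2];
        int n = 0;
        if(count < cap)                        /* room left: offer to receive into the tail slot */
            cls[n++] = (struct chclause){.op = CHRECV, .ch = in,
                .val = buf + ((head + count) % cap) * itemsz, .len = itemsz};
        if(count > 0)                          /* items queued: offer to send the head item */
            cls[n++] = (struct chclause){.op = CHSEND, .ch = out,
                .val = buf + head * itemsz, .len = itemsz};
        int idx = choose(cls, n, -1);
        if(idx < 0) break;                     /* e.g. cancelled; error handling elided */
        if(cls[idx].op == CHRECV) count++;
        else { head = (head + 1) % cap; count--; }
    }
    free(buf);
}

The user-facing queue() constructor could then just create the two unbuffered channels and launch this coroutine with go().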
As for the buffers passed via channels, I am inclined to keep the fixed item size (a.k.a. a poor man's type system). The reason is that otherwise we get a usability problem. Each call to chrecv would have to look like this:
ssize_t sz = chrecv(ch, &val, sizeof(val));
if(sz < 0) do_error_handling();
if(sz != sizeof(val)) do_error_handling();
Having one line to check the result is enough for a lot of people to hate C. Having two checks makes it even worse.
That being said, if the user wants to pass an arbitrarily sized buffer, they can just pass a pointer.
- Are there design patterns involving small cycles of resource contention which a buffer larger than the cycle would solve? Is there a way to solve it without using buffered channels? I'm unfamiliar with cooperative synchronisation semantics, so I don't know if this is an existing paradigm or even possible. Perhaps if someone can produce an example or counter-example of this, it would be very useful!
That's the semaphore use case. You can devise synchronization mechanisms where the invariant is, for example, "there are at most 2 messages in the queue". These are pretty rare though.
That being said, if the user wants to pass an arbitrarily sized buffer, they can just pass a pointer.
Yes, this would be the building block for queue.
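And a minimal sketch of that pointer-passing building block (block_t, the payload and the ownership convention are placeholders): the channel item stays fixed-size while the payload can be any size.

#include <libdill.h>
#include <stdlib.h>
#include <string.h>

typedef struct { void *data; size_t len; } block_t;      /* fixed-size descriptor passed by value */

coroutine void sender(int ch) {
    char *payload = strdup("an arbitrarily sized buffer");   /* heap buffer of any size */
    block_t b = { .data = payload, .len = strlen(payload) + 1 };
    chsend(ch, &b, sizeof(b), -1);                        /* ownership moves to the receiver */
}

coroutine void receiver(int ch) {
    block_t b;
    if(chrecv(ch, &b, sizeof(b), -1) == 0) {
        /* ... consume b.data / b.len ... */
        free(b.data);                                     /* receiver releases the buffer */
    }
}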
Having one line to check the result is enough for a lot of people to hate C. Having two checks makes it even worse.
I find that this makes developing error-free code so much easier (you have local information about which function's error you are handling)! Exception handling tempts one to be sloppy about handling specific errors because catch-alls are so easy to implement.
Are you planning to add queue within libdill, or is this meant to be user-implemented functionality? If it is going to be added to libdill I'll keep this open; if not, I guess this issue can be closed.
Two options here.
Reproduced from libdill.h for discussion.
First proposed API: message priorities via separate functions.
Second proposed API: replace the existing functions, but that requires additional parameters, which will break the existing API.
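The actual declarations aren't reproduced above; purely as a hypothetical illustration (not the real proposal), the two shapes might look something like:

#include <stddef.h>
#include <stdint.h>

/* Option 1: separate prioritized functions, keeping chsend/chrecv unchanged. */
int chsend_prio(int ch, const void *val, size_t len, int priority, int64_t deadline);
int chrecv_prio(int ch, void *val, size_t len, int priority, int64_t deadline);

/* Option 2: add a priority parameter to the existing functions (breaks the current API). */
int chsend(int ch, const void *val, size_t len, int priority, int64_t deadline);
int chrecv(int ch, void *val, size_t len, int priority, int64_t deadline);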
Any thoughts?