haskell-distributed / network-transport-tcp

TCP Realisation of Network.Transport
http://haskell-distributed.github.com
BSD 3-Clause "New" or "Revised" License

Queueing and buffering for an EndPoint #46

Closed: avieth closed this 7 years ago

avieth commented 7 years ago

Abstracted the use of Chan/readChan/writeChan so that we can give more complicated queueing and buffering policies.
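
As a rough sketch of the abstraction being described here, assuming a hypothetical QDisc record (the field and type names are illustrative, not necessarily those used in the patch):

    -- A queueing policy for items of type t: producer threads enqueue,
    -- the receiving thread dequeues. Abstracting the direct use of
    -- Chan/readChan/writeChan behind a record like this lets other
    -- buffering and queueing policies be swapped in without touching
    -- the rest of the EndPoint code.
    data QDisc t = QDisc
      { qdiscEnqueue :: t -> IO ()
      , qdiscDequeue :: IO t
      }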

qnikst commented 7 years ago

I like this approach much more than the block/unblock one; this one is pretty consistent, and questions about whether its semantics are still sound in corner cases do not arise. I'd like to merge it, though it would be nice to have more benchmark data to see whether there is a measurable performance hit and whether we could do better. I don't see anything that could be improved by reviewing the patch, but raw numbers will speak better.

qnikst commented 7 years ago

I did some small benchmarking (the really basic benchmark that was already in the n-t-tcp codebase, a simple ping-pong server and client):

this branch:

        Command being timed: "./JustPingTransport 100000"
        User time (seconds): 4.35
        System time (seconds): 3.53
        Percent of CPU this job got: 80%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.81
        Maximum resident set size (kbytes): 6524
        Minor (reclaiming a frame) page faults: 601
        Voluntary context switches: 1
        Involuntary context switches: 1551543

versus master:

        Command being timed: "./JustPingTransport 100000"
        User time (seconds): 3.65
        System time (seconds): 3.59
        Percent of CPU this job got: 79%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.11
        Maximum resident set size (kbytes): 5532
        Minor (reclaiming a frame) page faults: 434
        Voluntary context switches: 6
        Involuntary context switches: 1552027

Looks like a ~20% penalty in user time for this functionality (4.35 s vs. 3.65 s, if I'm calculating correctly), and the additional time is spent in user space rather than in the kernel. @avieth, should this functionality have such a cost, and are there better benchmarks? The default one may not be the best fit. Unfortunately we don't have a good benchmark framework available for measuring different scenarios.

avieth commented 7 years ago

@qnikst Could you send me the source for the benchmark program?

qnikst commented 7 years ago

https://github.com/haskell-distributed/network-transport-tcp/blob/master/benchmarks/JustPingTransport.hs

avieth commented 7 years ago

@qnikst I suspect that more complicated queueing policies will always come with a decrease in maximal throughput, but hopefully with an increase in average throughput in a real network scenario. The unbounded Chan approach that we use in master means minimal contention between the receiving thread and the threads producing events, but with my STM and priority-queue approach there's bound to be a lot more.

So I've gone ahead and generalized queueing so that even the old unbounded Chan approach can be expressed. Performance remains the same when that one is chosen, but users can come up with their own policies and use them via newEndPointInternal. I just need to extend the long comment on QDisc to make sure it's clear exactly how it must behave. Incidentally, this gives a place where Reliability can be used (AFAICT nt-tcp just treats everything as if it were reliable and ordered, but a weaker policy would allow for more queueing styles).
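
For illustration, the old behaviour can be recovered as an instance of the QDisc record sketched earlier: a Chan-backed policy whose enqueue never blocks (names hypothetical; this is a sketch rather than the patch's exact code):

    import Control.Concurrent.Chan (newChan, readChan, writeChan)

    -- The old unbounded behaviour expressed as a QDisc: writeChan never
    -- blocks, so producers are never throttled and the queue may grow
    -- without bound.
    unboundedQDisc :: IO (QDisc t)
    unboundedQDisc = do
      chan <- newChan
      return QDisc { qdiscEnqueue = writeChan chan
                   , qdiscDequeue = readChan chan
                   }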

qnikst commented 7 years ago

@avieth Right, I have no problem paying for more flexible approaches; as far as I can see, your approach may improve situations that are bad under the old one. I was just surprised by the overhead, so I wanted to verify whether it was expected. I like the ability to support Unreliable connections that may drop data if the queue overflows; sorry, I completely forgot to mention that. CCI does exactly that for unreliable connections over TCP.

dcoutts commented 7 years ago

I've been away and @avieth and I have just now had time to review this together. We think we can revise this patch to simplify it somewhat, which should make it easier to review and merge.

I think we don't actually need the QMetadata, though I've not yet double checked this with @avieth. That should reduce the overhead slightly. I think we only need to abstract over the use of readChan/writeChan, to be able to replace an unbounded queue with a bounded one.

We also think it makes sense to simplify the API by making the QDisc a part of the initialisation params for the transport as a whole, rather than trying to make an extended internal interface for the tcp backend. This would mean every endpoint for the transport would use the same QDisc, which is probably ok for most use cases (it is for ours).

Finally, we could provide more QDisc implementations for other people to use, rather than just the unbounded queue. It would be trivial to provide a one-place bounded queue (i.e. a single MVar). We probably don't want to provide a general bounded queue, since that'd involve adding a dependency on the stm package.
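
A minimal sketch of that one-place policy, in terms of the QDisc record sketched earlier (hedged; the shipped implementation may differ):

    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

    -- A one-place bounded queue: putMVar blocks while the slot is
    -- occupied, so a producer can never get more than one event ahead
    -- of the consumer.
    onePlaceQDisc :: IO (QDisc t)
    onePlaceQDisc = do
      slot <- newEmptyMVar
      return QDisc { qdiscEnqueue = putMVar slot
                   , qdiscDequeue = takeMVar slot
                   }

Note that qdiscEnqueue can block here, so code written against the old never-blocking Chan semantics may deadlock; presumably this is behind the test deadlocks avieth mentions below.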

We'll try re-running benchmarks with this simplified implementation to verify it does not affect performance.

So expect an update...

avieth commented 7 years ago

Finally, we could provide more QDisc implementations for other people to use, rather than just the unbounded queue. It would be trivial to provide a one-place bounded queue (i.e. a single MVar).

Such a QDisc causes some tests to deadlock. Fix is here: https://github.com/haskell-distributed/network-transport-tests/pull/10

I think we don't actually need the QMetadata, though I've not yet double checked this with @avieth. That should reduce the overhead slightly. I think we only need to abstract over the use of readChan/writeChan, to be able to replace an unbounded queue with a bounded one.

Well, I can imagine that more complicated QDiscs would make decisions based on, for instance, how many connections are open to a given peer, or how many bytes have been received on a connection. In that case, it's important to have some metadata.
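
To make that concrete, here is a hypothetical sketch of a metadata-aware policy: with the peer's identity available at enqueue time, a QDisc can cap queued events per peer and drop the excess, in the spirit of the unreliable connections mentioned earlier. The QDiscWithMeta type, the Peer alias, and the counting scheme are all inventions for illustration:

    import Control.Concurrent.Chan (newChan, readChan, writeChan)
    import Control.Monad (when)
    import Data.IORef (newIORef, atomicModifyIORef')
    import qualified Data.Map.Strict as Map

    -- Hypothetical peer identifier; in network-transport this role
    -- would be played by EndPointAddress.
    type Peer = String

    data QDiscWithMeta t = QDiscWithMeta
      { enqueueFrom :: Peer -> t -> IO ()
      , dequeue     :: IO t
      }

    -- Cap the number of queued events per peer; events beyond the cap
    -- are silently dropped, which is only acceptable for unreliable
    -- traffic.
    droppingQDisc :: Int -> IO (QDiscWithMeta t)
    droppingQDisc cap = do
      chan   <- newChan            -- carries (peer, event) pairs
      counts <- newIORef Map.empty -- queued events per peer
      let enq peer x = do
            ok <- atomicModifyIORef' counts $ \m ->
              let n = Map.findWithDefault 0 peer m
              in if n < cap
                   then (Map.insert peer (n + 1) m, True)
                   else (m, False) -- over the cap: drop the event
            when ok (writeChan chan (peer, x))
          deq = do
            (peer, x) <- readChan chan
            atomicModifyIORef' counts $ \m ->
              (Map.adjust (subtract 1) peer m, ())
            return x
      return (QDiscWithMeta enq deq)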

We also think it makes sense to simplify the API by making the QDisc a part of the initialisation params for the transport as a whole, rather than trying to make an extended internal interface for the tcp backend. This would mean every endpoint for the transport would use the same QDisc, which is probably ok for most use cases (it is for ours).

Yes, this is now visible in the changeset (I've pushed a few new commits). There is tcpQDisc :: forall t . IO (QDisc t) in the TCPParameters. It's in IO because we want to be able to create mutable data structures locally for each EndPoint created. I like the relative simplicity of this, but I wonder how it will work for a more complicated QDisc that uses some time-varying controls for queueing. Surely the user will want to control it per EndPoint rather than for every EndPoint, don't you think? If so, it'll be cumbersome. If the QDisc were made per-EndPoint (as a parameter to newEndPoint), the user could get their hands on the controls directly, but as it stands they would have to use some indirection like a map keyed on EndPointAddress.
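
As a usage sketch, with tcpQDisc as named above (defaultTCPParameters is the library's default parameter record; onePlaceQDisc is from the earlier sketch, and the exact field name may differ after later renames):

    -- Hypothetical configuration: every EndPoint of this transport uses
    -- the one-place QDisc. Because tcpQDisc is an IO action, it is run
    -- once per EndPoint, giving each EndPoint its own fresh MVar.
    myParams :: TCPParameters
    myParams = defaultTCPParameters { tcpQDisc = onePlaceQDisc }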

edsko commented 7 years ago

LGTM:

  1. Probably don't need the QMetadata abstraction, as it's not entirely clear that it's the canonical one.
  2. I'd keep this in network-transport-tcp for now and move it to network-transport later, if at all.

avieth commented 7 years ago

The kill test fails (testKill in network-transport-tests) as of the latest commit 6636e5689fa8c67393d4796ee3a3a80747e9e6f5, and only on GHC 7.6.3. Very strange. That commit just changes names and couldn't possibly have introduced a regression. I suspect there's something wrong with this test.

qnikst commented 7 years ago

The tests could be bad; we experienced and fixed a number of problems with the tests last year, but problems may still be there.

facundominguez commented 7 years ago

@avieth please review my changes. If they are ok, I think this can be merged.

avieth commented 7 years ago

please review my changes. If they are ok, I think this can be merged.

LGTM