nanomsg / mangos

mangos is a pure Golang implementation of nanomsg's "Scalability Protocols"
Apache License 2.0

Statistics desired #76

Open gdamore opened 9 years ago

gdamore commented 9 years ago

It may be nice to have methods to track statistics. There are probably a great number of potentially useful stats. We need to think about them.
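
For discussion, one purely hypothetical shape such an API could take is sketched below. None of these names exist in mangos today; this is only a strawman to anchor the conversation.

// Hypothetical sketch only -- not an actual mangos API.
package stats

// Stat is a single named counter or gauge snapshot.
type Stat struct {
    Name  string // e.g. "rx_msgs", "tx_bytes", "dropped_msgs"
    Value int64
}

// Reporter could be implemented by sockets, dialers, and listeners that
// wish to expose statistics.
type Reporter interface {
    // Stats returns a point-in-time snapshot of all counters.
    Stats() []Stat
}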

nkev commented 8 years ago

+1. I also cannot find any benchmarks comparing mangos to nanomsg or any other messaging platform.

gdamore commented 8 years ago

Just a note @nkev -- we do have the local_lat, remote_lat, local_thr, and remote_thr utilities that nanomsg has. You can run them against each other, or you can run them against nanomsg. (There are four permutations for each of these.) As far as benchmarking locally, you can also use the Go benchmark facility in the test subdirectory. There are throughput and latency tests there. Obviously you cannot really compare those results against anything else, but it may still be useful -- for evaluating local changes to the code and their impact on performance, for example.
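
For anyone who wants to write their own quick comparison, such a benchmark is just a standard Go testing.B function. Here is a rough sketch (not one of the actual tests in the repository) of an inproc round-trip benchmark:

// Rough sketch only -- not one of the benchmarks shipped in the test subdirectory.
package bench

import (
    "testing"
    "time"

    "github.com/go-mangos/mangos"
    "github.com/go-mangos/mangos/protocol/pair"
    "github.com/go-mangos/mangos/transport/inproc"
)

func BenchmarkInprocRoundTrip(b *testing.B) {
    srv, err := pair.NewSocket()
    if err != nil {
        b.Fatal(err)
    }
    defer srv.Close()
    cli, err := pair.NewSocket()
    if err != nil {
        b.Fatal(err)
    }
    defer cli.Close()
    srv.AddTransport(inproc.NewTransport())
    cli.AddTransport(inproc.NewTransport())
    if err := srv.Listen("inproc://bench"); err != nil {
        b.Fatal(err)
    }
    if err := cli.Dial("inproc://bench"); err != nil {
        b.Fatal(err)
    }

    // Echo everything back on the server side.
    go func() {
        for {
            m, err := srv.RecvMsg()
            if err != nil {
                return
            }
            if srv.SendMsg(m) != nil {
                return
            }
        }
    }()

    // Give the dialer a moment to establish its pipe.
    time.Sleep(100 * time.Millisecond)
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        msg := mangos.NewMessage(8)
        msg.Body = append(msg.Body, []byte("ping")...)
        if err := cli.SendMsg(msg); err != nil {
            b.Fatal(err)
        }
        m, err := cli.RecvMsg()
        if err != nil {
            b.Fatal(err)
        }
        m.Free()
    }
}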

eelcocramer commented 7 years ago

I've added mangos to an existing message benchmark that I stumbled upon. It compares benchmarks of Python and Golang messaging solutions. For Golang it compares zmq, redis, and now mangos. You can find my code here:

https://github.com/eelcocramer/two-queues

I'm not sure if I made it as efficient as possible, but currently mangos does not even come close to zmq and redis.

Update: removed the graph as the figures are not reliable.

Any hints to improve the performance of my code?

gdamore commented 7 years ago

Looking at this briefly, you’re not making use of the mangos.Message structure, so you’re hitting the garbage collection logic really hard. You don’t suffer this penalty for the native zmq and redis code. I haven’t really done much more analysis than that. The libzmq and redis versions are written in C, and don’t make use of dynamic data at all, which gives them excellent performance. It might also be worth playing with the write/read QLengths.

Otherwise I need to spend more time looking at this — certainly I get the same terrible results, and I haven’t had time yet to fully digest this. One other thing that I suspect might be hurting us is Go’s default TCP settings.

I would also avoid changing the runtime’s choice for GOMAXPROCS… the modern Go runtime does a good job of selecting this automatically.

One thing I’m seeing is that you get much smaller numbers of messages pushed through than I do using the performance test suite (local_thr and remote_thr). It might be interesting to compare this to ZMQ — I think ZMQ has similar test suites. That said, it looks like mangos doesn’t ever come close to achieving the redis numbers.

Probably it would be worth profiling this code.
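
To make the first two suggestions concrete (reusing mangos.Message and adjusting the queue lengths), here is a rough sketch of a publisher doing both. It is not a drop-in patch for the two-queues code, and it assumes the mangos.OptionWriteQLen socket option referred to above:

// Rough sketch only -- not a drop-in patch for the two-queues benchmark.
package main

import (
    "log"

    "github.com/go-mangos/mangos"
    "github.com/go-mangos/mangos/protocol/pub"
    "github.com/go-mangos/mangos/transport/tcp"
)

func main() {
    sock, err := pub.NewSocket()
    if err != nil {
        log.Fatal(err)
    }
    defer sock.Close()
    sock.AddTransport(tcp.NewTransport())

    // Deepen the send queue (the default depth is 128) to trade memory for
    // fewer backpressure stalls; the receiving side has a matching
    // OptionReadQLen.
    if err := sock.SetOption(mangos.OptionWriteQLen, 1024); err != nil {
        log.Fatal(err)
    }

    if err := sock.Listen("tcp://127.0.0.1:4455"); err != nil {
        log.Fatal(err)
    }

    for i := 0; i < 100000; i++ {
        // Reusing mangos.Message avoids a fresh allocation per send and
        // keeps the garbage collector out of the hot path; SendMsg takes
        // ownership of the message.
        msg := mangos.NewMessage(16)
        msg.Body = append(msg.Body, []byte("payload")...)
        if err := sock.SendMsg(msg); err != nil {
            log.Fatal(err)
        }
    }
}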

eelcocramer commented 7 years ago

Thank you for the response and the solution pointers. I'll dig into this and post results back.

I will remove the graph because it probably does not give reliable information.

gdamore commented 7 years ago

I just did a little more investigation.

ZeroMQ (and possibly Redis) are queuing messages, and do not exert backpressure by default. (The default high water mark in ZeroMQ is 0.) This means that you can send as fast as you can, and you’re limited only by the rate at which you can allocate message buffers.

With mangos, we have a situation where there is backpressure involved — the default is to queue only up to 128 messages before exerting backpressure. Experimentally this has been found to be a good limit for most cases. In the push/pull case we see backpressure exerted, and the push routine backs up. In zmq’s case we just spin hard. Pub/Sub is a bit harder, because that’s best-effort delivery, so we actually drop messages if we try to send too fast. Benchmarking pub/sub is harder as a result, since messages will be dropped when you push too hard. (Conversely, ZMQ doesn’t drop, but your program’s heap usage may grow without bound. I consider this unacceptable in a production setting, and utterly useless from a benchmarking perspective.)

The problem is that I’m not convinced, having looked at the code, that we’re actually measuring message throughput at all; we may just be measuring how fast certain routines can be called in a loop. In other words, I think the benchmark collection methods are flawed.

For a pub/sub benchmark, I’d actually use a pair of peers, pubA->subB->pubB->subA — basically forwarding the message back. You could use this to measure both latency and single thread msgs/sec. As you increase the number of concurrent clients you will hit a threshold where message drops occur; certainly at 128 (the default queue depth) the clients should outpace the server and you should see drops. You may see them faster than that depending on how fast we can pull messages from the underlying transport. I haven’t done any concrete experimentation yet.

eelcocramer commented 7 years ago

Thanks for the extensive answer.

gdamore commented 7 years ago

So I made a test that measures pub/sub round trips. TCP is the limiter.

With inproc using mangos, I can get about 720k rtt/sec with 8 clients, and 760k rtt/sec with 16 clients. Adding more clients helps. With just a single client using inproc I get 128k rtt/sec. This is a serial send/recv loop, really measuring round trips. (A publishes to B, and B publishes back to A; A and B each use two sockets.) I’m using mangos.NewMessage() to avoid the garbage collector in Go. This is also on my 2013 iMac, using Go 1.7.1, and no adjustments to GOMAXPROCS. It seems that diminishing returns hit somewhere around 16 concurrent clients. Note that I’m not experiencing any losses, and I’m using no special tuning options in mangos.

With TCP, using a single client, I get 18k rtt/sec; with 16 clients it’s about 43k rtt/sec. This is over loopback on my Mac. That corresponds to 86k messages per second. A bit lower than I’d like, but again this is completely untuned. (Interestingly enough, the code stalls at around 44k rtt/sec at 32 clients.) Bumping up GOMAXPROCS seems to help further, getting me to about 57k rtt/sec (114k msgs/sec).

There are certainly some enhancements that can be made to improve the rtt performance. For example, the messages are sent using two Write() operations, leading to two TCP segments per message. This is really bad for performance, and needs to be fixed. I should probably also make an effort to ensure that TCP nodelay is set; I think that’s missing at present.
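
To illustrate what that fix amounts to (this is not the actual mangos transport code, and the 8-byte length prefix below is just a stand-in framing header), coalescing the header and body into a single Write, with Nagle disabled via SetNoDelay, looks roughly like this:

// Illustration only; the real change belongs inside the mangos TCP transport.
package tcpframe

import (
    "encoding/binary"
    "net"
)

// sendFrame writes a length-prefixed frame with one Write call, so the
// header and body go out as a single TCP segment rather than two.
func sendFrame(conn *net.TCPConn, body []byte) error {
    // Disable Nagle so small frames are not delayed waiting for ACKs.
    if err := conn.SetNoDelay(true); err != nil {
        return err
    }
    buf := make([]byte, 8+len(body))
    binary.BigEndian.PutUint64(buf[:8], uint64(len(body)))
    copy(buf[8:], body)
    _, err := conn.Write(buf)
    return err
}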

I’m fairly confident that your test results are skewed due to incorrect measurements.

The fact that the low client count results in extraordinarily high messages/sec really does tell me quite clearly that you’re not measuring actual transit times. Indeed, mangos performs better as you increase the client count, since you wind up shoveling more data and having threads spend less time waiting (parallelization wins). The fact that this is not true for redis or zmq tells me that either they have horribly broken concurrency (they don’t), or your test methodology is busted.

I’m not prepared to go into detail about the test quality, but here’s the code I used for mangos validation with pub/sub. Feel free to adjust as you like. You can change the addresses, and the clients and loops variables. You do have to add the various client results together, but what I did was just take the middle results and multiply by the client count. Not extremely precise, but I think it’s pretty close to accurate, especially for sufficiently long runs.

gdamore commented 7 years ago

package main

import (
    "fmt"
    "sync"
    "time"

    "github.com/go-mangos/mangos"
    "github.com/go-mangos/mangos/protocol/pub"
    "github.com/go-mangos/mangos/protocol/sub"
    "github.com/go-mangos/mangos/transport/tcp"
    "github.com/go-mangos/mangos/transport/inproc"
)

var addr1 = "tcp://127.0.0.1:4455"
var addr2 = "tcp://127.0.0.1:4456"
//var addr1 = "inproc://127.0.0.1:4455"
//var addr2 = "inproc://127.0.0.1:4456"

// client dials both of the server's sockets, then repeatedly publishes a
// message and waits for it to come back, timing round trips.
func client(loops int) {
    p, e := pub.NewSocket()
    if e != nil {
        panic(e.Error())
    }
    defer p.Close()
    s, e := sub.NewSocket()
    if e != nil {
        panic(e.Error())
    }
    defer s.Close()

    p.AddTransport(tcp.NewTransport())
    s.AddTransport(tcp.NewTransport())
    p.AddTransport(inproc.NewTransport())
    s.AddTransport(inproc.NewTransport())

    s.SetOption(mangos.OptionSubscribe, []byte{})

    if e = p.Dial(addr1); e != nil {
        panic(e.Error())
    }
    if e = s.Dial(addr2); e != nil {
        panic(e.Error())
    }

    msg := mangos.NewMessage(8)
    msg.Body = append(msg.Body, []byte("hello")...)
    // Note: the timestamp is taken before the 100 ms connection-settling
    // sleep, so that delay is included in the timed interval (this is
    // acknowledged and explained in the follow-up comments below).
    now := time.Now()

    time.Sleep(time.Millisecond * 100)

    for i := 0; i < loops; i++ {
        if e = p.SendMsg(msg); e != nil {
            panic(e.Error())
        }
        if msg, e = s.RecvMsg(); e != nil {
            panic(e.Error())
        }
    }
    end := time.Now()
    delta := float64(end.Sub(now))/float64(time.Second)

    fmt.Printf("Client %d RTTs in %f secs (%f rtt/sec)\n",
        loops, delta, float64(loops)/delta);
}

// server republishes everything received on its SUB socket out its PUB
// socket, closing the loop for the clients.
func server() {
    p, e := pub.NewSocket()
    if e != nil {
        panic(e.Error())
    }
    defer p.Close()
    s, e := sub.NewSocket()
    if e != nil {
        panic(e.Error())
    }
    defer s.Close()
    p.AddTransport(tcp.NewTransport())
    s.AddTransport(tcp.NewTransport())
    s.AddTransport(inproc.NewTransport())
    p.AddTransport(inproc.NewTransport())
    s.SetOption(mangos.OptionSubscribe, []byte{})

    s.Listen(addr1)
    p.Listen(addr2)

    for {
        msg, e := s.RecvMsg()
        if e != nil {
            println(e.Error())
            return
        }
        e = p.SendMsg(msg)
        if e != nil {
            println(e.Error())
            return
        }
    }
}

func main() {
    clients := 32
    loops := 10000

    go server()
    time.Sleep(time.Millisecond*100)

    wg := sync.WaitGroup{}
    wg.Add(clients)

    for i := 0; i < clients; i++ {
        go func() {
            defer wg.Done()
            client(loops)
        }()
    }

    wg.Wait()
}

gdamore commented 7 years ago

I see that I actually added a 100 ms delay inside the initial timing, so my numbers are 100 ms worse than they should be. But for large message counts that should amortize into the noise. Still should fix that....

gdamore commented 7 years ago

The sleep is there to ensure that the connections are established. It can take several dozen milliseconds for the TCP connections and the goroutines on both sides to get set up.
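
For reference, the eventual fix in client() above is simply to keep that settling sleep but start the clock after it:

    // Let the dialers finish establishing their pipes first...
    time.Sleep(time.Millisecond * 100)
    // ...and only then start the clock, so the settling delay is not
    // counted in the measured interval.
    now := time.Now()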

gdamore commented 6 years ago

Statistics support will be a mangos v2 deliverable, probably following along with NNG in that regard.

BHare1985 commented 7 months ago

Did this ever make it into v2? If not, is there something on the roadmap? I am curious about using mangos to write my own pub/sub system in order to reduce code complexity and keep things maintainable (KISS), but I would be interested to know how it benchmarks/performs with very little helper code and a design aimed at throughput.

gdamore commented 7 months ago

It did not. Problem here is too many projects and not enough time.

If this is important enough to you to sponsor (or to contribute code) let me know.