Thanks for participating, but unfortunately I cannot merge these changes due to the problems described above.
It's a pity that you didn't even measure the proposed changes and their impact. I am sharing not just my "thoughts" but experience, and you talk about it as if it were "premature optimization". Yes, it is premature optimization if you never plan to reach 100000 rps or more. But at that rate it is profile-guided optimization, because I did profile at that rate.
I just benchmarked a blocking select against a non-blocking select and was impressed by the numbers:
package bench

import "testing"

// makeChans returns a blocking channel (never ready) and a non-blocking
// one (closed, so receives always succeed immediately).
func makeChans() (bCh chan struct{}, nbCh chan struct{}) {
    nbCh = make(chan struct{})
    close(nbCh)
    bCh = make(chan struct{})
    return
}

func BenchmarkBlockingSelect(b *testing.B) {
    bCh, nbCh := makeChans()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        select {
        case <-nbCh: // query queue emulation
        case <-bCh: // timer emulation
        }
    }
}

func BenchmarkNonBlockingSelect(b *testing.B) {
    bCh, nbCh := makeChans()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        select {
        case <-nbCh: // fast path: non-blocking receive
        default:
            b.Fatalf("unexpected code path")
            // Fallback kept for symmetry with the real code; unreachable
            // here because nbCh is closed.
            select {
            case <-nbCh:
            case <-bCh:
            }
        }
    }
}
Results:
BenchmarkBlockingSelect       5000000    256  ns/op
BenchmarkNonBlockingSelect   30000000     43.2 ns/op
This shows that your select change speeds it up by nearly 6x! The change has been adopted in the code; see https://github.com/valyala/gorpc/commit/ec896817a62b8d935de640ecceae7db2560c9b75 .
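Here is a minimal sketch of the pattern that commit applies, with illustrative names (request, handle, dispatchLoop and stopCh are assumptions, not gorpc's actual identifiers): try a non-blocking receive on the hot path, and fall back to a blocking select only when the queue is empty.

package sketch

type request struct{} // payload fields elided

func handle(r *request) {} // stand-in for real request processing

func dispatchLoop(requestQueue <-chan *request, stopCh <-chan struct{}) {
    for {
        select {
        case r := <-requestQueue: // fast path: queue already has work
            handle(r)
        default:
            // Slow path: block until a request or a stop signal arrives.
            select {
            case r := <-requestQueue:
                handle(r)
            case <-stopCh:
                return
            }
        }
    }
}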
FYI, I slightly updated the benchmark code to be closer to reality:
package bench

import "testing"

// makeChans returns a blocking channel and a buffered channel with room
// for n sends, so sends into nbCh never block during the benchmark.
func makeChans(n int) (bCh chan struct{}, nbCh chan struct{}) {
    nbCh = make(chan struct{}, n)
    bCh = make(chan struct{})
    return
}

func BenchmarkBlockingSelect(b *testing.B) {
    bCh, nbCh := makeChans(b.N)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        select {
        case nbCh <- struct{}{}: // query queue emulation
        case <-bCh: // timer emulation
        }
    }
}

func BenchmarkNonBlockingSelect(b *testing.B) {
    bCh, nbCh := makeChans(b.N)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        select {
        case nbCh <- struct{}{}: // fast path: non-blocking send
        default:
            b.Fatalf("unexpected code path")
            // Fallback kept for symmetry; unreachable while nbCh has
            // free capacity.
            select {
            case nbCh <- struct{}{}:
            case <-bCh:
            }
        }
    }
}
But this didn't change the benchmark results.
As for the time.After() -> time.NewTimer() change, benchmark results show this was a worthless optimization:
package bench

import (
    "testing"
    "time"
)

func BenchmarkTimerStop(b *testing.B) {
    for i := 0; i < b.N; i++ {
        t := time.NewTimer(time.Millisecond)
        select {
        case <-t.C:
        default:
            t.Stop() // stop the timer explicitly if it hasn't fired yet
        }
    }
}

func BenchmarkTimerNoStop(b *testing.B) {
    for i := 0; i < b.N; i++ {
        tc := time.After(time.Millisecond)
        select {
        case <-tc:
        default: // leave the timer to expire on its own
        }
    }
}
Results:
BenchmarkTimerStop-4     1000000   1483 ns/op   192 B/op   3 allocs/op
BenchmarkTimerNoStop-4   2000000    854 ns/op   192 B/op   3 allocs/op
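For context: the usual argument for time.NewTimer over time.After in a loop is timer lifetime rather than per-iteration speed. The time package docs note that the Timer underlying time.After is not recovered by the garbage collector until it fires, while an explicit Timer can be stopped early. A minimal sketch of the stop-early pattern (recvWithTimeout is an illustrative name):

package sketch

import "time"

// recvWithTimeout waits for a value or a timeout. Stopping the timer on
// the fast path releases it promptly; with time.After it would linger
// until it fires. Illustrative sketch, not gorpc code.
func recvWithTimeout(ch <-chan int, d time.Duration) (int, bool) {
    t := time.NewTimer(d)
    defer t.Stop()
    select {
    case v := <-ch:
        return v, true
    case <-t.C:
        return 0, false
    }
}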
Yeah, you are right about timers: I benchmarked this back when Go was 1.1, when each timer expiration created a goroutine. When I asked for a way to call a function within the timer goroutine, they refused. And now they do call timer functions in the timer goroutine... just as I had asked, after they refused...
FYI, I added the ability to disable message buffering on both the client and the server for those who need minimal RPC latency. But benchmark results say that gorpc has better throughput with message buffering enabled:
10K workers, default 5ms delay for message buffering:
BenchmarkEchoInt10000Workers-4                 100000   10849 ns/op
BenchmarkEchoIntNocompress10000Workers-4       200000   10088 ns/op
BenchmarkEchoString10000Workers-4              100000   11528 ns/op
BenchmarkEchoStringNocompress10000Workers-4    100000   11218 ns/op
BenchmarkEchoStruct10000Workers-4              100000   14669 ns/op
BenchmarkEchoStructNocompress10000Workers-4    100000   13533 ns/op
10K workers, disabled message buffering:
BenchmarkEchoInt10000Workers-4                  50000   27190 ns/op
BenchmarkEchoIntNocompress10000Workers-4       100000   14499 ns/op
BenchmarkEchoString10000Workers-4               50000   29251 ns/op
BenchmarkEchoStringNocompress10000Workers-4    100000   16329 ns/op
BenchmarkEchoStruct10000Workers-4               50000   32244 ns/op
BenchmarkEchoStructNocompress10000Workers-4    100000   19387 ns/op
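For context, "message buffering" here means batching writes through a buffered writer and flushing on a delay. A rough sketch of such a writer loop, with illustrative names; gorpc's real implementation differs in detail:

package sketch

import (
    "bufio"
    "time"
)

// writerLoop batches messages and flushes at most once per flushDelay,
// trading latency for fewer syscalls. Illustrative, not gorpc's code.
func writerLoop(msgs <-chan []byte, w *bufio.Writer, flushDelay time.Duration) {
    var flushCh <-chan time.Time // nil until a flush is pending
    for {
        select {
        case m := <-msgs:
            w.Write(m) // error handling elided for brevity
            if flushCh == nil {
                flushCh = time.After(flushDelay) // schedule a delayed flush
            }
        case <-flushCh:
            w.Flush()
            flushCh = nil
        }
    }
}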
Your code is not equivalent to my proposal; I'll try to fix it and post benchmarks soon. My proposal didn't disable buffering, it just makes the buffering low-latency.
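A minimal sketch of that low-latency buffering idea, under the assumption that it means "flush as soon as the queue drains instead of waiting out the flush delay" (illustrative names, not the actual patch):

package sketch

import "bufio"

// lowLatencyWriterLoop keeps the batching but drops the fixed delay:
// it drains whatever is already queued, then flushes immediately.
func lowLatencyWriterLoop(msgs <-chan []byte, w *bufio.Writer) {
    for m := range msgs {
        w.Write(m) // first message of the batch; errors elided
    drain:
        for {
            select {
            case m = <-msgs:
                w.Write(m) // more messages already queued: keep batching
            default:
                break drain // queue is empty
            }
        }
        w.Flush() // flush as soon as the queue drains, no timer delay
    }
}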
Made a separate pull request #2 with benchmarks.