golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

proposal: net: TCPConn supports Writev #13451

Closed winlinvip closed 7 years ago

winlinvip commented 8 years ago

I have searched go-nuts and Google for writev support on Unix, and it seems Go does not support it yet. I found a project, vectorio, which claims to support writev but does not work.

I am rewriting SRS as go-oryx. SRS can support 7.5k clients per CPU, while the Go version only supports 2k, because a media streaming server has to deliver video and audio packets composed of several parts, and it uses writev to send them to the client without copying the bytes.

I tried to use reflect to implement writev, but found it is not feasible because pollDesc is not exported; the commit is here.

Does Go plan to support writev (netFD.Writev)?

winlinvip commented 8 years ago

I want to explain why writev is important for a media streaming server:

  1. A media streaming server continuously delivers audio and video packets over each connection.
  2. These audio and video packets can be sent with a single writev call to avoid making too many syscalls.
  3. writev also avoids copying the video data into an intermediate buffer, which matters because the video payload is very large.

I have used CPU and memory profiles to find the bottleneck: the memory copies and syscalls caused by using netFD.Write hurt the performance.

Terry-Mao commented 8 years ago

If we had writev, it would be easy for us to implement a zero-copy service.

Terry-Mao commented 8 years ago

@ianlancetaylor For a network proxy, many network protocols have TWO parts, a HEADER and a DATA part. If netFD had writev and it were also supported by bufio, there would be no need to do this:

// The reader and writer goroutines call Write concurrently, so we must lock.
lock.Lock()
w.Write(head)
w.Write(data)
lock.Unlock()

But netFD already takes fd.writeLock() internally, so the caller's lock.Lock() is wasted work. Something like this would be much nicer:

// Write is atomic and goroutine-safe
w.Write(head, data)

func (w *Writer) Write(data ...[]byte) (n int, err error)
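
For illustration, here is a minimal sketch (the type name is made up) of the workaround available today: gather the parts into one buffer under a single lock and issue one Write. This is atomic, but it pays exactly the copy that writev would avoid.

package main

import (
    "bytes"
    "io"
    "sync"
)

// GatherWriter is a hypothetical workaround: it makes the header+data write
// atomic by copying both into one buffer under its own lock and issuing a
// single Write call.
type GatherWriter struct {
    mu  sync.Mutex
    w   io.Writer
    buf bytes.Buffer
}

func (g *GatherWriter) Write(bufs ...[]byte) (int, error) {
    g.mu.Lock()
    defer g.mu.Unlock()
    g.buf.Reset()
    for _, b := range bufs {
        g.buf.Write(b) // copy every part into one contiguous buffer
    }
    return g.w.Write(g.buf.Bytes()) // one write for the whole batch
}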
winlinvip commented 8 years ago

For a streaming service, for example sending a video packet to an RTMP (over TCP) client, the video might be:

video=make([]byte, 156 * 1024)

Then we need to insert a small header before every chunk of video bytes, for instance every 10KB:

h0 = make([]byte, 12)
p0 = video[0:10240]

h1 = make([]byte, 5)
p1 = video[10240:20480]

......

// ... and so on, until hN and pN
hN = make([]byte, 5)
pN = video[x:]

Right now, we send the data in a very slow way:

// merge everything into one big buffer
bigBuffer := bytes.Buffer{}
for _, b := range [][]byte{h0, p0, h1, p1, ..., hN, pN} {
    bigBuffer.Write(b)
}

// send it in a single syscall
netFD.Write(bigBuffer.Bytes())

We do this because a syscall is more expensive than copying the buffer; writing each part separately is very slow:

// very, very slow: one syscall per part
for _, b := range [][]byte{h0, p0, h1, p1, ..., hN, pN} {
    netFD.Write(b)
}

If Go supported writev, we could send everything in one syscall without copying into a big buffer:

// high-efficiency writev for a streaming server
Write(h0, p0, h1, p1, ......, hN, pN)
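
For comparison, a rough sketch of what a gather write looks like if one drops down to the raw fd with golang.org/x/sys/unix. This bypasses the net package and its poller integration, so it is only an illustration of the syscall shape, not a recommendation; the helper name is made up and error handling, partial writes, and non-blocking behavior are ignored.

package main

import (
    "net"

    "golang.org/x/sys/unix"
)

// writevTCP is a hypothetical helper: it duplicates the connection's
// descriptor and passes all buffers to writev(2) in one syscall.
func writevTCP(c *net.TCPConn, bufs [][]byte) (int, error) {
    f, err := c.File() // dup(2) of the underlying socket
    if err != nil {
        return 0, err
    }
    defer f.Close()
    return unix.Writev(int(f.Fd()), bufs)
}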
davecheney commented 8 years ago
// merge all to a big buffer
bigBuffer = bytes.Buffer{}
for b in (h0, p0, h1, p1, ..., hN, pN) {
    bigBuffer.Write(b)
}

h0, p0, etc all derive from a single []byte slice, why do you slice them up, then copy them back together? Sending the original slice would avoid the copy, and avoid extending the net.Conn interface.

winlinvip commented 8 years ago

@davecheney No, only p0, p1, ..., pN are slices of video; h0, h1, ..., hN come from other buffers. That is, we slice the video payload:

// it's ok to use slice, without copy.
video = p0 + p1 + ... + pN

But the headers are separate buffers:

headers = h0 + h1 + ... + hN

We have to interleave the headers and payloads into one big buffer:

big-buffer = h0 + p0 + h1 + p1 + ... + hN + pN

because each header belongs to its payload; for instance, hN is only for pN. When using C/C++ to write to a socket, we can use writev:

struct iovec iovs[N * 2];
for (int i = 0; i < N; i++) {
    iovs[2*i].iov_base   = h[i];   /* header:  h0, h1, ..., hN */
    iovs[2*i].iov_len    = hlen[i];
    iovs[2*i+1].iov_base = p[i];   /* payload: p0, p1, ..., pN */
    iovs[2*i+1].iov_len  = plen[i];
}
writev(socket_fd, iovs, N * 2);

Does that make sense?

ianlancetaylor commented 8 years ago

I'm fine with adding a Writev call to the net package. I'm not aware of anybody actually working on it.

winlinvip commented 8 years ago

@ianlancetaylor Great~

I created SRS, rewrote it in Go as go-oryx, and wrote the benchmark tool srs-bench for streaming servers.

SRS can serve 7.5K clients with 1 CPU (about 80% usage); the bandwidth is about 3.75Gbps. SRS is very efficient because it uses writev to avoid copies and issues few syscalls. But SRS uses a single-process model, which is limiting because a streaming server is quite complex.

Nginx-RTMP, which is an nginx plugin, only supports 2k clients per CPU, but it does support multiple processes.

I created go-oryx because I want to use Go's ability to use multiple processors. Right now, after lots of performance tuning, go-oryx can support 8k clients with 4 CPUs. If Go supported writev, I think the performance could improve by 200% to 400%, that is, about 16k to 32k clients on 4 CPUs. That would be really awesome.

winlinvip commented 8 years ago

@davecheney @ianlancetaylor What's the status of this issue now? Accepted or postponed? Is there any plan?

bradfitz commented 8 years ago

@winlinvip, are you sure your problem is Writev and not, say, allocations? Some of your examples above look very allocation-happy.

Before accepting this proposal, I'd want to see before & after programs with numbers and profiles to show that it helps enough. There will definitely be pushback about expanding the net package's API, which many feel is already too large.

That is, convince us with data perhaps. Implement this in the net package and then show us numbers with how much it helped. We have to be able to reproduce the numbers.

winlinvip commented 8 years ago

@bradfitz I tried the no-copy version that sends each []byte one by one, but it causes lots of syscalls, which hurts performance even more. So I copy the data into a big buffer and send it with one syscall. The big-buffer solution is better than writing one by one; I have profiled it and tested it with mock clients. Please read https://github.com/ossrs/go-oryx/pull/20

winlinvip commented 8 years ago

@bradfitz Also, writev is not a new API or language feature; it exists on Linux, Unix, and Unix-like OSes. It is very useful for high-performance servers, and I found that nginx also uses writev:

find . -name "*.c"|xargs grep -in "= writev("
./os/unix/ngx_darwin_sendfile_chain.c:285:            rc = writev(c->fd, header.elts, header.nelts);
./os/unix/ngx_files.c:229:        n = writev(file->fd, vec.elts, vec.nelts);
./os/unix/ngx_freebsd_sendfile_chain.c:336:            rc = writev(c->fd, header.elts, header.nelts);
./os/unix/ngx_linux_sendfile_chain.c:294:            rc = writev(c->fd, header.elts, header.nelts);
./os/unix/ngx_writev_chain.c:113:        n = writev(c->fd, vec.elts, vec.nelts);
winlinvip commented 8 years ago

It seems writev was standardized in 2001, https://en.wikipedia.org/wiki/Vectored_I/O

Standards bodies document the applicable functions readv[1] and writev[2] 
in POSIX 1003.1-2001 and the Single UNIX Specification version 2.
kostya-sh commented 8 years ago

@winlinvip, can you try to modify variant 6 to avoid the allocations of the group buffers by populating buf (https://github.com/winlinvip/go-writev/blob/master/golang/server.go#L123) directly? I think this will be roughly equivalent to calling writev on multiple buffers. What does your benchmark show for this program?

winlinvip commented 8 years ago

@kostya-sh Did you compare it with the C++ version? The allocation is not the bottleneck; the memory copies and syscalls are.

winlinvip commented 8 years ago

This is really a very basic problem for servers. Even though Go is a modern language, it compiles to native code and runs on Linux, and memory copies and syscalls are always the bottleneck for a server; I have profiled it.

bradfitz commented 8 years ago

Please post a profile here.

kostya-sh commented 8 years ago

@winlinvip, by removing allocations from your test program (variant 6) you will end up with a single syscall per write. This way you can estimate how fast your Go program would be if it used writev. If the improvement to the Go application is significant, that would be an argument in favor of adding writev.

If you can implement writev in your local copy of the Go standard library, test the performance, and post the numbers here, that would be even better.

winlinvip commented 8 years ago

@kostya-sh I will try to add my writev implementation to the Go standard library and report the results.

winlinvip commented 8 years ago

I will also test the real streaming server, go-oryx, later.

ggaaooppeenngg commented 8 years ago

@winlinvip It seems the number of syscalls grows; I guess some other syscalls arise as a side effect. From the diff, maybe some of these calls?

    18.22s  5.47% 44.58%        38s 11.41%  runtime.scanobject
    12.18s  3.66% 48.24%     28.46s  8.55%  runtime.selectgoImpl
     9.93s  2.98% 51.22%      9.93s  2.98%  runtime.heapBitsForObject
     8.89s  2.67% 53.89%     13.78s  4.14%  runtime.greyobject
     8.17s  2.45% 56.35%     37.90s 11.38%  runtime.mallocgc
bradfitz commented 8 years ago

@ggaaooppeenngg, selectGoImpl is about select. The rest are garbage collection: paying the cost of allocating memory.

winlinvip commented 8 years ago

A similar implementation of writev in a coroutine library:

ssize_t st_writev(_st_netfd_t *fd, const struct iovec *iov, int iov_size, st_utime_t timeout)
{
    ssize_t n, rv;
    size_t nleft, nbyte;
    int index, iov_cnt;
    struct iovec *tmp_iov;
    struct iovec local_iov[_LOCAL_MAXIOV];

    /* Calculate the total number of bytes to be sent */
    nbyte = 0;
    for (index = 0; index < iov_size; index++) {
        nbyte += iov[index].iov_len;
    }

    rv = (ssize_t)nbyte;
    nleft = nbyte;
    tmp_iov = (struct iovec *) iov; /* we promise not to modify iov */
    iov_cnt = iov_size;

    while (nleft > 0) {
        if (iov_cnt == 1) {
            if (st_write(fd, tmp_iov[0].iov_base, nleft, timeout) != (ssize_t) nleft) {
                rv = -1;
            }
            break;
        }

        if ((n = writev(fd->osfd, tmp_iov, iov_cnt)) < 0) {
            if (errno == EINTR) {
                continue;
            }
            if (!_IO_NOT_READY_ERROR) {
                rv = -1;
                break;
            }
        } else {
            if ((size_t) n == nleft) {
                break;
            }
            nleft -= n;
            /* Find the next unwritten vector */
            n = (ssize_t)(nbyte - nleft);
            for (index = 0; (size_t) n >= iov[index].iov_len; index++) {
                n -= iov[index].iov_len;
            }

            if (tmp_iov == iov) {
                /* Must copy iov's around */
                if (iov_size - index <= _LOCAL_MAXIOV) {
                    tmp_iov = local_iov;
                } else {
                    tmp_iov = calloc(1, (iov_size - index) * sizeof(struct iovec));
                    if (tmp_iov == NULL) {
                        return -1;
                    }
                }
            }

            /* Fill in the first partial read */
            tmp_iov[0].iov_base = &(((char *)iov[index].iov_base)[n]);
            tmp_iov[0].iov_len = iov[index].iov_len - n;
            index++;
            /* Copy the remaining vectors */
            for (iov_cnt = 1; index < iov_size; iov_cnt++, index++) {
                tmp_iov[iov_cnt].iov_base = iov[index].iov_base;
                tmp_iov[iov_cnt].iov_len = iov[index].iov_len;
            }
        }

        /* Wait until the socket becomes writable */
        if (st_netfd_poll(fd, POLLOUT, timeout) < 0) {
            rv = -1;
            break;
        }
    }

    if (tmp_iov != iov && tmp_iov != local_iov) {
        free(tmp_iov);
    }

    return rv;
}
Terry-Mao commented 8 years ago

@winlinvip in your C/C++ version, struct iovec local_iov[_LOCAL_MAXIOV]; is allocated on the stack; in your Go version it is on the heap.

winlinvip commented 8 years ago

@bradfitz Below are my research results from go-oryx:

OS     API     CPU   MEM   GC    Connections  Bitrate
Linux  Write   160%  5.4G  40ms  10k          300kbps
Linux  Writev  140%  1.5G  30ms  10k          300kbps

Conclusions:

  1. The CPU usage of writev (140%) is almost the same as write (160%) with a big buffer.
  2. writev (1.5G) uses much less memory than write (5.4G), because writev avoids copying into a big buffer.
  3. writev's GC pauses (30ms) are shorter than write's (40ms).
winlinvip commented 8 years ago

Will Go accept the proposal to support netFD.Writev?

bradfitz commented 8 years ago

@winlinvip, thanks for prototyping.

Can you make a small stand-alone program for profiling that can do either multiple writes or writev (controlled by a flag) and link to the code so others can reproduce your results and investigate more?

I also have an API idea. Instead of adding API like (*TCPConn).Writev([][]byte), we could have a special type that implements io.WriterTo that represents the multiple buffers (the [][]byte) and TCPConn's ReadFrom can special-case it into a writev internally, just like it does sendfile.

Then we can add the writev optimization gradually to different OS/conn types over time.
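
For illustration, a sketch of what that could look like for the caller, assuming a [][]byte-backed type that implements io.WriterTo; this is essentially the shape of the net.Buffers type that later shipped in Go 1.8, and the function name here is made up.

package main

import "net"

// sendFrames interleaves headers and payloads without copying them into one
// contiguous buffer; WriteTo lets the net package turn the batch into a
// single writev on platforms that support it.
func sendFrames(c *net.TCPConn, headers, payloads [][]byte) error {
    bufs := make(net.Buffers, 0, len(headers)*2)
    for i := range headers {
        bufs = append(bufs, headers[i], payloads[i])
    }
    _, err := bufs.WriteTo(c)
    return err
}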

ianlancetaylor commented 8 years ago

A time one especially wants writev is when writing out a UDP packet where the data comes from multiple places. I guess you could do that using io.MultiReader with a list of bytes.Buffer values: io.Copy that to the UDPConn. Sounds a little baroque, though. Also the UDPConn ReadFrom method is not the one that io.ReaderFrom wants. Hmmm.
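
To make that concrete, here is a sketch of the io.MultiReader route for a connected UDP socket (the function name is made up); it also shows why the approach falls short today.

package main

import (
    "bytes"
    "io"
    "net"
)

// sendPacket stitches a header and payload together with io.MultiReader and
// copies the result to a connected *net.UDPConn. Note the catch: io.Copy's
// generic path reads one sub-reader at a time and writes each chunk as it
// goes, so the header and payload would go out as separate datagrams unless
// the connection could special-case the multi-buffer source, which is
// exactly the io.ReaderFrom mismatch mentioned above.
func sendPacket(c *net.UDPConn, header, payload []byte) error {
    r := io.MultiReader(bytes.NewReader(header), bytes.NewReader(payload))
    _, err := io.Copy(c, r)
    return err
}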

mikioh commented 8 years ago

@winlinvip,

Can you please change the description appropriately? For example, "proposal: net: do blah blah" (see https://github.com/golang/proposal for further information). I'm still not sure what the original objective is: is it just to have scatter-gather IO APIs, or something more, including a fancy packet-assembly framework and primitives? A few random thoughts.

  1. Please use sendmsg instead of writev internally. Both have the same scatter-gather functionality but sendmsg is more useful for datagram-based unconnected-mode sockets and we can reuse it for #7882 later.
  2. I think it's fine to have different scatter-gather IO APIs between stream-based connected-mode sockets and datagram-based unconnected-mode sockets because as @ianlancetaylor mentioned, interfaces defined in io package are not designed for the latter. Fortunately we can use UDPConn in both styles by using two connection setup functions Dial and ListenPacket appropriately.
winlinvip commented 8 years ago

@mikioh I changed the description to "proposal: net: TCPConn supports Writev", because I think writev is at least very useful for TCP servers. Maybe UDP servers need this API too, but I am not sure. @ianlancetaylor I have only written one UDP server (an RTMFP server), and I am not sure whether writev is useful for UDP servers; I will do some research. @bradfitz I will write a standalone program to profile write versus writev. I will rewrite go-writev to use Go code only (the C++ code will move to another branch).

About the API and implementation, for example whether to use sendmsg, ReadFrom, or MultiReader, I need some time to study the code.

winlinvip commented 8 years ago

@bradfitz I have rewritten go-writev to compare write and writev. The code is at go-writev.

Writev (28Gbps) is more than 10 times faster than Write (2.2Gbps); it's awesome.

writev is much faster than write because it avoids lots of syscalls. That is why go-oryx copies all iovecs into a big buffer and then sends it with one write (faster than multiple writes, but it makes go-oryx consume lots of memory). For a server, writev avoids the memory copies and makes few syscalls.

bradfitz commented 8 years ago

Your example at https://github.com/winlinvip/go-writev/blob/master/tcpserver/server.go#L156 isn't exactly fair to the non-writev case. It sends a packet per write. Can you also add a mode where you use a bufio.Writer and write each small []byte to the bufio.Writer, then bufio.Writer.Flush at the end?
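
Something like the following sketch, presumably (the buffer size and names are assumptions): write each small []byte through a bufio.Writer and flush once at the end, so the non-writev case also batches into few syscalls.

package main

import (
    "bufio"
    "net"
)

// writeBuffered sends each small buffer through a bufio.Writer and flushes
// once at the end of the batch.
func writeBuffered(conn net.Conn, iovecs [][]byte) error {
    bw := bufio.NewWriterSize(conn, 64*1024) // size is an assumption
    for _, b := range iovecs {
        if _, err := bw.Write(b); err != nil {
            return err
        }
    }
    return bw.Flush()
}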

winlinvip commented 8 years ago

I refined the code to always build the packets before sending them, because a real server constantly receives packets from some clients and delivers them to other clients. The CR is here.

It seems writev is about 4 times faster than Write+bufio.

bradfitz commented 8 years ago

@winlinvip, but you're allocating per iteration. Let's leave all GC out of this benchmark.

winlinvip commented 8 years ago

@bradfitz Sorry, that was a bug in the write+bufio version: I should call NewWriterSize with the total size of the iovecs to avoid multiple syscalls. I have fixed it here.

winlinvip commented 8 years ago

@bradfitz To keep GC out of this benchmark, I fixed it in commit0 and commit1.

winlinvip commented 8 years ago

I tested the server with one client and with multiple clients, using the go-writev prototype.

The benchmark results are:

ID  OS     API     CPU      MEM    BW         GC      Clients  CCPU
1   Linux  Write   100%     8M     12810Mbps  0       1        78.5%
2   Linux  Writev  100.3%   7M     11519Mbps  0       1        105.3%
3   Linux  Write   800.1%   546M   77379Mbps  976µs   1000     799.9%
4   Linux  Writev  789.7%   39M    78261Mbps  2.19ms  1000     723.7%
5   Linux  Write   804.3%   1.1G   58926Mbps  3.63ms  2000     798.0%
6   Linux  Writev  790.1%   68M    77904Mbps  4.07ms  2000     674.5%
7   Linux  Write   1246.2%  1.6G   58800Mbps  3.15ms  3000     800.6%
8   Linux  Writev  802.6%   73M    78083Mbps  2.84ms  3000     704.9%
9   Linux  Write   597.8%   2.1G   63852Mbps  6.5ms   4000     599.8%
10  Linux  Writev  593.3%   89M    68355Mbps  6.4ms   4000     510.1%
11  Linux  Write   601.3%   2.6G   73028Mbps  1.06s   5000     542.6%
12  Linux  Writev  591.8%   114M   71373Mbps  6.5ms   5000     525.2%
13  Linux  Write   590.3%   3.1G   72853Mbps  8.04ms  6000     533.7%
14  Linux  Writev  598.6%   142M   82191Mbps  13ms    6000     542.5%
15  Linux  Write   599.8%   3.7G   71342Mbps  9.49ms  7000     547.4%
16  Linux  Writev  592.7%   154M   67619Mbps  26ms    7000     515.9%
17  Linux  Write   598.6%   5.2G   73846Mbps  7.02ms  10000    563.2%
18  Linux  Writev  601.8%   267M   70770Mbps  22.5ms  10000    543.9%

Remark:

The profile for server 1(write+bufio, serve 1 client):

134.45s of 135.18s total (99.46%)
Dropped 43 nodes (cum <= 0.68s)
Showing top 10 nodes out of 12 (cum >= 105.85s)
      flat  flat%   sum%        cum   cum%
   105.15s 77.79% 77.79%    105.40s 77.97%  syscall.Syscall
    15.56s 11.51% 89.30%     15.56s 11.51%  runtime.memmove
    10.34s  7.65% 96.94%     25.90s 19.16%  bufio.(*Writer).Write
     3.20s  2.37% 99.31%    135.01s 99.87%  main.srs_serve

The profile for server 2(writev, serve 1 client):

280.57s of 281.62s total (99.63%)
Dropped 44 nodes (cum <= 1.41s)
      flat  flat%   sum%        cum   cum%
   273.99s 97.29% 97.29%    274.39s 97.43%  syscall.Syscall
     6.45s  2.29% 99.58%    281.38s 99.91%  net.(*netFD).Writev

The profile for server 3(write+bufio, serve 1000 clients):

815.38s of 827.36s total (98.55%)
Dropped 189 nodes (cum <= 4.14s)
Showing top 10 nodes out of 12 (cum >= 622.36s)
      flat  flat%   sum%        cum   cum%
   616.35s 74.50% 74.50%    618.03s 74.70%  syscall.Syscall
    96.01s 11.60% 86.10%     96.01s 11.60%  runtime.memmove
    57.75s  6.98% 93.08%    153.68s 18.57%  bufio.(*Writer).Write
    43.87s  5.30% 98.38%    821.45s 99.29%  main.srs_serve

The profile for server 4(writev, serve 1000 clients):

593.54s of 602.01s total (98.59%)
Dropped 167 nodes (cum <= 3.01s)
      flat  flat%   sum%        cum   cum%
   577.70s 95.96% 95.96%    578.64s 96.12%  syscall.Syscall
    15.54s  2.58% 98.54%    596.44s 99.07%  net.(*netFD).Writev

The profile for server 5(write+bufio, serve 2000 clients):

855.32s of 870.39s total (98.27%)
Dropped 209 nodes (cum <= 4.35s)
Showing top 10 nodes out of 19 (cum >= 5.93s)
      flat  flat%   sum%        cum   cum%
   637.77s 73.27% 73.27%    639.53s 73.48%  syscall.Syscall
       91s 10.46% 83.73%        91s 10.46%  runtime.memmove
    71.65s  8.23% 91.96%    854.65s 98.19%  main.srs_serve
    46.23s  5.31% 97.27%    137.08s 15.75%  bufio.(*Writer).Write

The profile for server 6(writev, serve 2000 clients):

739.71s of 751.98s total (98.37%)
Dropped 205 nodes (cum <= 3.76s)
      flat  flat%   sum%        cum   cum%
   720.12s 95.76% 95.76%    721.32s 95.92%  syscall.Syscall
    19.16s  2.55% 98.31%    743.46s 98.87%  net.(*netFD).Writev

The profile for server 7(write+bufio, serve 3000 clients):

886.09s of 903.74s total (98.05%)
Dropped 236 nodes (cum <= 4.52s)
Showing top 10 nodes out of 22 (cum >= 5.43s)
      flat  flat%   sum%        cum   cum%
   653.56s 72.32% 72.32%    655.35s 72.52%  syscall.Syscall
    93.78s 10.38% 82.69%     93.78s 10.38%  runtime.memmove
    85.32s  9.44% 92.13%    883.87s 97.80%  main.srs_serve
    42.42s  4.69% 96.83%    136.19s 15.07%  bufio.(*Writer).Write

The profile for server 8(writev, serve 3000 clients):

707.97s of 724.03s total (97.78%)
Dropped 230 nodes (cum <= 3.62s)
      flat  flat%   sum%        cum   cum%
   687.89s 95.01% 95.01%    689.35s 95.21%  syscall.Syscall
    19.54s  2.70% 97.71%    712.05s 98.35%  net.(*netFD).Writev

The profile for server 9(write+bufio, serve 4000 clients):

903.63s of 922.03s total (98.00%)
Dropped 251 nodes (cum <= 4.61s)
Showing top 10 nodes out of 13 (cum >= 673.46s)
      flat  flat%   sum%        cum   cum%
   671.39s 72.82% 72.82%    673.29s 73.02%  syscall.Syscall
   109.78s 11.91% 84.72%    109.78s 11.91%  runtime.memmove
    76.56s  8.30% 93.03%    186.19s 20.19%  bufio.(*Writer).Write
    44.25s  4.80% 97.83%    911.20s 98.83%  main.srs_serve

The profile for server 10(writev, serve 4000 clients):

443.05s of 452.06s total (98.01%)
Dropped 234 nodes (cum <= 2.26s)
      flat  flat%   sum%        cum   cum%
   428.19s 94.72% 94.72%    429.08s 94.92%  syscall.Syscall
    12.05s  2.67% 97.39%    443.26s 98.05%  net.(*netFD).Writev

The profile for server 11(write+bufio, serve 5000 clients):

1052.65s of 1074.42s total (97.97%)
Dropped 267 nodes (cum <= 5.37s)
Showing top 10 nodes out of 13 (cum >= 797.94s)
      flat  flat%   sum%        cum   cum%
   795.47s 74.04% 74.04%    797.67s 74.24%  syscall.Syscall
   119.77s 11.15% 85.18%    119.77s 11.15%  runtime.memmove
    82.31s  7.66% 92.85%    201.85s 18.79%  bufio.(*Writer).Write
    53.42s  4.97% 97.82%   1061.58s 98.80%  main.srs_serve

The profile for server 12(writev, serve 5000 clients):

520.34s of 530.21s total (98.14%)
Dropped 228 nodes (cum <= 2.65s)
      flat  flat%   sum%        cum   cum%
   505.73s 95.38% 95.38%    506.76s 95.58%  syscall.Syscall
    14.22s  2.68% 98.06%    523.51s 98.74%  net.(*netFD).Writev

The profile for server 13(write+bufio, serve 6000 clients):

501.03s of 511.33s total (97.99%)
Dropped 240 nodes (cum <= 2.56s)
Showing top 10 nodes out of 12 (cum >= 366.68s)
      flat  flat%   sum%        cum   cum%
   365.52s 71.48% 71.48%    366.60s 71.70%  syscall.Syscall
    69.87s 13.66% 85.15%     69.87s 13.66%  runtime.memmove
    45.21s  8.84% 93.99%    114.96s 22.48%  bufio.(*Writer).Write
    19.65s  3.84% 97.83%    506.76s 99.11%  main.srs_serve

The profile for server 14(writev, serve 6000 clients):

430.34s of 439.27s total (97.97%)
Dropped 224 nodes (cum <= 2.20s)
      flat  flat%   sum%        cum   cum%
   418.56s 95.29% 95.29%    419.27s 95.45%  syscall.Syscall
    11.43s  2.60% 97.89%    433.08s 98.59%  net.(*netFD).Writev

The profile for server 15(write+bufio, serve 7000 clients):

359.48s of 367.11s total (97.92%)
Dropped 251 nodes (cum <= 1.84s)
Showing top 10 nodes out of 12 (cum >= 267.14s)
      flat  flat%   sum%        cum   cum%
   264.35s 72.01% 72.01%    265.22s 72.25%  syscall.Syscall
    47.81s 13.02% 85.03%     47.81s 13.02%  runtime.memmove
    32.32s  8.80% 93.84%     80.04s 21.80%  bufio.(*Writer).Write
    14.45s  3.94% 97.77%    363.58s 99.04%  main.srs_serve

The profile for server 16(writev, serve 7000 clients):

594.12s of 607.03s total (97.87%)
Dropped 243 nodes (cum <= 3.04s)
      flat  flat%   sum%        cum   cum%
   578.62s 95.32% 95.32%    579.63s 95.49%  syscall.Syscall
    15.23s  2.51% 97.83%    598.22s 98.55%  net.(*netFD).Writev

The profile for server 17(write+bufio, serve 10000 clients):

459.34s of 471.92s total (97.33%)
Dropped 270 nodes (cum <= 2.36s)
Showing top 10 nodes out of 13 (cum >= 337.13s)
      flat  flat%   sum%        cum   cum%
   332.91s 70.54% 70.54%    334.19s 70.81%  syscall.Syscall
    58.92s 12.49% 83.03%     58.92s 12.49%  runtime.memmove
    38.67s  8.19% 91.22%     97.49s 20.66%  bufio.(*Writer).Write
    27.79s  5.89% 97.11%    464.45s 98.42%  main.srs_serve

The profile for server 18(writev, serve 10000 clients):

474.04s of 484.01s total (97.94%)
Dropped 256 nodes (cum <= 2.42s)
      flat  flat%   sum%        cum   cum%
   460.48s 95.14% 95.14%    461.33s 95.31%  syscall.Syscall
    12.98s  2.68% 97.82%    477.22s 98.60%  net.(*netFD).Writev

In short, write(+bufio) can serve the same bandwidth and number of clients as writev, but write(+bufio) uses far more memory than writev, growing with the number of clients.

winlinvip commented 8 years ago

@mikioh @bradfitz @ianlancetaylor Ping.

bradfitz commented 8 years ago

Thank you for the data and prototyping but we will not be looking at this until February, after Go 1.6 is out. I want to look at it more in depth but don't have the time at the moment.

winlinvip commented 8 years ago

@bradfitz Got it, thanks~

extemporalgenome commented 8 years ago

This looks like a good candidate for a specialized version of io.MultiWriter (just as http.MaxBytesReader is a specialization of io.LimitReader)

minux commented 8 years ago

How could we optimize io.MultiWriter with Writev? Writev is about writing multiple buffers to one fd, whereas io.MultiWriter is about writing the same buffer to multiple io.Writer's.
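
A tiny sketch of the contrast, with illustrative names; the gather call at the end is the proposed API, not something that exists in the net package today.

package main

import (
    "io"
    "net"
)

// fanOut is what io.MultiWriter does: one buffer, many destinations.
func fanOut(a, b net.Conn, frame []byte) error {
    _, err := io.MultiWriter(a, b).Write(frame)
    return err
}

// A gather write is the opposite shape: many buffers, one destination,
// ideally a single writev syscall. (Proposed; illustrative only.)
// func (c *net.TCPConn) Writev(bufs ...[]byte) (int, error)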

kostya-sh commented 8 years ago

FYI, Java has a separate interface to support writev (https://docs.oracle.com/javase/7/docs/api/java/nio/channels/GatheringByteChannel.html).

How about

package io

type GatheringWriter interface {
    func Write(bufs byte[][])
}

?

bradfitz commented 8 years ago

@kostya-sh, that's not even Go syntax for a number of reasons. Go and Java have very different feels, too.

Let's discuss this more in February.

kostya-sh commented 8 years ago

@bradfitz, the syntax can be corrected (and of course the method cannot be called Write because it would clash with the io.Writer interface). This is just a suggestion that you might want to consider.
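
For reference, a syntactically valid version of that suggestion might look something like this; the interface and method names are only illustrative.

package io

// GatheringWriter is the gather-write analogue of Writer. The method is
// deliberately not named Write so it does not clash with the Writer interface.
type GatheringWriter interface {
    WriteVector(bufs [][]byte) (n int, err error)
}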

winlinvip commented 8 years ago

@Terry-Mao @davecheney @ianlancetaylor @bradfitz @kostya-sh @ggaaooppeenngg @mikioh @extemporalgenome @minux It seems writev is attractive to all of us; please update this post when you discuss writev in February :smile_cat:

extemporalgenome commented 8 years ago

@minux ah. I misread it as writing one buffer to multiple fd's (and I wasn't suggesting modifying io.MultiWriter itself -- but rather adding a specialized function [which is now moot] to the net package)

spance commented 8 years ago

Also, a new API method should be added to support sending multiple WSABufs with WSASend/WSASendTo (similar to writev() and struct iovec) on Windows.

// src/net/fd_windows.go:L502 Single WSABuf sending

func (fd *netFD) Write(buf []byte) (int, error) {
    if err := fd.writeLock(); err != nil {
        return 0, err
    }
    defer fd.writeUnlock()
    if raceenabled {
        raceReleaseMerge(unsafe.Pointer(&ioSync))
    }
    o := &fd.wop
    o.InitBuf(buf)  // should init multiple WSABuf
    n, err := wsrv.ExecIO(o, "WSASend", func(o *operation) error {
        // #### here arg 2,3 should be *WSAbuf, count ###
        return syscall.WSASend(o.fd.sysfd, &o.buf, 1, &o.qty, 0, &o.o, nil) 
    })
    if _, ok := err.(syscall.Errno); ok {
        err = os.NewSyscallError("wsasend", err)
    }
    return n, err
}
mbeacom commented 8 years ago

@bradfitz ping. Would it be possible to revisit this now that Go 1.6 has been released?

Thank you!

bradfitz commented 8 years ago

@mbeacom, sorry, this also will not make Go 1.7. The 1.7 development freeze is May 1st and this isn't ready.