pojntfx / go-nbd

Pure Go NBD server and client library.
Apache License 2.0

Make optimal memory allocation #6

Closed · pPanda-beta closed this issue 1 year ago

pPanda-beta commented 1 year ago

Now that the fixes from https://github.com/pojntfx/go-nbd/issues/4#issuecomment-1686169698 are done, the next step is to support highly concurrent IO.

So far I have observed that 32 MiB is allocated for an IO request of any size, which increases memory usage unnecessarily.

https://github.com/pojntfx/go-nbd/blob/b8f5045f34d669752a8f2dc431f5cd0bab498685/pkg/server/nbd.go#L319

As an alternative, we can allocate exactly the request size (for both reads and writes).

-   b := make([]byte, maximumPacketSize)
    for {
        var requestHeader protocol.TransmissionRequestHeader
...
        case protocol.TRANSMISSION_TYPE_REQUEST_READ:
            if err := binary.Write(conn, binary.BigEndian, protocol.TransmissionReplyHeader{
                ReplyMagic: protocol.TRANSMISSION_MAGIC_REPLY,
                Error:      0,
                Handle:     requestHeader.Handle,
            }); err != nil {
                return err
            }

+           b := make([]byte, requestHeader.Length)
            n, err := export.Backend.ReadAt(b, int64(requestHeader.Offset))

I have tested this, and since Linux device-mapper devices (when used on top of an NBD device) issue very small IO requests, it reduced memory usage significantly, by around 80-90% (most IO requests do not exceed 4 MiB).

pojntfx commented 1 year ago

Hmmm, I seem to remember there was a reason why it was this way; I'll take a look at this ASAP. I'll probably first add a check that the length doesn't exceed the maximum (so that a client can't mount a DoS attack) and then use the request header!

pojntfx commented 1 year ago

So I've tried this patch, but we're getting significantly worse performance:

With the patch:

pojntfx@fedora:~/Projects/r3map$ go build -o /tmp/r3map-benchmark-direct-mount ./cmd/r3map-benchmark-direct-mount/ && sudo /tmp/r3map-benchmark-direct-mount
Open: 10.24412ms
Latency till first two chunks: 362.348µs
Read throughput: 1713.94 MB/s (13711.50 Mb/s)
Read check: Passed
Write throughput: 477.73 MB/s (3821.86 Mb/s)
Sync: 215.875989ms
Write check: Passed
Close: 35.668024ms

Before the patch:

pojntfx@fedora:~/Projects/r3map$ go build -o /tmp/r3map-benchmark-direct-mount ./cmd/r3map-benchmark-direct-mount/ && sudo /tmp/r3map-benchmark-direct-mount
Open: 14.540542ms
Latency till first two chunks: 51.69µs
Read throughput: 2356.33 MB/s (18850.61 Mb/s)
Read check: Passed
Write throughput: 502.19 MB/s (4017.50 Mb/s)
Sync: 197.373918ms
Write check: Passed
Close: 36.308795ms

I'll take a look at why that is exactly.

pojntfx commented 1 year ago

Figured it out: if we only increase the buffer size when we actually need it, we get the same or better performance: 184321ae147139c910d26c70cdfeb2d125c7d292
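
For reference, the "reuse or grow" approach can be sketched roughly like this (my reading of the description above, not the verbatim commit): keep one buffer per connection and reallocate only when a request needs more space than is currently available.

    // Reuse the per-connection buffer; grow it only when a request is
    // larger than anything seen so far (never shrink), then slice it
    // down to exactly the requested size.
    if int(requestHeader.Length) > cap(b) {
        b = make([]byte, requestHeader.Length)
    }
    b = b[:requestHeader.Length]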

pojntfx commented 1 year ago

Released in https://github.com/pojntfx/go-nbd/releases/tag/v0.3.0

pPanda-beta commented 1 year ago

@pojntfx Thanks for the patch.

Basically, neither of these solutions (what we had before vs. what we have now) solves both problems at once.

The problems:

  1. (The one I faced) N connections require N × 32 MiB of memory; with only 384 connections I had 12 GiB of RAM eaten (and locked).
  2. If we allocate "as much as required" per request, we get frequent allocation and extremely poor garbage collection (https://github.com/pojntfx/go-nbd/issues/6#issuecomment-1749484003), which impacts overall IOPS and throughput.

With the current solution, once the request size increases we grow the buffer (we never shrink it). This properly addresses problem #2, but under a memory crunch 12 GiB of resident memory is pretty expensive: in my use case there can be spikes of large read requests, which will allocate up to 12 GiB that never comes back down. Several alternatives could be applied: adaptive allocation; a pool of buffers in all powers of two (4 KiB-32 MiB), where each request picks the closest larger buffer (upper bound); or a smaller fixed buffer at the cost of a few CPU cycles, via io.SectionReader + io.CopyBuffer.
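
To illustrate the pooling variant, here is a rough sketch (hypothetical; not go-nbd code) of a power-of-two buffer pool built on sync.Pool, with size classes from 4 KiB to 32 MiB, where each request takes the smallest class that fits:

package bufpool

import (
    "math/bits"
    "sync"
)

const (
    minClass = 12 // 2^12 = 4 KiB
    maxClass = 25 // 2^25 = 32 MiB
)

// One sync.Pool per size class, so idle buffers can eventually be
// reclaimed by the garbage collector instead of staying resident.
var classes [maxClass - minClass + 1]sync.Pool

func init() {
    for i := range classes {
        size := 1 << (minClass + i)
        classes[i].New = func() any { return make([]byte, size) }
    }
}

// Get returns a buffer of at least n bytes, sliced down to exactly n.
func Get(n int) []byte {
    if n <= 0 {
        return nil
    }
    c := bits.Len(uint(n - 1)) // exponent of the smallest power of two >= n
    if c < minClass {
        c = minClass
    }
    if c > maxClass {
        return make([]byte, n) // larger than the biggest class
    }
    return classes[c-minClass].Get().([]byte)[:n]
}

// Put returns a buffer to its size class for reuse; buffers that do not
// belong to a class are simply dropped for the GC to collect.
func Put(b []byte) {
    c := bits.Len(uint(cap(b) - 1))
    if c >= minClass && c <= maxClass && cap(b) == 1<<c {
        classes[c-minClass].Put(b[:cap(b)])
    }
}

(A production version would likely pool *[]byte to avoid the interface-conversion allocation that staticcheck flags as SA6002, but this shows the shape of the idea.)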

At this point I would request making the strategy configurable so that users can choose what suits them best. If performance is the priority, one can choose the grow-and-hold strategy; if low memory overhead matters more, one can choose an allocate-and-release strategy.

It could also be done via an interface:

type transfer interface {
    CopyN(backend io.ReaderAt, offset int64, size int, socket io.Writer) (int, error)
}
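
For example, a bufferless implementation of this hypothetical interface could stream through a small, fixed scratch buffer with io.SectionReader and io.CopyBuffer:

import "io"

// bufferedTransfer streams a section of the backend to the socket
// through a small, reusable scratch buffer instead of allocating a
// request-sized one.
type bufferedTransfer struct {
    scratch []byte // e.g. 64 KiB, allocated once per connection
}

func (t *bufferedTransfer) CopyN(backend io.ReaderAt, offset int64, size int, socket io.Writer) (int, error) {
    section := io.NewSectionReader(backend, offset, int64(size))
    n, err := io.CopyBuffer(socket, section, t.scratch)
    return int(n), err
}

(Note that io.CopyBuffer falls back to the destination's ReadFrom when available, e.g. on a *net.TCPConn, in which case the scratch buffer is bypassed.)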

Or by introducing a new Backend interface with bufferless functions:

type ReaderTo interface {
    ReadNTo(offset int64, size int, socket io.Writer) (int, error)
}

type WriterFrom interface {
    WriteNFrom(socket io.Reader, writeAtOffset int64, size int) (int, error)
}

type BackendV2 interface {
    ReaderTo
    WriterFrom

    Size() (int64, error)
    Sync() error
}
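
As an illustration, this hypothetical BackendV2 could be implemented over an *os.File without any request-sized buffers (a sketch; io.NewOffsetWriter requires Go 1.20+):

import (
    "io"
    "os"
)

// fileBackendV2 implements the proposed BackendV2 on top of an
// *os.File, streaming directly between the file and the socket.
type fileBackendV2 struct {
    f *os.File
}

// ReadNTo streams size bytes starting at offset directly to the socket.
func (b *fileBackendV2) ReadNTo(offset int64, size int, socket io.Writer) (int, error) {
    n, err := io.Copy(socket, io.NewSectionReader(b.f, offset, int64(size)))
    return int(n), err
}

// WriteNFrom streams size bytes from the socket into the file at
// writeAtOffset.
func (b *fileBackendV2) WriteNFrom(socket io.Reader, writeAtOffset int64, size int) (int, error) {
    n, err := io.Copy(io.NewOffsetWriter(b.f, writeAtOffset), io.LimitReader(socket, int64(size)))
    return int(n), err
}

func (b *fileBackendV2) Size() (int64, error) {
    info, err := b.f.Stat()
    if err != nil {
        return 0, err
    }
    return info.Size(), nil
}

func (b *fileBackendV2) Sync() error {
    return b.f.Sync()
}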
pojntfx commented 1 year ago

Hmm, this sounds like a good idea. I'd personally argue against changing the underlying interface (moving away from the typical io.ReaderAt interface would require a lot of changes in downstreams), but adding an option to choose between the two strategies (in the Options struct) sounds like a good idea. What do you think?
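
A hypothetical shape for such an option (the names below are illustrative, not actual go-nbd API):

// BufferStrategy selects how the per-connection transmission buffer
// is managed.
type BufferStrategy int

const (
    // BufferStrategyGrow keeps one buffer per connection and only
    // grows it, favoring throughput over resident memory.
    BufferStrategyGrow BufferStrategy = iota
    // BufferStrategyExact allocates per request and lets the buffer
    // be collected, favoring low memory overhead.
    BufferStrategyExact
)

type Options struct {
    // ... existing options ...
    BufferStrategy BufferStrategy
}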

pojntfx commented 1 year ago

Another option could also be to do this:

    b := []byte{}
    for {
        var requestHeader protocol.TransmissionRequestHeader
        if err := binary.Read(conn, binary.BigEndian, &requestHeader); err != nil {
            return err
        }

        if requestHeader.RequestMagic != protocol.TRANSMISSION_MAGIC_REQUEST {
            return ErrInvalidMagic
        }

        length := requestHeader.Length
        if length > defaultMaximumRequestSize {
            return ErrInvalidBlocksize
        }

        if length != uint32(len(b)) {
            b = make([]byte, length)
        }

This way we only reallocate the buffer when the length we require differs from the one we currently have. In my local r3map benchmark this leads to the same performance as the "reuse or grow" strategy, and it should help fix your problem.

pojntfx commented 1 year ago

See: https://github.com/pojntfx/go-nbd/releases/tag/v0.3.2