google / gopacket

Provides packet processing capabilities for Go
BSD 3-Clause "New" or "Revised" License
6.3k stars 1.13k forks

Performance or Memory Issues with Gopacket #1106

Open huhuegg opened 1 year ago

huhuegg commented 1 year ago

Environment and Phenomenon Description: I am using Gopacket for data analysis of TCP sessions. The data is received from a traffic mirror port on a switch, and other servers perform pre-processing and forward only a portion of the TCP traffic to the server where Gopacket is installed. The TCP portion of the data packets is at a rate of 250,000 to 550,000 packets per second, with a data rate of 0.7-2Gbps. The memory usage of the program continues to increase, leading to increased GC pressure and severely affecting the efficiency of the program.

The following optimizations have been made to the program:

  1. Use DPDK to improve packet capture efficiency and ensure that incoming data packets are not lost.
  2. Assign multiple Gopacket instances based on the hash of the data packet to avoid performance issues with channels.
  3. Set a timer to call FlushCloseOlderThan(time.Now().Add(time.Minute * -1)) every minute.
  4. Use sync.Pool to manage gopacket.Packet objects.
  5. Use DecodeFromBytes to parse only Ethernet, IPv4, and TCP data, and manage them using sync.Pool.
  6. Use cgo to manage request and response data caches for TCP sessions to avoid GC.

Problem Analysis: After monitoring the program continuously for 12 hours using Pyroscope, some problems were discovered by comparing the inuse_object samples. Many Stream objects were not released and remained in use throughout the period:

  1. The New method of gopacket's StreamFactory creates a large number of reassembly.Stream objects that remain in use; after 12 hours, their inuse_object count reaches 56 million.
  2. reassembly.getConnect also accounts for a large number of objects; after 12 hours, its inuse_object count reaches 57.9 million.
huhuegg commented 1 year ago

I found that the grow method of the StreamPool is called very frequently. From the source code, I can see that the conns and free variables are initialized with a constant initialAllocSize=1024 when the StreamPool is created in NewStreamPool. This value is not suitable for high-connection-count scenarios and causes frequent resizing of the pool. I suggest making this value a configurable parameter.

However, adjusting this value only reduces the number of grow operations and may slightly improve performance; it does not solve my problem.

feiyangbeyond commented 1 year ago

Use DPDK to improve packet capture efficiency and ensure that incoming data packets are not lost.

How did you do it? I encountered similar problems in a similar scenario.

huhuegg commented 1 year ago

When using gopacket reassembly under high concurrency, a large number of objects and connections are created, which triggers GC more frequently. During stop-the-world (STW) phases the program stops receiving packets, and switching to manual GC does not eliminate the STW pauses. I recommend that reassembly support manual memory management via arena to avoid GC pressure as much as possible, so that gopacket can meet the needs of production workloads.

huhuegg commented 1 year ago

The following chart shows the impact of GC on the program. The pps/bps of received packets drops significantly during GC STW pauses. In the TCP chart, it is clear that gopacket reassembly's tcp_session_created count drops precipitously, and the corresponding HTTP request/response analysis figures fall along with the number of sessions created.

So it is very difficult for me to use gopacket reassembly for even moderately large traffic volumes in the production environment.

Any good suggestions?

huhuegg commented 1 year ago

In a workload used only for traffic-policy forwarding, where gopacket reassembly is not involved, GC has little impact on performance: the program can process more than 10Gbps of traffic at packet rates up to 10 million pps. As the figure below shows, because both heap and stack stay very small, GC has no effect on the program; this problem cannot be avoided, however, once gopacket reassembly is used. gopacket is already the best tool in Go for reassembling TCP sessions for analysis, and I hope it keeps improving.

huhuegg commented 1 year ago

Use DPDK to improve packet capture efficiency and ensure that incoming data packets are not lost.

How did you do it? I encountered similar problems in a similar scenario.

try nff-go

huhuegg commented 1 year ago

reassembly optimization suggestions

Currently, both converting network packets into gopacket.Packet and analyzing/reassembling them allocate a large number of objects, and GC runs very frequently under heavy traffic.

In practice, using arena in place of PacketPool.NewPacket and PacketPool.ReturnPackToPool gave a noticeable performance improvement.

To avoid this performance bottleneck, I recommend evaluating the use of arena inside gopacket/reassembly.

evanzhang87 commented 1 year ago

Hello, I have a similar problem, mainly with memory usage. I use the code from examples/httpassembly and tried to parse the HTTP response; it takes a lot of memory when the response is big, so I tried to use arena like this:

func (h *httpStream) run() {
    buf := bufio.NewReader(&h.r)
    defer h.close()
    mem := arena.NewArena()
    defer mem.Free()
    var err error
    for {
        if h.isClient {
            // Note: this arena allocation is discarded on the next line,
            // because http.ReadRequest returns its own heap-allocated
            // *Request, so the arena never actually holds the request data.
            req := arena.New[http.Request](mem)
            req, err = http.ReadRequest(buf)
            if err == io.EOF || err == io.ErrUnexpectedEOF {
                // We must read until we see an EOF... very important!
                return
            } else if err != nil {
                log.Println("Error reading stream", h.net, h.transport, ":", err)
                return
            } else {
                _ = tcpreader.DiscardBytesToEOF(req.Body)
                req.Body.Close()
                //log.Println("Received request from stream", h.net, h.transport, ":", req, "with", bodyBytes, "bytes in request body")
            }
        } else {
            // Same issue as above: http.ReadResponse allocates its own
            // *Response, so this arena allocation goes unused.
            res := arena.New[http.Response](mem)
            res, err = http.ReadResponse(buf, nil)
            if err == io.EOF || err == io.ErrUnexpectedEOF {
                // We must read until we see an EOF... very important!
                return
            } else if err != nil {
                log.Println("Error reading stream", h.net, h.transport, ":", err)
                return
            } else {
                _ = tcpreader.DiscardBytesToEOF(res.Body)
                res.Body.Close()
                //log.Println("Received request from stream", h.net, h.transport, ":", req, "with", bodyBytes, "bytes in request body")
            }
        }
    }
}

But it used more memory than before.

// The original code without arena
func (h *httpStream) run() {
    buf := bufio.NewReader(&h.r)
    defer h.close()
    for {
        if h.isClient {
            req, err := http.ReadRequest(buf)
            if err == io.EOF || err == io.ErrUnexpectedEOF {
                // We must read until we see an EOF... very important!
                return
            } else if err != nil {
                log.Println("Error reading stream", h.net, h.transport, ":", err)
                return
            } else {
                _ = tcpreader.DiscardBytesToEOF(req.Body)
                req.Body.Close()
                //log.Println("Received request from stream", h.net, h.transport, ":", req, "with", bodyBytes, "bytes in request body")
            }
        } else {
            res, err := http.ReadResponse(buf, nil)
            if err == io.EOF || err == io.ErrUnexpectedEOF {
                // We must read until we see an EOF... very important!
                return
            } else if err != nil {
                log.Println("Error reading stream", h.net, h.transport, ":", err)
                return
            } else {
                _ = tcpreader.DiscardBytesToEOF(res.Body)
                res.Body.Close()
                //log.Println("Received request from stream", h.net, h.transport, ":", req, "with", bodyBytes, "bytes in request body")
            }
        }
    }
}

huhuegg commented 1 year ago

My understanding is that arena does not reduce memory usage; rather, it alleviates the performance impact of GC and GC STW pauses on the program.

huhuegg commented 1 year ago

Through pprof, I can see that connection objects are created very frequently. When there are a large number of connections, StreamPool grows frequently, with significant overhead. Modifying reassembly/memory.go to manage connection objects with arena should effectively alleviate the GC problem. However, when I tried it, I found the release timing difficult to get right: conn is accessed in many places in tcpassembly.go, and an "accessed data from freed user arena" error occurs.

huhuegg commented 1 year ago

I tried using sync.Pool to manage connection objects and removed the free array that reuses them. There was not much difference in initial performance, but after running for a long time, sync.Pool consumed very high CPU whenever GC was triggered, and the CPU time was relatively long, so the result was not ideal.