msantos / epcap

Erlang packet capture interface using pcap
http://listincomprehension.com/2009/12/erlang-packet-sniffer-using-ei-and.html
BSD 3-Clause "New" or "Revised" License

Packet loss under high load conditions #18

Closed: josemic closed this issue 8 years ago

josemic commented 10 years ago

Finally I got the stream parser of josemic/enose working (under Ubuntu). It tracks the TCP sequence numbers against the acknowledgement numbers of TCP connections. (The code will be on GitHub soon.)

It uses Epcap to capture the packets. Under normal load conditions of around 80% (according to the Erlang observer) there is no packet loss, which is great.

Under high load conditions, when the load reaches 100% for a short time, the messages get lost instead of being queued by the Erlang VM. The packets are still dumped by tcpdump; Epcap/Erlang, however, does not seem to receive them. The result is that enose loses sequence and indicates failures. Logs have shown that about 26 consecutive packets were lost by Epcap.

Note: no packet loss occurs if Epcap/enose is run from a pcap file, even if the load stays at 100% for a long time.

There are multiple possible solutions to this problem.

ates commented 10 years ago

What packet rate (pps) was processed by enose without packet loss?

josemic commented 10 years ago

With my 7-year-old AMD dual-core PC I achieved a 7.7 MByte/s download rate. That is 5316 data packets/s with 1448 bytes of payload each (5316 × 1448 bytes ≈ 7.7 MByte/s), plus 2658 ACK packets/s in the reverse direction, thus 7974 packets/s in total.

msantos commented 10 years ago

Difficult to debug this without more details. My guess is that libpcap is dropping the packets. epcap reads from the network and writes to stdout, blocking until the erlang side reads it. If the port becomes too busy, the VM will suspend the port/processes.

In the meantime, more packets are arriving from the network. Eventually some buffer fills up, either in pcap or in the kernel, and new events are discarded.

You can track how busy the erlang side is by using erlang:port_info/1.
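For example, this minimal sketch simply walks all open ports (assuming the epcap port is among them) and shows how many bytes are waiting in each port's driver queue:

%% List the driver queue size of every open port; a steadily growing
%% queue_size for the epcap port means the Erlang side is falling behind.
[{P, erlang:port_info(P, queue_size)} || P <- erlang:ports()].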

If my suspicions are correct, the simplest way to solve it is to either run more epcap ports or run more nodes over distribution. We could add a buffer to queue packets, but if the erlang side is penalizing the port, we will be back to dropping packets, or running out of memory and crashing.
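As a hedged illustration of running more epcap ports, the traffic could be split statically between two captures, assuming the sniff example module passes a {filter, ...} option through to epcap (the filter strings here are only an example):

%% Hypothetical static split across two epcap ports by TCP port range.
sniff:start([{interface, "eth0"}, {filter, "tcp and portrange 0-32767"}]).
sniff:start([{interface, "eth0"}, {filter, "tcp and portrange 32768-65535"}]).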

josemic commented 10 years ago

Actually, using multiple ports is what Suricata's PF_RING implementation is doing. Here the underlying layer makes sure that the packets of one IP address pair are always routed to the same port. The difficult part should be managing that without memory allocation/deallocation in the driver. Thus you could even run one node per CPU core without having to route messages between the CPU cores. This should improve multi-core performance, but it will not help if all the cores are loaded to 100%.

msantos commented 10 years ago

While it should be possible to divide up the rules amongst epcap ports so IP addresses are constant, I guess you mean moving IP addresses between running epcap instances? Like adjusting the pcap expression on the fly? I can't think of a way to do that that would not drop packets.

Since epcap is a separate Unix process, you will always have context switches. And since erlang is soft real time, epcap will be suspended if the system is being swamped. I'm sure there is room for improvement though:

https://github.com/msantos/ewpcap

https://github.com/msantos/procket

ates commented 10 years ago

JFYI: I got some stats from my production system during low network load. Our software is using epcap with PF_RING, and the packet rate is > 20k pps now.

msantos commented 10 years ago

@ates wow, that's impressive! thanks for sharing that!

josemic commented 10 years ago

> While it should be possible to divide up the rules amongst epcap ports so IP addresses are constant, I guess you mean moving IP addresses between running epcap instances? Like adjusting the pcap expression on the fly? I can't think of a way to do that that would not drop packets.

Let's assume you are running N epcap instances. The easiest way to map IP address and port combinations into N different buckets n should be something like (simplified):

n = (SourceIPAddress + SourcePort + DestinationIPAddress + DestinationPort) mod N

While the epcap instance for bucket n is not connected, discard the packet; otherwise forward it to epcap instance n. When all N epcap instances are active, all packets should be captured. The idea is that when you have, e.g., M cores, you use an N that is a divisor of M.
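A minimal sketch of such a dispatcher in Erlang (hypothetical code: it assumes the addresses have already been converted to integers and that the handler process for each bucket is held in a tuple, with undefined marking a bucket that is not connected):

%% Hypothetical dispatcher: hash the flow 4-tuple to a bucket and
%% forward the packet to that bucket's handler process, if any.
%% The sum is symmetric in source and destination, so both directions
%% of a connection always land in the same bucket.
dispatch(SrcIp, SrcPort, DstIp, DstPort, Packet, Instances) ->
    N = tuple_size(Instances),
    Bucket = (SrcIp + SrcPort + DstIp + DstPort) rem N,
    case element(Bucket + 1, Instances) of
        undefined ->
            drop;                        % bucket not connected: discard
        Pid when is_pid(Pid) ->
            Pid ! {packet, Packet}       % forward to instance n
    end.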

ates commented 9 years ago

Just for information, here are some new statistics about the number of packets that our system is processing per second:

[image: epcap packets-per-second statistics]

msantos commented 9 years ago

Very nice! I think that image will make a good addition to the README :)

ates commented 9 years ago

No objections from my side.

ates commented 8 years ago

I think it's ok to close this issue. Michael, do you have objections?

msantos commented 8 years ago

Agreed, thanks @ates !

josemic commented 8 years ago

I think that closing is fine, unless someone runs into performance problems. Still, I'd like to draw a conclusion here.

> If my suspicions are correct, the simplest way to solve it is to either run more epcap ports or run more nodes over distribution.

Assume you want to filter all packets passing on eth0:

sniff:start([{interface, "eth0"}]).

Assume the throughput is not sufficient: there is currently no way to move the first half of the packets to Erlang node 1 and the second half to Erlang node 2. The current epcap configuration options do not allow this in a generic way.

> While it should be possible to divide up the rules amongst epcap ports so IP addresses are constant,

I would even go further and assume that the quadruple of SourceIPAddress, SourcePort, DestinationIPAddress, DestinationPort is constant, as this provides a better distribution of the traffic to the Erlang nodes and remains constant for the lifetime of any given connection.

> I guess you mean moving IP addresses between running epcap instances?

No, I was thinking of a static configuration.

> Like adjusting the pcap expression on the fly? I can't think of a way to do that that would not drop packets.

Moving them would usually break the application layer (the application using epcap), as a connection set up on one node would no longer be torn down on the same node. Thus dynamic rearrangement might not be useful, or only useful for newly established connections.

Let's assume that, on an N-core system, you want to run N Erlang nodes and distribute the incoming traffic evenly based on, e.g., the formula:

n = ((SourceIPAddress mod N) + (SourcePort mod N) + (DestinationIPAddress mod N) + (DestinationPort mod N)) mod N

where N is the number of Erlang instances to be used, n is the selected Erlang instance, and IpAddress mod N is defined for IPv4 and IPv6 addresses as follows:

Ip_address = Ip4_address or Ip6_address

Ip4_address
  = {0..255, 0..255, 0..255, 0..255}
  = {Addr4_3, Addr4_2, Addr4_1, Addr4_0}

Ip4_address mod N
  = {Addr4_3, Addr4_2, Addr4_1, Addr4_0} mod N
  = ((Addr4_3 mod N) + (Addr4_2 mod N) + (Addr4_1 mod N) + (Addr4_0 mod N)) mod N

Ip6_address
  = {0..65535, 0..65535, 0..65535, 0..65535, 0..65535, 0..65535, 0..65535, 0..65535}
  = {Addr6_7, Addr6_6, Addr6_5, Addr6_4, Addr6_3, Addr6_2, Addr6_1, Addr6_0}

Ip6_address mod N
  = {Addr6_7, Addr6_6, Addr6_5, Addr6_4, Addr6_3, Addr6_2, Addr6_1, Addr6_0} mod N
  = ((Addr6_7 mod N) + (Addr6_6 mod N) + (Addr6_5 mod N) + (Addr6_4 mod N)
     + (Addr6_3 mod N) + (Addr6_2 mod N) + (Addr6_1 mod N) + (Addr6_0 mod N)) mod N
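In Erlang this selection could be sketched as below (hypothetical instance/5 and addr_sum/1 helpers; addresses are assumed to be inet-style tuples, and summing all components before the final rem N is equivalent to the per-component formula above, since (A + B) mod N = ((A mod N) + (B mod N)) mod N):

%% Hypothetical sketch: select the Erlang instance n for a flow.
%% addr_sum/1 works for IPv4 (4-tuple of 0..255) and IPv6
%% (8-tuple of 0..65535) addresses alike.
addr_sum(Addr) when is_tuple(Addr) ->
    lists:sum(tuple_to_list(Addr)).

instance(SrcIp, SrcPort, DstIp, DstPort, N) ->
    (addr_sum(SrcIp) + SrcPort + addr_sum(DstIp) + DstPort) rem N.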

Something like the following could be used to start N epcap instances on N nodes:

erl -name 'epcapnode0'
sniff:start([{interface, "eth0"}, {instance, 0, N}]). 

erl -name 'epcapnode1'
sniff:start([{interface, "eth0"}, {instance, 1, N}]).
...

erl -name 'epcapnodeN-1'
sniff:start([{interface, "eth0"}, {instance, N-1, N}]).

and have each Erlang node n filter out its own n-th fraction of the incoming traffic using the formula above.
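A hedged sketch of what that per-node filtering could look like, using the hypothetical instance/5 function from above, a hypothetical decode/1 that extracts the flow 4-tuple from the raw frame, a hypothetical process/1 consumer, and epcap's {packet, DataLinkType, Time, Length, Packet} messages:

%% Hypothetical per-node filter loop: node n processes only the flows
%% that hash to its own instance number and ignores the rest.
loop(Instance, N) ->
    receive
        {packet, _DataLinkType, _Time, _Length, Packet} ->
            {SrcIp, SrcPort, DstIp, DstPort} = decode(Packet),  % hypothetical decoder
            case instance(SrcIp, SrcPort, DstIp, DstPort, N) of
                Instance -> process(Packet);  % this node owns the flow
                _ -> ok                       % another node handles it
            end,
            loop(Instance, N)
    end.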