josemic closed this issue 10 years ago
That's quite interesting. If you look at the last column (µs per call) for the call that takes the most time (the binary list comprehension), it is much faster than, for example, parser_combinator_bitstring:untilN/4 (5.36 vs 266.40 µs), as I'd expect. checksum/1 is just being called a lot (twice per packet?).
I haven't tried any optimization, so there is room for improvement. But since pkt_ipv4:codec/1 and pkt:tcp/1 have roughly the same per-call times as checksum/1, I'm not sure how much improvement is possible. Even BIFs like erlang:'++' and lists:keyfind/3 are around 2.5 µs.
So maybe using a NIF will halve the time.
Does it make sense to checksum every packet? The kernel will drop packets with failed checksums, so the application won't see them.
1) The incoming packets are passed from epcap to defrag. (defrag would be better called "Stream" to stay with Suricata's vocabulary; I will rename it soon.) "Stream" checks the checksum of every incoming packet and drops those with invalid checksums. It then forwards each packet to the process corresponding to that connection, which tracks the SYNs, FINs and sequence numbers from both sides, drops packets that arrive out of sequence, buffers out-of-order packets, and reassembles their data as soon as it is acknowledged by the peer side. Acked data is buffered in pieces of 1500 bytes and then forwarded to the content server for pattern matching on strings within the 1500-byte buffer.
2) The 266.40 µs applies to gen_fsm:send_event/2, not to parser_combinator_bitstring:untilN/4.
3) pkt:makesum is called 6500 times. makesum seems to call checksum twice internally in pkt, as checksum is called 13000 times. Assuming that 1/2 to 1/3 of the packets are ACKs, roughly 3250 to 4333 packets of 1440 bytes were received in the download direction and are forwarded by gen_fsm:send_event/2 as 1500-byte packets to the content server.
4) Yes, it makes sense to checksum every packet. Tracking sequence numbers as done by "Stream" does not work when broken packets are around. But "Stream" is only necessary if the data is intentionally sent in small pieces, made so small that pattern matching does not work, or mixed with broken packets meant to confuse the IDS.
5) I am not yet sure whether a checksum NIF should be called for every packet, or whether the check should be integrated into epcap so that all packets with broken checksums are dropped there.
6) Any idea whether http://erllvm.softlab.ntua.gr/ would bring any improvements? Are the NIFs compatible?
Note:
Sorry about the mistake in reading the chart.
The usual strategies apply: minimize the number of checksum operations, reduce the amount of state, and optimize.
If you are using hardware offloading, your process will not need to handle the checksums.
If you want to handle the checksums yourself, maybe there is a pcap filter expression that can reduce the number of packets passed to you.
You can experiment with timings for different checksum implementations. Patches welcome!
No idea about the LLVM port. You can try compiling with HiPE.
1) A first source of the performance leak was identified: I was running the observer in the same Erlang shell as enose. See here for how to run the observer correctly: http://www.erlang.org/doc/apps/observer/observer_ug.html This improves the performance significantly, from almost 100% load on 2 processors to 30% load on 2 processors.
2) Using HiPE: HiPE did not bring any performance improvements.
3) Performance improvements to the checksum calculation: see www.erlang.se/workshop/2005/tcpip_presentation.pdf
1) Here is a proposal based on the second algorithm at http://locklessinc.com/articles/tcp_checksum/:
checksum(Bin) ->
    checksum(Bin, 0).

%% Sum the input as big-endian 64-bit words, widest match first.
checksum(<<N1:64/integer, N2:64/integer, N3:64/integer, N4:64/integer,
           N5:64/integer, N6:64/integer, N7:64/integer, N8:64/integer,
           RemainderBin/binary>>, Checksum128Bit) ->
    checksum(RemainderBin, N1+N2+N3+N4+N5+N6+N7+N8+Checksum128Bit);
checksum(<<N1:64/integer, N2:64/integer, N3:64/integer, N4:64/integer,
           RemainderBin/binary>>, Checksum128Bit) ->
    checksum(RemainderBin, N1+N2+N3+N4+Checksum128Bit);
checksum(<<N1:64/integer, N2:64/integer, RemainderBin/binary>>, Checksum128Bit) ->
    checksum(RemainderBin, N1+N2+Checksum128Bit);
checksum(<<N:64/integer, RemainderBin/binary>>, Checksum128Bit) ->
    checksum(RemainderBin, N+Checksum128Bit);
checksum(<<N:16/integer, RemainderBin/binary>>, Checksum128Bit) ->
    checksum(RemainderBin, N+Checksum128Bit);
%% A final odd byte is the high-order byte of its 16-bit word.
checksum(<<N:8/integer>>, Checksum128Bit) ->
    checksum(<<>>, (N bsl 8)+Checksum128Bit);
checksum(<<>>, Checksum128Bit) ->
    %% Fold the wide accumulator back down to 16 bits.
    Checksum64Bit = foldWithOverflow64(Checksum128Bit),
    Checksum32Bit = foldWithOverflow32(Checksum64Bit),
    foldWithOverflow16(Checksum32Bit).

foldWithOverflow64(A) ->
    C = A band 16#FFFFFFFFFFFFFFFF,
    D = (A bsr 64) band 16#FFFFFFFFFFFFFFFF,
    E = (C + D) band 16#FFFFFFFFFFFFFFFF,
    case E < D of
        true -> E + 1; % carry wrapped around
        false -> E
    end.

foldWithOverflow32(A) ->
    C = A band 16#FFFFFFFF,
    D = (A bsr 32) band 16#FFFFFFFF,
    E = (C + D) band 16#FFFFFFFF,
    case E < D of
        true -> E + 1; % carry wrapped around
        false -> E
    end.

foldWithOverflow16(A) ->
    C = A band 16#FFFF,
    D = (A bsr 16) band 16#FFFF,
    E = (C + D) band 16#FFFF,
    case E < D of
        true -> E + 1; % carry wrapped around
        false -> E
    end.

makesum(Hdr) ->
    (checksum(Hdr) bxor 16#FFFF) band 16#FFFF. % ones' complement of the sum
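For cross-checking the folding logic above, here is a minimal one-pass reference implementation of the plain Internet checksum (RFC 1071), written in Python purely for illustration; it is not part of pkt, and the test vector in the comment is hand-computed, not taken from the thread.

```python
def internet_checksum(data: bytes) -> int:
    """One-pass RFC 1071 Internet checksum over big-endian 16-bit words."""
    if len(data) % 2:            # odd length: pad on the right with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:           # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return (~total) & 0xFFFF     # ones' complement of the ones' complement sum
```

The Erlang version performs the same folding, just starting from a wider accumulator so it can consume up to eight 64-bit words per iteration instead of one 16-bit word.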
Here is the profiling:
....
****** Process <0.64.0> -- 42.05 % of profiled time ***
FUNCTION CALLS % TIME [uS / CALLS]
-------- ----- --- ---- [----------]
pkt_tcp:'-options/2-blc$^0/1-0-'/1 7059 0.16 41432 [ 5.87]
pkt_ipproto:codec/1 53518 0.76 196379 [ 3.67]
gen_server:decode_msg/8 53518 0.81 209445 [ 3.91]
gen_server:handle_msg/5 53518 0.82 212114 [ 3.96]
defrag_server:handle_info/2 53518 0.84 216585 [ 4.05]
pkt:decapsulate/1 53518 0.85 219462 [ 4.10]
pkt_ether:type/1 53518 0.86 221997 [ 4.15]
pkt:ipproto/1 53518 0.90 231042 [ 4.32]
pkt_dlt:codec/1 53518 0.92 235651 [ 4.40]
gen_server:handle_common_reply/6 53517 0.92 237503 [ 4.44]
pkt:ether_type/1 53518 1.01 258806 [ 4.84]
pkt_ether:codec/1 53518 1.01 260973 [ 4.88]
gen_server:loop/6 53517 1.15 295331 [ 5.52]
pkt:ether/1 53518 1.49 382783 [ 7.15]
defrag_worker:send_packet/2 53518 1.65 424154 [ 7.93]
gen_server:dispatch/3 53518 1.67 430923 [ 8.05]
defrag_server:get_connection_worker_pid_by_address_tuple/2 53518 1.70 438353 [ 8.19]
lists:reverse/2 107036 1.76 453021 [ 4.23]
pkt:foldWithOverflow16/1 107036 1.76 454072 [ 4.24]
epcap_port_lib:decode/3 53518 1.78 458708 [ 8.57]
pkt:dlt/1 53518 1.89 487963 [ 9.12]
pkt_tcp:options/1 53518 1.91 491872 [ 9.19]
lists:reverse/1 107036 1.93 497500 [ 4.65]
defrag_server:payloadsize/2 53518 2.38 612710 [ 11.45]
pkt_tcp:codec/1 107036 2.56 659537 [ 6.16]
pkt_ipv4:codec/1 107036 2.88 741574 [ 6.93]
pkt:ipv4/1 107036 3.30 848647 [ 7.93]
pkt:foldWithOverflow32/1 107036 3.30 848749 [ 7.93]
pkt:next/1 107036 3.33 857463 [ 8.01]
pkt:decapsulate_next/2 214072 3.39 873092 [ 4.08]
pkt:tcp/1 107036 3.48 895183 [ 8.36]
pkt:foldWithOverflow64/1 107036 3.59 925494 [ 8.65]
pkt_tcp:options/2 219841 3.68 947388 [ 4.31]
pkt:makesum/1 107036 3.77 970288 [ 9.07]
gen_fsm:send_event/2 53518 4.00 1030410 [ 19.25]
pkt:checksum/1 214072 5.50 1416203 [ 6.62]
pkt:checksum/2 1248045 26.28 6767724 [ 5.42]
Not only is pkt:checksum/2 a little faster per call, it is also called up to 16 times (on average 6-7 times) less often per received packet.
As this algorithm is much more complicated, should we add it under another name, e.g. checksum_fast, or should the existing one be replaced? I will then create a branch and a pull request.
2) The average load goes down from 60% on one processor to 50% on one processor.
It is unclear whether the following rebar.config options in msantos/pkt have any performance impact. The observer graphs do not show any significant difference when these are deactivated, but it seems likely that at least enabling coverage analysis costs some performance.
pkt/rebar.config
{xref_checks, [undefined_function_calls]}.
{cover_enabled, true}.
{eunit_opts, [verbose, {report, {eunit_surefire, [{dir, "."}]}}]}.
As far as I know, the {cover_enabled, true} option only affects eunit. It creates the coverage report in HTML format after a "rebar eunit" run.
@josemic your proposal looks great! That is quite an improvement. I think you should replace the checksum function with your version.
1) I have updated the checksum function above to the latest version. On average the packet checksum should now be calculated in 5.42 µs * 23 calls ≈ 125 µs, compared with 1354 µs before.
2) I have updated to the latest version of msantos/pkt. I have the feeling that due to this update the scheduler utilization has gone up from 50% to 60%. Any idea whether there were performance-relevant changes?
3) I have added some test cases. When using binary values, I get different checksums for odd numbers of bytes. I assume the old checksum algorithm does not correctly calculate checksums when fed binary values of odd length directly, instead of records. With the new checksum algorithm, we can remove the padding of the #ipv4{} and #ipv6{} records.
4) I created a pull request
% UDP pseudoheader checksum
checksum([#ipv4{saddr = {SA1,SA2,SA3,SA4},
                daddr = {DA1,DA2,DA3,DA4}},
          #udp{ulen = Len} = Hdr,
          Payload]) ->
    UDP = udp(Hdr),
    % Pad odd-length data with a trailing zero byte
    Pad = case Len rem 2 of
        0 -> 0;
        1 -> 8
    end,
    checksum(<<SA1,SA2,SA3,SA4,
               DA1,DA2,DA3,DA4,
               0:8,
               ?IPPROTO_UDP:8,
               Len:16,
               UDP/binary,
               Payload/bits,
               0:Pad>>);
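For reference, the pseudoheader layout that clause builds can be sketched as follows, in Python purely for illustration (field order per RFC 768; the function name is hypothetical and `udp_segment` stands in for `UDP/binary` plus `Payload`, not pkt's actual API):

```python
import struct

IPPROTO_UDP = 17

def udp_pseudoheader(saddr, daddr, ulen, udp_segment: bytes) -> bytes:
    """Build the IPv4/UDP pseudoheader plus segment, zero-padded to even length.

    saddr/daddr are 4-tuples of octets; ulen is the UDP length field
    (header + payload), i.e. what the Erlang clause binds to Len.
    """
    # saddr (4 bytes), daddr (4 bytes), zero byte, protocol, UDP length
    pseudo = struct.pack("!4B4BBBH", *saddr, *daddr, 0, IPPROTO_UDP, ulen)
    data = pseudo + udp_segment
    if len(data) % 2:            # same job as Pad = 8 in the Erlang clause
        data += b"\x00"
    return data
```

The resulting even-length binary is what gets fed to the 16-bit-word checksum.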
I could not test this one, but this is how I understand RFC 793:
checksum(<<N:8/integer>>, Checksum128Bit) ->
    checksum(<<>>, (N bsl 8)+Checksum128Bit);
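That matches my reading too: per RFC 793/1071, a final odd octet is padded on the right with zeros, i.e. it is treated as the high-order byte of its 16-bit word. A quick check of that equivalence, in Python purely for illustration (the helper name is hypothetical):

```python
def ones_complement_sum(data: bytes) -> int:
    """Fold big-endian 16-bit words into a 16-bit ones' complement sum."""
    total = 0
    for i in range(0, len(data) - 1, 2):
        total += (data[i] << 8) | data[i + 1]
    if len(data) % 2:
        total += data[-1] << 8   # final odd byte is the high-order byte
    while total >> 16:           # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total
```

Summing b"\x12\x34\x56" this way gives the same result as summing b"\x12\x34\x56\x00" with a zero byte appended.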
Just some feedback for the record:
Here is an eprof log from running josemic/enose (defrag branch) with msantos/epcap and msantos/pkt while downloading a file at 1.3 MByte/s.
Offloading is switched off.
Most of the time is spent in the functions pkt:comp and pkt:checksum. Any ideas on how to speed this up?