msantos / pkt

Erlang network protocol library
http://blog.listincomprehension.com/search/label/epcap
BSD 3-Clause "New" or "Revised" License
150 stars 44 forks source link

Checksum calculation is slow #10

Closed josemic closed 10 years ago

josemic commented 10 years ago

Here is a eprof log when running josemic/enose (defrag branch) using msantos/epcap and msantos/pkt while downloading a file at 1.3 MByte/s.

Offloading is switched off.

Most of the time is spent in the functions pkt:comp and pkt:checksum. Any ideas on how to speed this up?

28> eprof:analyze(total).                                            
FUNCTION                                                                  CALLS      %     TIME  [uS / CALLS]
--------                                                                  -----    ---     ----  [----------]
io:o_request/3                                                                1   0.00        2  [      2.00]
io:format/3                                                                   1   0.00        3  [      3.00]
io:execute_request/2                                                          1   0.00        3  [      3.00]
erlang:group_leader/0                                                         1   0.00        3  [      3.00]
io:bc_req/3                                                                   1   0.00        4  [      4.00]
gen:do_call/4                                                                 1   0.00        4  [      4.00]
gen_server:call/3                                                             1   0.00        4  [      4.00]
supervisor:call/2                                                             1   0.00        4  [      4.00]
net_kernel:dflag_unicode_io/1                                                 1   0.00        4  [      4.00]
erlang:whereis/1                                                              1   0.00        5  [      5.00]
erlang:make_ref/0                                                             1   0.00        5  [      5.00]
io:request/2                                                                  1   0.00        6  [      6.00]
io:default_output/0                                                           1   0.00        6  [      6.00]
defrag_server:insert_element/2                                                1   0.00        6  [      6.00]
supervisor:start_child/2                                                      1   0.00        6  [      6.00]
io:format/2                                                                   1   0.00        7  [      7.00]
io:io_request/2                                                               1   0.00        7  [      7.00]
gen:call/4                                                                    1   0.00        7  [      7.00]
defrag_root_sup:start_worker/3                                                1   0.00        9  [      9.00]
io:wait_io_mon_reply/2                                                        1   0.00       10  [     10.00]
erlang:setelement/3                                                           2   0.00       11  [      5.50]
erlang:demonitor/2                                                            2   0.00       14  [      7.00]
erlang:ref_to_list/1                                                          1   0.00       16  [     16.00]
erlang:list_to_atom/1                                                         1   0.00       16  [     16.00]
erlang:monitor/2                                                              2   0.00       17  [      8.50]
erlang:send/3                                                                 1   0.00       20  [     20.00]
parser_combinator_bitstring:binary_revers/1                                   4   0.00       22  [      5.50]
parser_combinator_bitstring:binary_revers/2                                  10   0.00       28  [      2.80]
parser_combinator_bitstring:matchCaseInsensitive/3                           30   0.00       94  [      3.13]
parser_combinator_bitstring:'-pBinarystringCaseInsensitive/1-fun-0-'/2       30   0.00      162  [      5.40]
defrag_server:handle_info/2                                                3250   0.03    10841  [      3.34]
gen_server:handle_common_reply/6                                           3249   0.03    10978  [      3.38]
s6:'-s/0-fun-0-'/1                                                         2042   0.04    12754  [      6.25]
s6:'-s/0-fun-1-'/1                                                         2042   0.04    12965  [      6.35]
parser_combinator_bitstring:'-pThen/2-fun-0-'/3                            4084   0.04    13434  [      3.29]
pkt_ether:type/1                                                           6500   0.04    14914  [      2.29]
parser_combinator_bitstring:'-pBetweenN/3-fun-0-'/4                        4084   0.04    15617  [      3.82]
pkt_tcp:options/1                                                          3250   0.04    15959  [      4.91]
pkt:ipproto/1                                                              6500   0.04    16307  [      2.51]
pkt_dlt:codec/1                                                            6500   0.05    16823  [      2.59]
pkt_ipproto:codec/1                                                        6500   0.05    17206  [      2.65]
defrag_worker:send_packet/2                                                3248   0.05    18095  [      5.57]
pkt:ether_type/1                                                           6500   0.05    18949  [      2.92]
epcap_port_lib:header/1                                                    3250   0.05    19190  [      5.90]
epcap_port_lib:tcp_flags/1                                                 3250   0.05    19954  [      6.14]
epcap_port_lib:decode/2                                                    3250   0.06    22116  [      6.80]
defrag_server:get_connection_worker_pid_by_address_tuple/2                 3254   0.06    22749  [      6.99]
pkt_ether:codec/1                                                          6500   0.06    23333  [      3.59]
parser_combinator_bitstring:pBetweenN/3                                    4084   0.07    24725  [      6.05]
gen_server:handle_msg/5                                                    7334   0.07    25430  [      3.47]
gen_server:decode_msg/8                                                    7334   0.07    25797  [      3.52]
lists:reverse/2                                                            9747   0.07    26658  [      2.73]
parser_combinator_bitstring:pThen/2                                        4084   0.08    28802  [      7.05]
parser_combinator_bitstring:pBinarystringCaseInsensitive/1                 4084   0.08    29844  [      7.31]
lists:reverse/1                                                            9750   0.08    30326  [      3.11]
pkt:dlt/1                                                                  6500   0.08    30687  [      4.72]
lists:member/2                                                            13000   0.09    31471  [      2.42]
parser_combinator_bitstring:pBinarystring/1                                4084   0.09    32389  [      7.93]
defrag_server:payloadsize/2                                                3250   0.09    32952  [     10.14]
content_worker:handle_call/3                                               4084   0.09    34026  [      8.33]
pkt_tcp:options/2                                                         12995   0.10    34730  [      2.67]
pkt:decapsulate/1                                                          6500   0.10    35189  [      5.41]
inet_parse:ntoa/1                                                          6500   0.10    35428  [      5.45]
pkt:ether/1                                                                6500   0.10    36081  [      5.55]
parser_combinator_bitstring:'-pUntilN/2-fun-0-'/3                          4086   0.10    36959  [      9.05]
epcap_port_lib:port/2                                                      6500   0.10    36995  [      5.69]
pkt_tcp:codec/1                                                            9750   0.11    38266  [      3.92]
parser_combinator_bitstring:parse/2                                        8168   0.11    40590  [      4.97]
pkt:makesum/1                                                              6500   0.11    40611  [      6.25]
lists:keyfind/3                                                           16250   0.12    42029  [      2.59]
pkt:ipv4/1                                                                 9750   0.14    49418  [      5.07]
parser_combinator_bitstring:pUntilN/2                                      8168   0.14    51463  [      6.30]
gen_server:reply/2                                                         4084   0.14    52419  [     12.84]
pkt:checksum/1                                                            13000   0.14    52540  [      4.04]
pkt_ipv4:codec/1                                                           9750   0.15    53636  [      5.50]
erlang:'++'/2                                                             19502   0.16    57459  [      2.95]
pkt:tcp/1                                                                  9750   0.18    64046  [      6.57]
pkt:decapsulate_next/2                                                    26000   0.18    64203  [      2.47]
pkt:next/1                                                                13000   0.19    69929  [      5.38]
gen_server:loop/6                                                          7333   0.20    73841  [     10.07]
epcap_port_lib:'-tcp_flags/1-lc$^0/1-0-'/1                                29250   0.23    83148  [      2.84]
erlang:integer_to_list/1                                                  26001   0.24    88303  [      3.40]
gen_server:dispatch/3                                                      3250   0.28   101920  [     31.36]
gen_fsm:send_event/2                                                       3248   2.38   865274  [    266.40]
parser_combinator_bitstring:untilN/4                                     412114   3.94  1432868  [      3.48]
parser_combinator_bitstring:match/3                                      413700   4.41  1602864  [      3.87]
parser_combinator_bitstring:'-pBinarystring/1-fun-0-'/2                  412084   8.17  2974405  [      7.22]
pkt:compl/2                                                             1635928  10.43  3793934  [      2.32]
lists:foldl/3                                                           1642427  10.49  3816434  [      2.32]
pkt:compl/1                                                             1635927  10.53  3830908  [      2.34]
pkt:'-checksum/1-fun-0-'/2                                              1635928  20.27  7376856  [      4.51]
pkt:'-checksum/1-blc$^0/1-0-'/1                                         1642936  24.18  8798810  [      5.36]
msantos commented 10 years ago

That's quite interesting. If you look at the last column of the call that takes the most time (the binary list comprehension), it is much faster than, for example, parser_combinator_bitstring:untilN/4 (5.36 vs 266.40 us). As I'd expect. checksum/1 is just being called a lot (twice per packet?).

I haven't tried to do any optimization so there is room for improvement. But since pkt_ipv4:codec/1 and pkt:tcp/1 have roughly the same times as checksum/1, not sure how much improvement. Even the BIFs like erlang:'++' and lists:keyfind/3 are around 2.5us.

So maybe using an NIF will half the time.

Does it make sense to checksum every packet? The kernel will drop packets with failed checksums, so the application won't see them.

josemic commented 10 years ago

1) The incomming packets are passed from epcap to defrag. (Defrag should be better called "Stream" to stay in suricatas vocabulary, I will rename it soon). "Stream" checks for every incoming packet for checksum and drops those with invalid checksums. Then it forwards it to the corresponding process of that connection. Which then tracks the SYNs, Fins and sequence numbers from both sides, drops packages which come out of sequence, buffers out of order packages, reassembles their data as soon as these are Acknowledged from the peer side. Acked data is buffered upon pieces of 1500 Bytes and then forwarded to content server for pattern matching on strings in part of the 1500 bytes buffer.

2) 266.40 ms apples for gen_fsm:send_event/2 not for parser_combinator_bitstring:untilN/4.

3) pkt:makesum is called 6500 times. Makesum seems to call checksum twice from pkt internally, as checksum is called 13000 times. Assumed, that 1/2 or 1/3 of the packages are ACKs, ca. 3250 to 4333 packages of 1440 bytes have been received in download direction and are forwarded by gen_fsm:send_event/2 as 1500 byte packages to the content server.

4) Yes. It makes sense to checksum every packet. Tracking Sequence Numbers as done by Stream does not work when broken packages are around. But "Stream" is only necessary, if the data is intentionally sent in small pieces, which are made so small that pattern matching does not work or mixed up with broken packages which shall confuse the IDS.

5) I am not yet sure, whether a checksum NIF should be called for every packet or be integrated into epcap and drop all packages with broken checksum.

6) Any idea if http://erllvm.softlab.ntua.gr/ would bring any improvements? Are the NIFs compatible?

Note:

msantos commented 10 years ago

Sorry about the mistake in reading the chart.

The usual strategies apply: minimize the number of checksum operations, reduce the amount of state and optimize.

If you are using hardware offloading, your process will not need to handle the checksums.

If you want to handle the checksums yourself, maybe there is a pcap expression that can minimize some of the packets passed to you.

You can experiment with timings for different checksums implementations. Patches welcome!

No idea about the LLVM port. You can try compiling with HIPE.

josemic commented 10 years ago

1) First source of performance leak was identified: I had running the observer on the same erlang shell like enose. See here is how to run observer correctly: http://www.erlang.org/doc/apps/observer/observer_ug.html This improves the performance significantly, from almost 100% load on 2 processors to 30 % load on 2 processors. 2) Using Hipe: Hipe did not bring any performance improvements..

3) Performance improvements to checksum calculation: See: www.erlang.se/workshop/2005/tcpip_presentation.pdf‎

josemic commented 10 years ago

1) Here is a proposal based on the second algorithm at: http://locklessinc.com/articles/tcp_checksum/

checksum(Bin) ->
    checksum(Bin, 0).

checksum(<<N1:64/integer, N2:64/integer, N3:64/integer, N4:64/integer, N5:64/integer, N6:64/integer, N7:64/integer, N8:64/integer, ReminderBin/binary>>, Checksum128Bit) ->
    checksum(ReminderBin, N1+N2+N3+N4+N5+N6+N7+N8+Checksum128Bit);

checksum(<<N1:64/integer, N2:64/integer, N3:64/integer, N4:64/integer, ReminderBin/binary>>, Checksum128Bit) ->
    checksum(ReminderBin, N1+N2+N3+N4+Checksum128Bit);

checksum(<<N1:64/integer, N2:64/integer, ReminderBin/binary>>, Checksum128Bit) ->
    checksum(ReminderBin, N1+N2+Checksum128Bit);

checksum(<<N:64/integer, ReminderBin/binary>>, Checksum128Bit) ->
    checksum(ReminderBin, N+Checksum128Bit);

checksum(<<N:16/integer, ReminderBin/binary>>, Checksum128Bit) ->
    checksum(ReminderBin, N+Checksum128Bit);

checksum(<<N:8/integer>>, Checksum128Bit) ->
    checksum(<<>>, (N bsl 8)+Checksum128Bit);

checksum(<<>>, Checksum128Bit) ->
        Checksum64Bit = foldWithOverflow64(Checksum128Bit),
        Checksum32Bit = foldWithOverflow32(Checksum64Bit),
        Checksum16Bit = foldWithOverflow16(Checksum32Bit),
        Checksum16Bit.

foldWithOverflow64(A) ->
        C = A band 16#FFFFFFFFFFFFFFFF,
    D = (A bsr 64) band 16#FFFFFFFFFFFFFFFF,
    E = (C + D) band 16#FFFFFFFFFFFFFFFF,
        case E < D of
        true ->
            F = E + 1; % overflow
        false ->
            F = E
         end,
         F.

foldWithOverflow32(A) ->
        C = A band 16#FFFFFFFF,
    D = (A bsr 32) band 16#FFFFFFFF,
    E = (C + D) band 16#FFFFFFFF, 
        case E < D of
        true ->
            F = E + 1; % overflow
        false ->
            F = E
         end,
         F.

foldWithOverflow16(A) ->
        C = A band 16#FFFF,
    D = (A bsr 16) band 16#FFFF,
    E = (C + D) band 16#FFFF, 
        case E < D of
        true ->
            F = E + 1; % overflow
        false ->
            F = E
         end,
         F.         

makesum(Hdr) -> 
    (checksum(Hdr) bxor 16#FFFF) band 16#FFFF. % bitwise-complement

Here is the profiling:

....
****** Process <0.64.0>    -- 42.05 % of profiled time *** 
FUNCTION                                                      CALLS      %     TIME  [uS / CALLS]
--------                                                      -----    ---     ----  [----------]
pkt_tcp:'-options/2-blc$^0/1-0-'/1                             7059   0.16    41432  [      5.87]
pkt_ipproto:codec/1                                           53518   0.76   196379  [      3.67]
gen_server:decode_msg/8                                       53518   0.81   209445  [      3.91]
gen_server:handle_msg/5                                       53518   0.82   212114  [      3.96]
defrag_server:handle_info/2                                   53518   0.84   216585  [      4.05]
pkt:decapsulate/1                                             53518   0.85   219462  [      4.10]
pkt_ether:type/1                                              53518   0.86   221997  [      4.15]
pkt:ipproto/1                                                 53518   0.90   231042  [      4.32]
pkt_dlt:codec/1                                               53518   0.92   235651  [      4.40]
gen_server:handle_common_reply/6                              53517   0.92   237503  [      4.44]
pkt:ether_type/1                                              53518   1.01   258806  [      4.84]
pkt_ether:codec/1                                             53518   1.01   260973  [      4.88]
gen_server:loop/6                                             53517   1.15   295331  [      5.52]
pkt:ether/1                                                   53518   1.49   382783  [      7.15]
defrag_worker:send_packet/2                                   53518   1.65   424154  [      7.93]
gen_server:dispatch/3                                         53518   1.67   430923  [      8.05]
defrag_server:get_connection_worker_pid_by_address_tuple/2    53518   1.70   438353  [      8.19]
lists:reverse/2                                              107036   1.76   453021  [      4.23]
pkt:foldWithOverflow16/1                                     107036   1.76   454072  [      4.24]
epcap_port_lib:decode/3                                       53518   1.78   458708  [      8.57]
pkt:dlt/1                                                     53518   1.89   487963  [      9.12]
pkt_tcp:options/1                                             53518   1.91   491872  [      9.19]
lists:reverse/1                                              107036   1.93   497500  [      4.65]
defrag_server:payloadsize/2                                   53518   2.38   612710  [     11.45]
pkt_tcp:codec/1                                              107036   2.56   659537  [      6.16]
pkt_ipv4:codec/1                                             107036   2.88   741574  [      6.93]
pkt:ipv4/1                                                   107036   3.30   848647  [      7.93]
pkt:foldWithOverflow32/1                                     107036   3.30   848749  [      7.93]
pkt:next/1                                                   107036   3.33   857463  [      8.01]
pkt:decapsulate_next/2                                       214072   3.39   873092  [      4.08]
pkt:tcp/1                                                    107036   3.48   895183  [      8.36]
pkt:foldWithOverflow64/1                                     107036   3.59   925494  [      8.65]
pkt_tcp:options/2                                            219841   3.68   947388  [      4.31]
pkt:makesum/1                                                107036   3.77   970288  [      9.07]
gen_fsm:send_event/2                                          53518   4.00  1030410  [     19.25]
pkt:checksum/1                                               214072   5.50  1416203  [      6.62]
pkt:checksum/2                                              1248045  26.28  6767724  [      5.42]

Not even that pkt:checksum/2 is a little bit faster, also up to 16 times (in average 6-7 times) less called per received packet.

As this algorithm is much more complicated, should we add this under another name e.g. checksum fast or should the existing one be replaced? I will then create a branch and a pull request.

2) The average load goes down from 60% on one processor to 50 % on one processor.

josemic commented 10 years ago

It is unclear, whether the following rebar.config options in msantos/pkt have any performance impact. While the observer graphs do not show any significant difference, when deactivating those. It seems likely that at least activation of coverage analysis costs some performance.

pkt/rebar.config

{xref_checks, [undefined_function_calls]}.
{cover_enabled, true}.
{eunit_opts, [verbose, {report, {eunit_surefire, [{dir, "."}]}}]}.
ates commented 10 years ago

As I know the {cover_enabled, true} option is related only to eunit. It creates the coverage report in html format after "rebar eunit" run.

msantos commented 10 years ago

@josemic your proposal looks great! That is quite an improvement. I think you should replace the checksum function with your version.

josemic commented 10 years ago

1) I have updated the checksum function above to the latest version. In average the package checksum should now be calculated in 5.42 * 23 = 124 Micro seconds compared with 1354.Micro seconds before.

2) I have updated to the latest version of msantos/pkt. I got the feeling that due to this update, the scheduler utilization has gone up from 50 % to 60 %. Any idea, if there were performance relevant changes?

3) I have added some test cases. When using binary values, I get differences for checksums using odd number of bytes. I assume the old checksum algorithm does not correctly calculate checksums, when feed with binary values of odd length directly instead of records. For the new checksum algorithm, we can remove padding of the #ip4- and #ip6-records.

4) I created a pull request

msantos commented 10 years ago
  1. Nice work!
  2. This is between the old dev version you were using and the current master? The old version of decapsulate/1 hard coded the next protocol types. The new version calls a function which calls out to a function in the codec module to look up the protocol. This might be causing the slow down you are seeing.
  3. Could be. Think the old checksum was just a translation from C to Erlang of the checksum algorithm in BSD ping. Which padding are you referring to? The pad after the IPv4 options? I believe the padding is a requirement.
  4. Thanks again!
josemic commented 10 years ago
  1. Yes. Between the old dev version I was and the current master.
  2. To the padding below, e.g. for UDP. Same for TCP.:
% UDP pseudoheader checksum
checksum([#ipv4{
        saddr = {SA1,SA2,SA3,SA4},
        daddr = {DA1,DA2,DA3,DA4}
    },
    #udp{
        ulen = Len
    } = Hdr,
    Payload
]) ->
    UDP = udp(Hdr),
    Pad = case Len rem 2 of
        0 -> 0;
        1 -> 8
    end,
    checksum(
        <<SA1,SA2,SA3,SA4,
          DA1,DA2,DA3,DA4,
          0:8,
          ?IPPROTO_UDP:8,
          Len:16,
          UDP/binary,
          Payload/bits,
          0:Pad>>
    );

I could not test this one, but this is how I understand RFC 793.:

checksum(<<N:8/integer>>, Checksum128Bit) ->
         checksum(<<>>, (N bsl 8)+Checksum128Bit);
msantos commented 10 years ago
  1. I suspected there would be a performance impact with the inter-module calls. I planned on benchmarking the old vs new versions but didn't get around to it. I think there are could be some easy improvements, like inlining some of the functions or moving the functions back into the pkt module.
  2. Ok, looks good to me!
josemic commented 10 years ago
  1. I did some test with lining in pkt.erl, but I did not see any improvements.
    I tired:
    -compile(inline).
    or:
    -compile(inline).
    -compile({inline_size,100}).
josemic commented 10 years ago

Just some feedback for the record:

overloadcondition