Checksum calculation is slow

Here is a eprof log when running josemic/enose (defrag branch) using msantos/epcap and msantos/pkt while downloading a file at 1.3 MByte/s.

Offloading is switched off.

Most of the time is spent in the functions pkt:comp and pkt:checksum. Any ideas on how to speed this up?

28> eprof:analyze(total).                                            
FUNCTION                                                                  CALLS      %     TIME  [uS / CALLS]
--------                                                                  -----    ---     ----  [----------]
io:o_request/3                                                                1   0.00        2  [      2.00]
io:format/3                                                                   1   0.00        3  [      3.00]
io:execute_request/2                                                          1   0.00        3  [      3.00]
erlang:group_leader/0                                                         1   0.00        3  [      3.00]
io:bc_req/3                                                                   1   0.00        4  [      4.00]
gen:do_call/4                                                                 1   0.00        4  [      4.00]
gen_server:call/3                                                             1   0.00        4  [      4.00]
supervisor:call/2                                                             1   0.00        4  [      4.00]
net_kernel:dflag_unicode_io/1                                                 1   0.00        4  [      4.00]
erlang:whereis/1                                                              1   0.00        5  [      5.00]
erlang:make_ref/0                                                             1   0.00        5  [      5.00]
io:request/2                                                                  1   0.00        6  [      6.00]
io:default_output/0                                                           1   0.00        6  [      6.00]
defrag_server:insert_element/2                                                1   0.00        6  [      6.00]
supervisor:start_child/2                                                      1   0.00        6  [      6.00]
io:format/2                                                                   1   0.00        7  [      7.00]
io:io_request/2                                                               1   0.00        7  [      7.00]
gen:call/4                                                                    1   0.00        7  [      7.00]
defrag_root_sup:start_worker/3                                                1   0.00        9  [      9.00]
io:wait_io_mon_reply/2                                                        1   0.00       10  [     10.00]
erlang:setelement/3                                                           2   0.00       11  [      5.50]
erlang:demonitor/2                                                            2   0.00       14  [      7.00]
erlang:ref_to_list/1                                                          1   0.00       16  [     16.00]
erlang:list_to_atom/1                                                         1   0.00       16  [     16.00]
erlang:monitor/2                                                              2   0.00       17  [      8.50]
erlang:send/3                                                                 1   0.00       20  [     20.00]
parser_combinator_bitstring:binary_revers/1                                   4   0.00       22  [      5.50]
parser_combinator_bitstring:binary_revers/2                                  10   0.00       28  [      2.80]
parser_combinator_bitstring:matchCaseInsensitive/3                           30   0.00       94  [      3.13]
parser_combinator_bitstring:'-pBinarystringCaseInsensitive/1-fun-0-'/2       30   0.00      162  [      5.40]
defrag_server:handle_info/2                                                3250   0.03    10841  [      3.34]
gen_server:handle_common_reply/6                                           3249   0.03    10978  [      3.38]
s6:'-s/0-fun-0-'/1                                                         2042   0.04    12754  [      6.25]
s6:'-s/0-fun-1-'/1                                                         2042   0.04    12965  [      6.35]
parser_combinator_bitstring:'-pThen/2-fun-0-'/3                            4084   0.04    13434  [      3.29]
pkt_ether:type/1                                                           6500   0.04    14914  [      2.29]
parser_combinator_bitstring:'-pBetweenN/3-fun-0-'/4                        4084   0.04    15617  [      3.82]
pkt_tcp:options/1                                                          3250   0.04    15959  [      4.91]
pkt:ipproto/1                                                              6500   0.04    16307  [      2.51]
pkt_dlt:codec/1                                                            6500   0.05    16823  [      2.59]
pkt_ipproto:codec/1                                                        6500   0.05    17206  [      2.65]
defrag_worker:send_packet/2                                                3248   0.05    18095  [      5.57]
pkt:ether_type/1                                                           6500   0.05    18949  [      2.92]
epcap_port_lib:header/1                                                    3250   0.05    19190  [      5.90]
epcap_port_lib:tcp_flags/1                                                 3250   0.05    19954  [      6.14]
epcap_port_lib:decode/2                                                    3250   0.06    22116  [      6.80]
defrag_server:get_connection_worker_pid_by_address_tuple/2                 3254   0.06    22749  [      6.99]
pkt_ether:codec/1                                                          6500   0.06    23333  [      3.59]
parser_combinator_bitstring:pBetweenN/3                                    4084   0.07    24725  [      6.05]
gen_server:handle_msg/5                                                    7334   0.07    25430  [      3.47]
gen_server:decode_msg/8                                                    7334   0.07    25797  [      3.52]
lists:reverse/2                                                            9747   0.07    26658  [      2.73]
parser_combinator_bitstring:pThen/2                                        4084   0.08    28802  [      7.05]
parser_combinator_bitstring:pBinarystringCaseInsensitive/1                 4084   0.08    29844  [      7.31]
lists:reverse/1                                                            9750   0.08    30326  [      3.11]
pkt:dlt/1                                                                  6500   0.08    30687  [      4.72]
lists:member/2                                                            13000   0.09    31471  [      2.42]
parser_combinator_bitstring:pBinarystring/1                                4084   0.09    32389  [      7.93]
defrag_server:payloadsize/2                                                3250   0.09    32952  [     10.14]
content_worker:handle_call/3                                               4084   0.09    34026  [      8.33]
pkt_tcp:options/2                                                         12995   0.10    34730  [      2.67]
pkt:decapsulate/1                                                          6500   0.10    35189  [      5.41]
inet_parse:ntoa/1                                                          6500   0.10    35428  [      5.45]
pkt:ether/1                                                                6500   0.10    36081  [      5.55]
parser_combinator_bitstring:'-pUntilN/2-fun-0-'/3                          4086   0.10    36959  [      9.05]
epcap_port_lib:port/2                                                      6500   0.10    36995  [      5.69]
pkt_tcp:codec/1                                                            9750   0.11    38266  [      3.92]
parser_combinator_bitstring:parse/2                                        8168   0.11    40590  [      4.97]
pkt:makesum/1                                                              6500   0.11    40611  [      6.25]
lists:keyfind/3                                                           16250   0.12    42029  [      2.59]
pkt:ipv4/1                                                                 9750   0.14    49418  [      5.07]
parser_combinator_bitstring:pUntilN/2                                      8168   0.14    51463  [      6.30]
gen_server:reply/2                                                         4084   0.14    52419  [     12.84]
pkt:checksum/1                                                            13000   0.14    52540  [      4.04]
pkt_ipv4:codec/1                                                           9750   0.15    53636  [      5.50]
erlang:'++'/2                                                             19502   0.16    57459  [      2.95]
pkt:tcp/1                                                                  9750   0.18    64046  [      6.57]
pkt:decapsulate_next/2                                                    26000   0.18    64203  [      2.47]
pkt:next/1                                                                13000   0.19    69929  [      5.38]
gen_server:loop/6                                                          7333   0.20    73841  [     10.07]
epcap_port_lib:'-tcp_flags/1-lc$^0/1-0-'/1                                29250   0.23    83148  [      2.84]
erlang:integer_to_list/1                                                  26001   0.24    88303  [      3.40]
gen_server:dispatch/3                                                      3250   0.28   101920  [     31.36]
gen_fsm:send_event/2                                                       3248   2.38   865274  [    266.40]
parser_combinator_bitstring:untilN/4                                     412114   3.94  1432868  [      3.48]
parser_combinator_bitstring:match/3                                      413700   4.41  1602864  [      3.87]
parser_combinator_bitstring:'-pBinarystring/1-fun-0-'/2                  412084   8.17  2974405  [      7.22]
pkt:compl/2                                                             1635928  10.43  3793934  [      2.32]
lists:foldl/3                                                           1642427  10.49  3816434  [      2.32]
pkt:compl/1                                                             1635927  10.53  3830908  [      2.34]
pkt:'-checksum/1-fun-0-'/2                                              1635928  20.27  7376856  [      4.51]
pkt:'-checksum/1-blc$^0/1-0-'/1                                         1642936  24.18  8798810  [      5.36]

That's quite interesting. If you look at the last column of the call that takes the most time (the binary list comprehension), it is much faster than, for example, parser_combinator_bitstring:untilN/4 (5.36 vs 266.40 us). As I'd expect. checksum/1 is just being called a lot (twice per packet?).

I haven't tried to do any optimization so there is room for improvement. But since pkt_ipv4:codec/1 and pkt:tcp/1 have roughly the same times as checksum/1, not sure how much improvement. Even the BIFs like erlang:'++' and lists:keyfind/3 are around 2.5us.

So maybe using an NIF will half the time.

Does it make sense to checksum every packet? The kernel will drop packets with failed checksums, so the application won't see them.

1) The incomming packets are passed from epcap to defrag. (Defrag should be better called "Stream" to stay in suricatas vocabulary, I will rename it soon). "Stream" checks for every incoming packet for checksum and drops those with invalid checksums. Then it forwards it to the corresponding process of that connection. Which then tracks the SYNs, Fins and sequence numbers from both sides, drops packages which come out of sequence, buffers out of order packages, reassembles their data as soon as these are Acknowledged from the peer side. Acked data is buffered upon pieces of 1500 Bytes and then forwarded to content server for pattern matching on strings in part of the 1500 bytes buffer.

2) 266.40 ms apples for gen_fsm:send_event/2 not for parser_combinator_bitstring:untilN/4.

3) pkt:makesum is called 6500 times. Makesum seems to call checksum twice from pkt internally, as checksum is called 13000 times. Assumed, that 1/2 or 1/3 of the packages are ACKs, ca. 3250 to 4333 packages of 1440 bytes have been received in download direction and are forwarded by gen_fsm:send_event/2 as 1500 byte packages to the content server.

4) Yes. It makes sense to checksum every packet. Tracking Sequence Numbers as done by Stream does not work when broken packages are around. But "Stream" is only necessary, if the data is intentionally sent in small pieces, which are made so small that pattern matching does not work or mixed up with broken packages which shall confuse the IDS.

5) I am not yet sure, whether a checksum NIF should be called for every packet or be integrated into epcap and drop all packages with broken checksum.

6) Any idea if http://erllvm.softlab.ntua.gr/ would bring any improvements? Are the NIFs compatible?

Note:

This data is still based on old dev branch of pkt.

Sorry about the mistake in reading the chart.

The usual strategies apply: minimize the number of checksum operations, reduce the amount of state and optimize.

If you are using hardware offloading, your process will not need to handle the checksums.

If you want to handle the checksums yourself, maybe there is a pcap expression that can minimize some of the packets passed to you.

You can experiment with timings for different checksums implementations. Patches welcome!

No idea about the LLVM port. You can try compiling with HIPE.

1) First source of performance leak was identified: I had running the observer on the same erlang shell like enose. See here is how to run observer correctly: http://www.erlang.org/doc/apps/observer/observer_ug.html This improves the performance significantly, from almost 100% load on 2 processors to 30 % load on 2 processors. 2) Using Hipe: Hipe did not bring any performance improvements..

3) Performance improvements to checksum calculation: See: www.erlang.se/workshop/2005/tcpip_presentation.pdf‎

1) Here is a proposal based on the second algorithm at: http://locklessinc.com/articles/tcp_checksum/

checksum(Bin) ->
    checksum(Bin, 0).

checksum(<<N1:64/integer, N2:64/integer, N3:64/integer, N4:64/integer, N5:64/integer, N6:64/integer, N7:64/integer, N8:64/integer, ReminderBin/binary>>, Checksum128Bit) ->
    checksum(ReminderBin, N1+N2+N3+N4+N5+N6+N7+N8+Checksum128Bit);

checksum(<<N1:64/integer, N2:64/integer, N3:64/integer, N4:64/integer, ReminderBin/binary>>, Checksum128Bit) ->
    checksum(ReminderBin, N1+N2+N3+N4+Checksum128Bit);

checksum(<<N1:64/integer, N2:64/integer, ReminderBin/binary>>, Checksum128Bit) ->
    checksum(ReminderBin, N1+N2+Checksum128Bit);

checksum(<<N:64/integer, ReminderBin/binary>>, Checksum128Bit) ->
    checksum(ReminderBin, N+Checksum128Bit);

checksum(<<N:16/integer, ReminderBin/binary>>, Checksum128Bit) ->
    checksum(ReminderBin, N+Checksum128Bit);

checksum(<<N:8/integer>>, Checksum128Bit) ->
    checksum(<<>>, (N bsl 8)+Checksum128Bit);

checksum(<<>>, Checksum128Bit) ->
        Checksum64Bit = foldWithOverflow64(Checksum128Bit),
        Checksum32Bit = foldWithOverflow32(Checksum64Bit),
        Checksum16Bit = foldWithOverflow16(Checksum32Bit),
        Checksum16Bit.

foldWithOverflow64(A) ->
        C = A band 16#FFFFFFFFFFFFFFFF,
    D = (A bsr 64) band 16#FFFFFFFFFFFFFFFF,
    E = (C + D) band 16#FFFFFFFFFFFFFFFF,
        case E < D of
        true ->
            F = E + 1; % overflow
        false ->
            F = E
         end,
         F.

foldWithOverflow32(A) ->
        C = A band 16#FFFFFFFF,
    D = (A bsr 32) band 16#FFFFFFFF,
    E = (C + D) band 16#FFFFFFFF, 
        case E < D of
        true ->
            F = E + 1; % overflow
        false ->
            F = E
         end,
         F.

foldWithOverflow16(A) ->
        C = A band 16#FFFF,
    D = (A bsr 16) band 16#FFFF,
    E = (C + D) band 16#FFFF, 
        case E < D of
        true ->
            F = E + 1; % overflow
        false ->
            F = E
         end,
         F.         

makesum(Hdr) -> 
    (checksum(Hdr) bxor 16#FFFF) band 16#FFFF. % bitwise-complement

Here is the profiling:

....
****** Process <0.64.0>    -- 42.05 % of profiled time *** 
FUNCTION                                                      CALLS      %     TIME  [uS / CALLS]
--------                                                      -----    ---     ----  [----------]
pkt_tcp:'-options/2-blc$^0/1-0-'/1                             7059   0.16    41432  [      5.87]
pkt_ipproto:codec/1                                           53518   0.76   196379  [      3.67]
gen_server:decode_msg/8                                       53518   0.81   209445  [      3.91]
gen_server:handle_msg/5                                       53518   0.82   212114  [      3.96]
defrag_server:handle_info/2                                   53518   0.84   216585  [      4.05]
pkt:decapsulate/1                                             53518   0.85   219462  [      4.10]
pkt_ether:type/1                                              53518   0.86   221997  [      4.15]
pkt:ipproto/1                                                 53518   0.90   231042  [      4.32]
pkt_dlt:codec/1                                               53518   0.92   235651  [      4.40]
gen_server:handle_common_reply/6                              53517   0.92   237503  [      4.44]
pkt:ether_type/1                                              53518   1.01   258806  [      4.84]
pkt_ether:codec/1                                             53518   1.01   260973  [      4.88]
gen_server:loop/6                                             53517   1.15   295331  [      5.52]
pkt:ether/1                                                   53518   1.49   382783  [      7.15]
defrag_worker:send_packet/2                                   53518   1.65   424154  [      7.93]
gen_server:dispatch/3                                         53518   1.67   430923  [      8.05]
defrag_server:get_connection_worker_pid_by_address_tuple/2    53518   1.70   438353  [      8.19]
lists:reverse/2                                              107036   1.76   453021  [      4.23]
pkt:foldWithOverflow16/1                                     107036   1.76   454072  [      4.24]
epcap_port_lib:decode/3                                       53518   1.78   458708  [      8.57]
pkt:dlt/1                                                     53518   1.89   487963  [      9.12]
pkt_tcp:options/1                                             53518   1.91   491872  [      9.19]
lists:reverse/1                                              107036   1.93   497500  [      4.65]
defrag_server:payloadsize/2                                   53518   2.38   612710  [     11.45]
pkt_tcp:codec/1                                              107036   2.56   659537  [      6.16]
pkt_ipv4:codec/1                                             107036   2.88   741574  [      6.93]
pkt:ipv4/1                                                   107036   3.30   848647  [      7.93]
pkt:foldWithOverflow32/1                                     107036   3.30   848749  [      7.93]
pkt:next/1                                                   107036   3.33   857463  [      8.01]
pkt:decapsulate_next/2                                       214072   3.39   873092  [      4.08]
pkt:tcp/1                                                    107036   3.48   895183  [      8.36]
pkt:foldWithOverflow64/1                                     107036   3.59   925494  [      8.65]
pkt_tcp:options/2                                            219841   3.68   947388  [      4.31]
pkt:makesum/1                                                107036   3.77   970288  [      9.07]
gen_fsm:send_event/2                                          53518   4.00  1030410  [     19.25]
pkt:checksum/1                                               214072   5.50  1416203  [      6.62]
pkt:checksum/2                                              1248045  26.28  6767724  [      5.42]

Not even that pkt:checksum/2 is a little bit faster, also up to 16 times (in average 6-7 times) less called per received packet.

As this algorithm is much more complicated, should we add this under another name e.g. checksum fast or should the existing one be replaced? I will then create a branch and a pull request.

2) The average load goes down from 60% on one processor to 50 % on one processor.

It is unclear, whether the following rebar.config options in msantos/pkt have any performance impact. While the observer graphs do not show any significant difference, when deactivating those. It seems likely that at least activation of coverage analysis costs some performance.

pkt/rebar.config

{xref_checks, [undefined_function_calls]}.
{cover_enabled, true}.
{eunit_opts, [verbose, {report, {eunit_surefire, [{dir, "."}]}}]}.

As I know the {cover_enabled, true} option is related only to eunit. It creates the coverage report in html format after "rebar eunit" run.

@josemic your proposal looks great! That is quite an improvement. I think you should replace the checksum function with your version.

1) I have updated the checksum function above to the latest version. In average the package checksum should now be calculated in 5.42 * 23 = 124 Micro seconds compared with 1354.Micro seconds before.

2) I have updated to the latest version of msantos/pkt. I got the feeling that due to this update, the scheduler utilization has gone up from 50 % to 60 %. Any idea, if there were performance relevant changes?

3) I have added some test cases. When using binary values, I get differences for checksums using odd number of bytes. I assume the old checksum algorithm does not correctly calculate checksums, when feed with binary values of odd length directly instead of records. For the new checksum algorithm, we can remove padding of the #ip4- and #ip6-records.

4) I created a pull request

Nice work!
This is between the old dev version you were using and the current master? The old version of decapsulate/1 hard coded the next protocol types. The new version calls a function which calls out to a function in the codec module to look up the protocol. This might be causing the slow down you are seeing.
Could be. Think the old checksum was just a translation from C to Erlang of the checksum algorithm in BSD ping. Which padding are you referring to? The pad after the IPv4 options? I believe the padding is a requirement.
Thanks again!

Yes. Between the old dev version I was and the current master.
To the padding below, e.g. for UDP. Same for TCP.:

% UDP pseudoheader checksum
checksum([#ipv4{
        saddr = {SA1,SA2,SA3,SA4},
        daddr = {DA1,DA2,DA3,DA4}
    },
    #udp{
        ulen = Len
    } = Hdr,
    Payload
]) ->
    UDP = udp(Hdr),
    Pad = case Len rem 2 of
        0 -> 0;
        1 -> 8
    end,
    checksum(
        <<SA1,SA2,SA3,SA4,
          DA1,DA2,DA3,DA4,
          0:8,
          ?IPPROTO_UDP:8,
          Len:16,
          UDP/binary,
          Payload/bits,
          0:Pad>>
    );

I could not test this one, but this is how I understand RFC 793.:

checksum(<<N:8/integer>>, Checksum128Bit) ->
         checksum(<<>>, (N bsl 8)+Checksum128Bit);

I suspected there would be a performance impact with the inter-module calls. I planned on benchmarking the old vs new versions but didn't get around to it. I think there are could be some easy improvements, like inlining some of the functions or moving the functions back into the pkt module.
Ok, looks good to me!

I did some test with lining in pkt.erl, but I did not see any improvements.
I tired:
-compile(inline).
or:
-compile(inline).
-compile({inline_size,100}).

Just some feedback for the record:

Using erlang-hipe did not bring any speed improvements. (Updated the comment above.)
Purchased an Intel network card and installed PF_RING from here:
https://redmine.openinfosecfoundation.org/projects/suricata/wiki/Installation_with_PF_RING
safemedia.com/builds/UsersGuide.pdf
Did some optimizations in other parts of the program, which caused overload however have not been pointed out by eprof. After that the load was 40% on one core. The other core 0 % for 1.5 MByte/s throughput.
Using PF_RING with eth1 reduced the load furthermore from 30 to 40 percent.
While I did not get PF_RING network interface dna0 or dna1 working, the total throughput on my 6 years old machine on eth1 is up to 7-8 Mbyte/s.(All packets are checksum checked.)
When producing overload > 8Mbyte, looks like packets get lost:
Seeing message queue pile up and erlang:bif_return_trap gets called. (See image)
Do not find much about bif_return_trap. Except: http://permalink.gmane.org/gmane.comp.lang.erlang.general/65945 Is the bif_return_trap related to epcap or a general Erlang issue?

overloadcondition

msantos / pkt

Checksum calculation is slow #10