rsmmr / hilti

**NOTE**: This is outdated and no longer maintained. There's a new version at https://github.com/zeek/spicy.
Other
40 stars 22 forks source link

Use __hlt_bytes_find_bytes to translate '&until' spicy attribute ? #22

Open FrozenCaribou opened 8 years ago

FrozenCaribou commented 8 years ago

Hello !

I am currently testing CPU performance of parsers and i think found an improvement for &unti spicy attribute.

Here a representative parser of my test :

module Foo;

export type Test = unit {
  data : bytes &until("\x00");
};

I notice the processing time is quite sensitive (nearly x2) to the payload size because of &until("\x00") when "\x00" delimiter is not present.

I see that the &until attribute is translated into unpack (data, cur) Hilti::Packed::BytesDelim "\x00" which is transformed into a llvm loop composed of llvm basic block. https://github.com/rsmmr/hilti/blob/master/hilti/codegen/unpacker.cc#L134-L210

At each loop step tree functions are called:

The pseudo code is :

while 1 {
  if ( hlt_iterator_bytes_eq(cur, END_BYTE))
     //  => raise Hilti::WouldBloc
     break;
  else{
    if hlt_bytes_match_at(cur, delim)
      // => Find !
      break;
    else
      hlt_iterator_bytes_incr(cur) # cur = cur+1
  }
}

Even through these functions are "inlines", the loop assembly is quite heavy to just search a string pattern in a bytes.

I saw that hlt_bytes_find_byte_from [EDIT : `hlt_bytes_find_bytes] exists to search a string pattern, it is written in C and usememchrof libc to search a char.memch` seems to be optimized for performance (better than a simple loop on each char) https://github.com/rsmmr/hilti/blob/master/libhilti/bytes.c#L1019-L1029 http://code.metager.de/source/xref/gnu/glibc/string/memchr.c

Before trying to use __hlt_bytes_find_bytes for &until i would like to know if there is maybe a reason that it is not used in the unpacker.cc ?

rsmmr commented 8 years ago

Yeah, quite heavy for sure in this case. Iirc, the main reason for this is supporting searching for more than a single byte. __hlt_bytes_find_byte_from looks only for a single byte. If the code generator can guarantee that the delimiter is indeed just exactly one byte, switching over should be fine (i.e., it needs to be a byte constant of length 1).

More generally, there are plenty cases in the code generator right now where it picks the most generic/easy path but could optimize further for special cases where it knows more. There's also a bit handwaving that LLVM's optimization might take care of some stuff which at first appears inefficient (might be interesting to look at the actual LLVM bit code generated here after optimization).

One more note: are you confident it's indeed the searching that's expensive? The data : bytes &until("\x00"); construct is expensive is a different way too: it needs to buffer al the data for later access. If's the matching input is long, that can be significant. A way to avoid that is using &transient, which doesn't store the whole thing but still gives access to chunks as they come in.

FrozenCaribou commented 8 years ago

I did a mistake in my previous post, in fact it is hlt_bytes_find_bytes (or hlt_bytes_find_bytes_at_iter to use with a byte iterator) and not __hlt_bytes_find_byte_from. It seems they can be used with bytes pattern. https://github.com/rsmmr/hilti/blob/master/libhilti/bytes.h#L192-L213

Effectively for LLVM optimization, I did not check how it optimizes it. In my case I just did a callgrind analysis and compared the difference of instruction fetches between a small payload and a big one (A simple TFTP request with no option (delimited by \x00) and a big one with an option but without \x00). I will try the &transient to see the difference and also check if it is really the searching which is expensive.

Today I have started to implement a quick patch to use hlt_bytes_find_bytes, for the moment it works with a hlt script but I did not success yet to test it with a spicy analyzer (need to convert a hilti string into byte before calling the function). I am not very confortable with llvm code generation ^^

I will give you more information about results and perform test comparisons.

FrozenCaribou commented 8 years ago

Hello Robin,

I have a first version patch using hlt_bytes_find_bytes :) ! Here my perfomance results:

For this test I wrote a simple analyzer:

module FOO;

export type Test = unit {
  data: bytes &until(b"\x00");
};
"A" "A"*1300 Diff perf % IR "A" IR "A"*1300 Diff IR %
master 11135.73869 98344.34831 783.1416673 34766 311453 795.8551458
master &transient 10924.07524 91367.08665 736.3828029 32490 309177 851.6066482
hlt_bytes_find_bytes 10365.35155 10746.25974 3.674821712 37305 37690 1.03203324
hlt_bytes_find_bytes &transient 9675.547945 9716.534161 0.4236061515 35033 35418 1.098963834

I perfomed 4 differents tests:

Time measure

The two first rows contain the time in nanosecond. It is an average on about 1000 packets. I measured it by getting the time before and after the parsing call function: void* pboj =(*endp->parser->parse_func)(endp->data, &endp->cookie, &except, ctx)

Instruction fetches

The "IR" rows contain the number of instruction fetches regarding the ".hlt.foo_parse_Test_internal" that executes protocol decoding. It given by kcachegrind with a valgrind record. When I measure "IR" I use only one packet because the result is identical for a same packet (same execution path)

Payloads I sent were just "A" string without the delimiter "\x00" : a small packet just the character "A" and a big packet "AAAAA..." 1300 times.

Interpretation

I notice that time measures can change between two same measures (the cpu load of the system is not constant) but the ratio "Diff perf %" is the same order of magnitude (about +700%) It seems the time increase between "A" and "A"*1300 rows depends of the difference of instruction fetch which is logical.

I also checked a delimiter with several bytes and it works correctly.

patch

static void _bytesUnpackDelim(CodeGen* cg, const UnpackArgs& args, const UnpackResult& result, bool skip)
{
    auto iter_type = builder::iterator::typeBytes();
    auto bytes_type = builder::reference::type(builder::bytes::type());

    auto delim = builder::codegen::create(bytes_type, args.arg);
    auto begin = builder::codegen::create(iter_type, args.begin);
    auto end = builder::codegen::create(iter_type, cg->llvmIterBytesEnd());

    auto block_body = cg->newBuilder("unpack-loop-body");
    auto block_exit = cg->newBuilder("unpack-loop-exit");
    auto block_insufficient = cg->newBuilder("unpack-loop-insufficient");

    // Copy the start iterator.
    cg->llvmCreateStore(args.begin, result.iter_ptr);
    cg->llvmCreateBr(block_body);

    cg->pushBuilder(block_body);

    auto cur = builder::codegen::create(iter_type, cg->builder()->CreateLoad(result.iter_ptr));

    CodeGen::expr_list params = { cur, delim };
    auto find_result  = cg->llvmCall("hlt::bytes_find_bytes_at_iter", params);
    auto next = cg->llvmTupleElement(iter_type, find_result, 1, false);
    cg->llvmCreateStore(next, result.iter_ptr);
    //auto success = cg->llvmTupleElement(??????,find_result, 0, false);

    cur = builder::codegen::create(iter_type, cg->builder()->CreateLoad(result.iter_ptr));

    params = {cur, end};
    auto done = cg->llvmCall("hlt::iterator_bytes_eq", params);
    cg->llvmCreateCondBr(done, block_insufficient, block_exit);
    cg->popBuilder();

    // Not found.
    cg->pushBuilder(block_insufficient);
    cg->llvmRaiseException("Hilti::WouldBlock", args.location);
    cg->popBuilder();

    // Loop exit.
    cg->pushBuilder(block_exit);

    if ( ! skip ) {
        // Get the string up to here.
        cur = builder::codegen::create(iter_type, cg->builder()->CreateLoad(result.iter_ptr));
        params = { begin, cur };
        auto match = cg->llvmCall("hlt::bytes_sub", params);
        cg->llvmCreateStore(match, result.value_ptr);
    }

    else {
        auto init_val = cg->llvmInitVal(bytes_type);
        cg->llvmCreateStore(init_val, result.value_ptr);
    }

    // Move beyond delimiter.
    params = { delim };
    auto l = cg->llvmCall("hlt::bytes_len", params);
    auto len = builder::codegen::create(builder::integer::type(64), l);
    cur = builder::codegen::create(iter_type, cg->builder()->CreateLoad(result.iter_ptr));
    params = { cur, len };
    next = cg->llvmCall("hlt::iterator_bytes_incr_by", params);
    cg->llvmCreateStore(next, result.iter_ptr);

    // Leave builder on stack.
}

It could be improved and be cleaned up:

rsmmr commented 8 years ago

Nice, thanks for measuring! Feel free to file a merge request. I'm a bit behind with with Spicy/HILTI stuff due to time constraints but will aim to take a closer look then.