Use __hlt_bytes_find_bytes to translate '&until' spicy attribute ?

Hello !

I am currently testing CPU performance of parsers and i think found an improvement for &unti spicy attribute.

Here a representative parser of my test :

module Foo;

export type Test = unit {
  data : bytes &until("\x00");
};

I notice the processing time is quite sensitive (nearly x2) to the payload size because of &until("\x00") when "\x00" delimiter is not present.

I see that the &until attribute is translated into unpack (data, cur) Hilti::Packed::BytesDelim "\x00" which is transformed into a llvm loop composed of llvm basic block. https://github.com/rsmmr/hilti/blob/master/hilti/codegen/unpacker.cc#L134-L210

At each loop step tree functions are called:

hlt_iterator_bytes_eq
hlt_bytes_match_at
hlt_iterator_bytes_incr

The pseudo code is :

while 1 {
  if ( hlt_iterator_bytes_eq(cur, END_BYTE))
     //  => raise Hilti::WouldBloc
     break;
  else{
    if hlt_bytes_match_at(cur, delim)
      // => Find !
      break;
    else
      hlt_iterator_bytes_incr(cur) # cur = cur+1
  }
}

Even through these functions are "inlines", the loop assembly is quite heavy to just search a string pattern in a bytes.

I saw that hlt_bytes_find_byte_from [EDIT : `hlt_bytes_find_bytes] exists to search a string pattern, it is written in C and usememchrof libc to search a char.memch` seems to be optimized for performance (better than a simple loop on each char) https://github.com/rsmmr/hilti/blob/master/libhilti/bytes.c#L1019-L1029 http://code.metager.de/source/xref/gnu/glibc/string/memchr.c

Before trying to use __hlt_bytes_find_bytes for &until i would like to know if there is maybe a reason that it is not used in the unpacker.cc ?

Yeah, quite heavy for sure in this case. Iirc, the main reason for this is supporting searching for more than a single byte. __hlt_bytes_find_byte_from looks only for a single byte. If the code generator can guarantee that the delimiter is indeed just exactly one byte, switching over should be fine (i.e., it needs to be a byte constant of length 1).

More generally, there are plenty cases in the code generator right now where it picks the most generic/easy path but could optimize further for special cases where it knows more. There's also a bit handwaving that LLVM's optimization might take care of some stuff which at first appears inefficient (might be interesting to look at the actual LLVM bit code generated here after optimization).

One more note: are you confident it's indeed the searching that's expensive? The data : bytes &until("\x00"); construct is expensive is a different way too: it needs to buffer al the data for later access. If's the matching input is long, that can be significant. A way to avoid that is using &transient, which doesn't store the whole thing but still gives access to chunks as they come in.

I did a mistake in my previous post, in fact it is hlt_bytes_find_bytes (or hlt_bytes_find_bytes_at_iter to use with a byte iterator) and not __hlt_bytes_find_byte_from. It seems they can be used with bytes pattern. https://github.com/rsmmr/hilti/blob/master/libhilti/bytes.h#L192-L213

Effectively for LLVM optimization, I did not check how it optimizes it. In my case I just did a callgrind analysis and compared the difference of instruction fetches between a small payload and a big one (A simple TFTP request with no option (delimited by \x00) and a big one with an option but without \x00). I will try the &transient to see the difference and also check if it is really the searching which is expensive.

Today I have started to implement a quick patch to use hlt_bytes_find_bytes, for the moment it works with a hlt script but I did not success yet to test it with a spicy analyzer (need to convert a hilti string into byte before calling the function). I am not very confortable with llvm code generation ^^

I will give you more information about results and perform test comparisons.

Hello Robin,

I have a first version patch using hlt_bytes_find_bytes :) ! Here my perfomance results:

For this test I wrote a simple analyzer:

module FOO;

export type Test = unit {
  data: bytes &until(b"\x00");
};

	"A"	"A"*1300	Diff perf %	IR "A"	IR "A"*1300	Diff IR %
master	11135.73869	98344.34831	783.1416673	34766	311453	795.8551458
master &transient	10924.07524	91367.08665	736.3828029	32490	309177	851.6066482
hlt_bytes_find_bytes	10365.35155	10746.25974	3.674821712	37305	37690	1.03203324
hlt_bytes_find_bytes &transient	9675.547945	9716.534161	0.4236061515	35033	35418	1.098963834

I perfomed 4 differents tests:

master : The current code
master &transient : The current code + add &transient to the Test.data field
hlt_bytes_find_bytes : _bytesUnpackDelim patched
hlt_bytes_find_bytes &transient : _bytesUnpackDelim patched + add &transient to the Test.data field

Time measure

The two first rows contain the time in nanosecond. It is an average on about 1000 packets. I measured it by getting the time before and after the parsing call function: void* pboj =(*endp->parser->parse_func)(endp->data, &endp->cookie, &except, ctx)

Instruction fetches

The "IR" rows contain the number of instruction fetches regarding the ".hlt.foo_parse_Test_internal" that executes protocol decoding. It given by kcachegrind with a valgrind record. When I measure "IR" I use only one packet because the result is identical for a same packet (same execution path)

Payloads I sent were just "A" string without the delimiter "\x00" : a small packet just the character "A" and a big packet "AAAAA..." 1300 times.

Interpretation

I notice that time measures can change between two same measures (the cpu load of the system is not constant) but the ratio "Diff perf %" is the same order of magnitude (about +700%) It seems the time increase between "A" and "A"*1300 rows depends of the difference of instruction fetch which is logical.

I also checked a delimiter with several bytes and it works correctly.

patch

static void _bytesUnpackDelim(CodeGen* cg, const UnpackArgs& args, const UnpackResult& result, bool skip)
{
    auto iter_type = builder::iterator::typeBytes();
    auto bytes_type = builder::reference::type(builder::bytes::type());

    auto delim = builder::codegen::create(bytes_type, args.arg);
    auto begin = builder::codegen::create(iter_type, args.begin);
    auto end = builder::codegen::create(iter_type, cg->llvmIterBytesEnd());

    auto block_body = cg->newBuilder("unpack-loop-body");
    auto block_exit = cg->newBuilder("unpack-loop-exit");
    auto block_insufficient = cg->newBuilder("unpack-loop-insufficient");

    // Copy the start iterator.
    cg->llvmCreateStore(args.begin, result.iter_ptr);
    cg->llvmCreateBr(block_body);

    cg->pushBuilder(block_body);

    auto cur = builder::codegen::create(iter_type, cg->builder()->CreateLoad(result.iter_ptr));

    CodeGen::expr_list params = { cur, delim };
    auto find_result  = cg->llvmCall("hlt::bytes_find_bytes_at_iter", params);
    auto next = cg->llvmTupleElement(iter_type, find_result, 1, false);
    cg->llvmCreateStore(next, result.iter_ptr);
    //auto success = cg->llvmTupleElement(??????,find_result, 0, false);

    cur = builder::codegen::create(iter_type, cg->builder()->CreateLoad(result.iter_ptr));

    params = {cur, end};
    auto done = cg->llvmCall("hlt::iterator_bytes_eq", params);
    cg->llvmCreateCondBr(done, block_insufficient, block_exit);
    cg->popBuilder();

    // Not found.
    cg->pushBuilder(block_insufficient);
    cg->llvmRaiseException("Hilti::WouldBlock", args.location);
    cg->popBuilder();

    // Loop exit.
    cg->pushBuilder(block_exit);

    if ( ! skip ) {
        // Get the string up to here.
        cur = builder::codegen::create(iter_type, cg->builder()->CreateLoad(result.iter_ptr));
        params = { begin, cur };
        auto match = cg->llvmCall("hlt::bytes_sub", params);
        cg->llvmCreateStore(match, result.value_ptr);
    }

    else {
        auto init_val = cg->llvmInitVal(bytes_type);
        cg->llvmCreateStore(init_val, result.value_ptr);
    }

    // Move beyond delimiter.
    params = { delim };
    auto l = cg->llvmCall("hlt::bytes_len", params);
    auto len = builder::codegen::create(builder::integer::type(64), l);
    cur = builder::codegen::create(iter_type, cg->builder()->CreateLoad(result.iter_ptr));
    params = { cur, len };
    next = cg->llvmCall("hlt::iterator_bytes_incr_by", params);
    cg->llvmCreateStore(next, result.iter_ptr);

    // Leave builder on stack.
}

It could be improved and be cleaned up:

remove the llvm block_body if possible
Get if succeed directly with cg->llvmTupleElement instead of of call hlt::iterator_bytes_eq which is useless because hlt::bytes_find_bytes_at_iter returns a struct containing if the pattern was found. I did not found the hilti::Type to use in llvmTupleElement. https://github.com/rsmmr/hilti/blob/master/libhilti/bytes.h#L65-L68

Nice, thanks for measuring! Feel free to file a merge request. I'm a bit behind with with Spicy/HILTI stuff due to time constraints but will aim to take a closer look then.

rsmmr / hilti