triton-inference-server / fil_backend

FIL backend for the Triton Inference Server

Add experimental CPU optimizations #203

Closed · wphicks closed this 2 years ago

wphicks commented 2 years ago

This PR introduces a new CPU execution mode (Herring) that significantly improves both throughput and latency on CPU, especially in low-latency scenarios. This mode works by translating trees into a more cache-friendly format, minimizing branching during evaluation, and aligning the data accessed by individual threads to cache line boundaries. Between 4 and 8 nodes fit in a cache line, and one child of each node is stored immediately after it in memory to increase the likelihood that a traversal step reads parent and child from the same cache line.
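To make that layout concrete, here is a minimal sketch of the kind of node structure and traversal loop described above. It is not the actual Herring source: the struct name, field names, and leaf encoding (a zero child offset) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// With 64-byte cache lines, an 8-byte node gives 8 nodes per line and a
// 16-byte node gives 4, consistent with the "between 4 and 8" figure above.
template <typename value_t, typename feature_t, typename offset_t>
struct herring_node {       // hypothetical name
  value_t threshold;        // split threshold; reused as the output at leaves
  feature_t feature;        // index of the feature tested at this node
  offset_t distant_child;   // offset to the child NOT stored adjacently;
                            // 0 marks a leaf (an assumed encoding)
};

// One child always sits at index i + 1, directly after its parent, so one
// of the two traversal outcomes tends to stay in the already-loaded line.
template <typename node_t, typename io_t>
io_t evaluate_tree(std::vector<node_t> const& nodes, io_t const* row) {
  std::size_t i = 0;
  while (nodes[i].distant_child != 0) {
    // Compute the next index from the comparison result rather than
    // branching on it; compilers typically emit a conditional move here.
    i += (row[nodes[i].feature] < nodes[i].threshold)
             ? 1
             : static_cast<std::size_t>(nodes[i].distant_child);
  }
  return static_cast<io_t>(nodes[i].threshold);
}
```

Instantiating the node with, say, `float`/`uint16_t`/`uint16_t` yields an 8-byte node (8 per cache line), while a `double` threshold pads it to 16 bytes (4 per line), which is where the 4-to-8 range comes from.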

Because benchmarking was a key element of this work, the benchmarking scripts were also updated to make their output more relevant and easier to interpret.

Probably the most difficult element of this PR to review will be the mechanism that replaces Treelite's enum-based type dispatch with a std::variant over the many possible type combinations used by the new execution mode.

The short explanation is that, during conversion of Treelite trees, Herring determines the smallest possible type for some tree node attributes and defers to Treelite's own type dispatch system for the rest. The converted model is returned as a std::variant, and std::visit is used at the outermost level of predict calls to run inference on the correct alternative. Further explanation is provided in inline comments.
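As a rough illustration of that pattern (not the PR's actual code: `forest`, `forest_variant`, and `predict_row` are hypothetical names, and the real variant covers far more type combinations), the dispatch looks something like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <variant>
#include <vector>

// A forest converted to concrete threshold/output types at load time.
template <typename threshold_t, typename output_t>
struct forest {
  std::vector<threshold_t> thresholds;  // stand-in for the converted trees
  output_t predict_row(float const* row) const {
    // Placeholder evaluation; the real version walks the converted trees.
    return static_cast<output_t>(row[0] < thresholds[0] ? 0 : 1);
  }
};

// One variant alternative per supported type combination.
using forest_variant = std::variant<
    forest<float, float>, forest<float, std::uint32_t>,
    forest<double, double>, forest<double, std::uint32_t>>;

// std::visit dispatches once at the outermost predict call; everything
// inside the lambda is fully typed, so the per-row loop pays no further
// dispatch cost.
inline void predict(forest_variant const& model, float const* data,
                    std::size_t n_rows, std::size_t n_cols, float* out) {
  std::visit(
      [&](auto const& f) {
        for (std::size_t i = 0; i < n_rows; ++i) {
          out[i] = static_cast<float>(f.predict_row(data + i * n_cols));
        }
      },
      model);
}
```

The conversion step that chooses the smallest viable types constructs one of these alternatives, so callers only ever hold the variant. Compared with enum-driven dispatch on every call, this pays the dispatch cost once per predict call and lets the compiler fully inline the typed inner loop.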

Herring does not currently support categorical features; that support is left for a follow-on PR. Similarly, Herring currently accepts and returns only floats and evaluates models in the "native" precision indicated by Treelite. Future PRs will allow double-precision I/O and will allow threshold values to be cast to lower precision for speed.

An example of the performance improvement provided by Herring for a specific model is attached. The drop in throughput performance for both GTIL and Herring at the end of the curve is an artifact of discretization in the benchmarking scripts and can be disregarded. Extending the range and granularity of the benchmarking parameter sweep would cause both curves to plateau near the same value for high enough latency.

[Attached image: final_comparison, throughput vs. latency curves for GTIL and Herring]

wphicks commented 2 years ago

For reasons that are unclear to me, using OpenMP_CXX_LIB_NAMES led to a marked performance degradation. For now, I have reverted that change, and I'll work to understand the root cause later on.