qlibs / mph

C++20 [Minimal] Static Perfect Hash library
163 stars 8 forks source link

examples of backup policy where keys are greater than uint128_t #5

Open rmanaloto-tastytrade opened 2 months ago

rmanaloto-tastytrade commented 2 months ago

in the FAQ it mentions a backup policy/customization point can be setup for the lookup implementation. Do you have examples of this. Your template metaprogramming skills are way above mine, so I'm not quite sure how to pull it off. Seems like I need a specialization on the find/lookup method?

And there's a section in the FAQ also regarding setting the probability and cache size alignment. Do you have examples for this also?

This is currently, an example where i need the workaround (I'm working on binary encoder/decoder for the CME exchanges binary protocol (simple binary encoding/iLink3)

https://godbolt.org/z/GxTeabvdx

we chatted previously regarding your reflect library This is the second part of where i am going w my implementation. I am using the reflect library to get me the total number of enums. I will then pair that with another compile time map where I can override the to_string output based on different systems. I just hit this snag so I just isolated it down to where it is having compilation issues.

krzysztof-jusiak commented 2 months ago

Thanks for the info, the compilation issues are because the mask can't be found at compile-time for given enum strings as they are too long, there is a way to customize, but not on underlying element bases bur rather of entry set properties - https://godbolt.org/z/dnc4x8hTT.

Anyway, the way I'd approach this particular example is by just using last 8 characters and just grab as each message can be uniquely identified by that which can be checked at compile-time.

Long story short, the following is a simple example how it can be approached.

// get the short enum names

template<class T, std::size_t N = sizeof(std::uint64_t)> requires std::is_enum_v<T>
constexpr auto enumerators = [] {
  std::array<std::pair<std::string_view, std::underlying_type_t<T>>, reflect::enumerators<T>.size()> enumerators{};
  for (auto i = 0u; const auto& [enumerator, name] : reflect::enumerators<T>) {
    const auto enum_name = std::string_view{name};
    enumerators[i++] = std::pair{enum_name.substr(enum_name.size() - N), enumerator};
  }
  return enumerators;
}();

// input data

template<auto Size, std::size_t N = sizeof(std::uint64_t)> // may need different overloads depending on the input data
[[nodiscard]] constexpr auto data(const char (&str)[Size]) { return &str[Size - N - 1u]; }

// safe lookup

inline constexpr auto find = mph::find<enumerators<MessageIds>>;

static_assert(0 == *find(data("Negotiate500")));
static_assert(1 == *find(data("NegotiationResponse501")));
static_assert(2 == *find(data("NegotiationReject502")));
static_assert(3 == *find(data("Establish503")));
static_assert(4 == *find(data("EstablishmentAck504")));
static_assert(5 == *find(data("EstablishmentReject505")));
static_assert(6 == *find(data("Sequence506")));
static_assert(7 == *find(data("Terminate507")));
static_assert(8 == *find(data("RetransmitRequest508")));
static_assert(9 == *find(data("Retransmission509")));
static_assert(10 == *find(data("RetransmitReject510")));
static_assert(11 == *find(data("NotApplied513")));
static_assert(12 == *find(data("NewOrderSingle514")));
static_assert(13 == *find(data("OrderCancelReplaceRequest515")));
static_assert(14 == *find(data("OrderCancelRequest516")));
static_assert(15 == *find(data("MassQuote517")));
static_assert(16 == *find(data("PartyDetailsDefinitionRequest518")));
static_assert(17 == *find(data("PartyDetailsDefinitionRequestAck519")));
static_assert(18 == *find(data("BusinessReject521")));
static_assert(19 == *find(data("ExecutionReportNew522")));
static_assert(20 == *find(data("ExecutionReportReject523")));
static_assert(21 == *find(data("ExecutionReportElimination524")));
static_assert(22 == *find(data("ExecutionReportTradeOutright525")));
static_assert(23 == *find(data("ExecutionReportTradeSpread526")));
static_assert(24 == *find(data("ExecutionReportTradeSpreadLeg527")));
static_assert(25 == *find(data("QuoteCancel528")));
static_assert(26 == *find(data("OrderMassActionRequest529")));
static_assert(27 == *find(data("OrderMassStatusRequest530")));
static_assert(28 == *find(data("ExecutionReportModify531")));
static_assert(29 == *find(data("ExecutionReportStatus532")));
static_assert(30 == *find(data("OrderStatusRequest533")));
static_assert(31 == *find(data("ExecutionReportCancel534")));
static_assert(32 == *find(data("OrderCancelReject535")));
static_assert(33 == *find(data("OrderCancelReplaceReject536")));
static_assert(34 == *find(data("PartyDetailsListRequest537")));
static_assert(35 == *find(data("PartyDetailsListReport538")));
static_assert(36 == *find(data("ExecutionAck539")));
static_assert(37 == *find(data("RequestForQuote543")));
static_assert(38 == *find(data("NewOrderCross544")));
static_assert(39 == *find(data("MassQuoteAck545")));
static_assert(40 == *find(data("RequestForQuoteAck546")));
static_assert(41 == *find(data("ExecutionReportTradeAddendumOutright548")));
static_assert(42 == *find(data("ExecutionReportTradeAddendumSpread549")));
static_assert(43 == *find(data("ExecutionReportTradeAddendumSpreadLeg550")));
static_assert(44 == *find(data("SecurityDefinitionRequest560")));
static_assert(45 == *find(data("SecurityDefinitionResponse561")));
static_assert(46 == *find(data("OrderMassActionReport562")));
static_assert(47 == *find(data("QuoteCancelAck563")));
static_assert(48 == *find(data("ExecutionReportPendingCancel564")));
static_assert(49 == *find(data("ExecutionReportPendingReplace565")));

Produced assembly

$find(unsigned long):                              # @"$find(unsigned long)"
        movabs  rax, 434315889064543778
        pext    rax, rdi, rax
        shl     eax, 4
        lea     rcx, [rip + mph::v5_0_1::type_traits::constant_t<mph::v5_0_1::find$pext<enumerators<MessageIds, 8ul>>>::value]
        cmp     qword ptr [rax + rcx], rdi
        sete    al
        ret
find:
    /// lookup table

That can be further optimized as I think that last 4 characters are enough in such case. The data call may need specialization on the input used. If only valid keys (only enum names) will be used that can be optimized further with probability 100 for find or with lookup. If input strings are string literals pointers could be used instead for comparison but it depends on the use case.

So the lookup (unsafe) would look as follows:

$lookup(unsigned long):                            # @"$lookup(unsigned long)"
        movabs  rax, 434315889064543778
        pext    rax, rdi, rax
        lea     rcx, [rip + lookup]
        movzx   eax, byte ptr [rax + rcx]
        ret
lookup:
        .ascii  "\000\000\000\000\000\000\000\t\000\000\b\000\000\000\001\000\000\000\000\000\000\000\n\000\000\000\020\021\000\000\000\000\000\000\000\000\000\000\022\000\000\000\032\000\000\000\031\000\000\000#\000\000\000\000\000\000\000\033\034\000\000\000$\000\000\000)\000\000*\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000+\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000,\000\000\000-\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\002\000\000\000\000\003\000\000\000\000\000\000\000\000\000\000\000\000\000\000\013\000\000\000\000\000\000\000\000\000\000\000\024\000\000\000\000\023\000\000\000\000\000\000\035\000\000\000\000\000\000\000\036\000\000\000\000\000\000\000\000\000\000\000\000\000%\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000.\000\000\000\000\000\000\000/\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\005\000\000\000\000\004\000\000\000\000\000\000\000\000\000\000\000\000\000\000\r\000\000\000\f\000\000\000\000\026\000\000\000\000\000\000\025\000\000\000\000\000\000\000\000\000\000\000 \000\000\000\000\000\000\000\037\000\000\000\000\000\000&\000\000\000\000'\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\0001\000\000\0000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\006\000\000\000\007\000\000\000\000\000\000\000\000\000\017\000\016\000\000\000\000\000\000\000\000\000\000\000\027\000\000\000\030\000\000\000\000\000\000\000\000\000\000\000!\000\000\000\"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000("

Full example - https://godbolt.org/z/rrbfMrY66

If you are comfortable with last 4 characters that will make the lookup tables a bit smaller but it's also a bit more unsafe - https://godbolt.org/z/vsbTqaKoK

The following will the x86-64 assembly for gcc

$find(unsigned int):
        mov     edx, 252116992
        pext    edx, edi, edx
        mov     edx, edx
        movzx   eax, BYTE PTR mph::v5_0_1::type_traits::constant_t<mph::v5_0_1::find$pext<enumerators<MessageIds, 4ul> > >::value[4+rdx*8]
        cmp     DWORD PTR mph::v5_0_1::type_traits::constant_t<mph::v5_0_1::find$pext<enumerators<MessageIds, 4ul> > >::value[0+rdx*8], edi
        mov     edx, 0
        cmovne  eax, edx
        ret
$lookup(unsigned int):
        mov     eax, 252116992
        pext    edi, edi, eax
        mov     edi, edi
        movzx   eax, BYTE PTR lookup[rdi]
        ret

Full example - https://godbolt.org/z/83vGhY7M8

rmanaloto-tastytrade commented 2 months ago

Thank you very much for the quick response. I'm reviewing this now and will give it a shot later today. I'll send back an update afterwards.

Have you considered exposing a runtime version of this library? It seems like most of the functionality is there to support a runtime version (with maybe a slightly slower performance). One use case for a runtime version for trading is where the universe of securities to trade on an exchange/venue is not known at compile time, but at system startup. The full security list is usually loaded up via a config file or loading it up via a market data feed at the start of the session.

So having a universal api to handle both use cases would be super useful. We do the runtime mph for the security list using the cmph library. It works, but that library is pretty old and being able to take advantage of your pext logic for small strings (equity symbols should satisfy the uint128_t requirement) would help optimize our lookups.