qlibs / mph

C++20 [Minimal] Static Perfect Hash library
176 stars 9 forks source link

error when the first eight bytes of the key values are the same. #3

Closed liziyi75 closed 3 months ago

liziyi75 commented 5 months ago

https://godbolt.org/z/5xq33Y1T4

kris-jusiak commented 5 months ago

Thanks for the repro @liziyi75. Long story short it's backend problem. The front end handles keys properly however, the mask, ATM, only supports 64 bits as there is no pext128 and therefore backend pext implementation has to do some additional work to make that work. Not sure what the best solution here yet as there are many options so it needs some more digging but it should be possible with minimal overhead and some compile-time work.

Also, notice that the issue is only triggered if the mask requires all 64 bits to find a unique set before (not) noticing that a single 65 bit would be enough in such case.

So the following would be okay as keys do differ before the 64 bit.

"01234567890"
"012340678"
kris-jusiak commented 5 months ago

After giving it some thought I think it can be done with avx512 bmi2 which I think would also allow to have up to 64 characters, though I doubt constexpr abstract machine would handle 512 bits in reasonable time. So in short, pext32, pext64, pext.avx512.bmi2.compress, ... Don't have a proof of concept yet, but I'm pretty sure that's possible. There are also some options for ARM's vectorization.

liziyi75 commented 5 months ago

Thanks for providing the excellent library, it looks very close to perfection.

kris-jusiak commented 5 months ago

Thanks for suggestions. Indeed index is a commonly used approach. mph supports that (when probability is set to 100) which won't do the cmp against the input key and just return a likely index. Getting minimal perfect hashing index is the default but it could be not minimal too but the range could be big depending on the computed mask. Agree, the users can always just hash into u64 and compare the outcome which is a supported workflow with mph too. Though, there are always performance trade-offs to be made in such case, therefore mph support both approaches. The fastest way is to get the key and value in a single lookup but that's not always the best based on the whole app structure so different approaches needs supporting, IMHO. The other thing is that, one can put actual object and/or function (with musttail) for the key instead of an index to avoid double lookups too, so it depends. I really like the idea of simplicity and small APIs but also the max performance based on the use case whether it's dispatching, keyword matching, order book handling, enum printing, whatever, it seems like different approach is required. But very interesting in exploring flexible ideas and index based approach is defo workflow to be always supported by mph.

Some examples of supported workflows:

index for ints from strings - https://godbolt.org/z/f8354MGn8 index for ints from ints - https://godbolt.org/z/zhbe89rdY enum_to_string - https://godbolt.org/z/sx1MKhz9G string_to_enum - https://godbolt.org/z/Kfff75x18

liziyi75 commented 5 months ago

I completely agree with your point, and I also appreciate simplicity and efficiency, which is why I have been following you. In fact, I am a heavy user of your sml/sml2 library, and I am very grateful for the concise and fast experience it provides, including running and compiling speeds.

I see a lot of good ideas in mph, and I definitely hope to be able to find what I need in one lookup. However, there may be some cases where it is difficult to achieve this. mph::lookup has many limitations for non-type template parameters. For example, it cannot use variables as values. Additionally, there are errors when the value is a string_view or a function pointer: https://godbolt.org/z/MM3hn8Ws6 https://godbolt.org/z/hqhxGTPo6

kris-jusiak commented 5 months ago

I totally agree and I think we are on the same page but I'm not sure what is being proposed here and how to approach it? Under the hood keys are always integers (there is mapping function - https://github.com/boost-ext/mph/blob/main/mph#L222 - which converts any input keys to int32 or int64), values are whatever has been passed. Obviously, that can be done purely with integers as well - which is supported by mph - but the tricky part is that converting string like keys to underlying integers is not trivial to make fast, therefore mph provides functionality for that. What improvement could we make to make it better?

Examples mentioned are working (they were working on clang but there was issue with gcc when passing to optional)

liziyi75 commented 5 months ago

My suggestion is to provide an overloading mechanism for special types provided by users. For example, a user overloads the mph::detail::to function for a long string key value, the user may call the existing mph::detail::to function twice and form the key value by xoring the two results. The user is responsible for ensuring the uniqueness of the key value.

kris-jusiak commented 5 months ago

I see, thanks for clarifying, it makes sense to me. Will expose needed customization points for that to work, thanks for suggesting that. The lookup itself can be already can be overloaded in case there is a better solution available for given keys - https://godbolt.org/z/jnvb7xaPW but sometimes is better to handle the keys differently so agree on that. Will also move required functions like to to the main namespace to avoid users messing with details. Thanks.

kris-jusiak commented 5 months ago

mph::to is now exposed and can be overloaded for specific cases. Both: integer or string values can be passed to the lookup call and required conversion will happen if needed. Added an example of to customization point - https://godbolt.org/z/Evhc57dGK.

liziyi75 commented 5 months ago

Thank you very much! Additionally, if you could add another interface: mph::lookup(auto key, auto& values_array)), to support looking up variable values, then all of my current use cases would be covered. Thanks again!

kris-jusiak commented 5 months ago

Any particular reason not to use the index instead as shown below? It's to avoid the lookup? If so, there is also a way for lookup to return reference as that the keys can be modified but it's a bit odd as it's constexpr.

int main(int, const char** argv) {
  static constexpr std::array keys{
    "foo"sv, "bar"sv,
  };

  std::array values {
    1, 2
  };

  values[*mph::lookup<keys>(std::string_view{argv[1]})]++;

  return values[0] + values[1];
}

https://godbolt.org/z/oqcKeTPKn

For branchless usage with potentially not found keys

int main(int, const char** argv) {
  static constexpr std::array keys{
    std::pair{"foo"sv, 1}, std::pair{"bar"sv, 2},
  };

  std::array values {
    0/*not found case*/, 1, 2,
  };

  values[*mph::lookup<keys>(std::string_view{argv[1]})]++;

  return values[1] + values[2];
}

https://godbolt.org/z/GnEPe74qM

liziyi75 commented 5 months ago

I prefer the following:

bool lookup_map<auto kv_array>(auto key, [](auto value){...}) or
bool lookup_map<auto keys_array>(auto key, auto& values_array, [](auto value){...})

This can clearly express the logical relationships.

liziyi75 commented 5 months ago

After giving it some thought I think it can be done with avx512 bmi2 which I think would also allow to have up to 64 characters, though I doubt constexpr abstract machine would handle 512 bits in reasonable time. So in short, pext32, pext64, pext.avx512.bmi2.compress, ... Don't have a proof of concept yet, but I'm pretty sure that's possible. There are also some options for ARM's vectorization.

Implementing a long string as a key may seem challenging, but I am still looking forward to seeing you achieve it. This way, we can confidently use long strings as keys without worrying about any issues. Sorry for my poor english.