mySYSMON closed 3 years ago
Hi @mySYSMON,
The PGM-index is made for integers that fit into a computer word. So I’m not sure it’s the right tool for your problem.
In theory, you could:
Map the domains to uint128_t integers, which works for a prefix of up to ⌊log_38(2^128)⌋ = 24 characters of a domain.*
Then, at query time, you would:
But this could be a contrived solution for your problem. For long strings, you may want to use a hash table or a trie instead, depending on the kind of searches you want to do. E.g., if they are exact matches, then use a hash table (e.g. Python’s set or C++’s std::unordered_set).
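As a rough illustration of the two alternatives, here is a minimal Python sketch. The sample domains and the helper names `build_trie` and `has_prefix` are made up for this example, and a dict-based trie like this one would likely use too much memory for 125 million domains; it only shows the idea.

```python
# Hypothetical sample data standing in for the real domain list.
domains = {"example.com", "example.org", "pgm.di.unipi.it"}

# Exact matches: a hash set gives average O(1) membership tests.
print("example.com" in domains)   # True
print("example.net" in domains)   # False


def build_trie(words):
    """Build a nested-dict trie; each key is one character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker ("$" never occurs in a domain)
    return root


def has_prefix(trie, prefix):
    """Return True if at least one stored word starts with `prefix`."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_trie(domains)
print(has_prefix(trie, "exam"))   # True
print(has_prefix(trie, "zzz"))    # False
```

If only exact lookups are needed, the set alone is enough; the trie is worth its extra memory only for prefix-style searches.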
Hope this helps.
*I’m assuming standard domains (and not IDNs) that use ASCII letters, digits, hyphens and dots, so an alphabet of 38 characters.
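For what it’s worth, the integer mapping could be sketched like this in Python, whose integers are arbitrary precision. The names `domain_to_key` and `split_key` are hypothetical, the alphabet order is my own choice (ASCII order, so that integer order follows string order), and padding short domains with the smallest digit is an assumption too. Keys cover only the first 24 characters, so two domains sharing a 24-character prefix get the same key and have to be told apart by comparing the full strings.

```python
import string

# Assumed 38-character alphabet from the footnote, in ASCII order.
ALPHABET = "-." + string.digits + string.ascii_lowercase
BASE = len(ALPHABET)   # 38
MAX_CHARS = 24         # 38**24 < 2**128 <= 38**25
DIGIT = {ch: i for i, ch in enumerate(ALPHABET)}


def domain_to_key(domain: str) -> int:
    """Encode the first MAX_CHARS characters of a domain as a base-38 integer."""
    prefix = domain.lower()[:MAX_CHARS]
    key = 0
    for ch in prefix:
        # Raises KeyError for characters outside the assumed alphabet (e.g. IDNs).
        key = key * BASE + DIGIT[ch]
    # Right-pad short domains with the smallest digit so every key has 24 digits.
    return key * BASE ** (MAX_CHARS - len(prefix))


def split_key(key: int) -> tuple[int, int]:
    """Split a 128-bit key into (high, low) 64-bit halves."""
    return key >> 64, key & (2**64 - 1)


key = domain_to_key("example.com")
print(key < 2**128)                          # True
print(key < domain_to_key("example.org"))    # True
```

The `split_key` helper is only there in case the keys have to be stored in a system limited to 64-bit integers; comparing the (high, low) pairs lexicographically gives the same order as comparing the full keys.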
great answer, thanks!
My pleasure 😊
I tried to convert my 125 million domains to a unique set of integers, but the integer values exceeded the maximum for 64-bit ints. Does anyone know a way to solve this? Maybe something obvious I am not seeing.