gvinciguerra / PGM-index

🏅State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes
https://pgm.di.unipi.it/
Apache License 2.0
775 stars 91 forks source link

Recommendations for large list of strings? #8

Closed mySYSMON closed 3 years ago

mySYSMON commented 3 years ago

I tried to convert my 125 million domains to a unique set of integers but the integer values exceeded the max for 64bit ints. Does anyone know a way to solve this? Maybe something obvious I am not seeing.

gvinciguerra commented 3 years ago

Hi @mySYSMON,

The PGM-index is made for integers that fit into a computer word. So I’m not sure it’s the right tool for your problem.

In theory, you could:

  1. Store the domains into a sorted array A.
  2. Create a mapping from strings to uint128_t integers that works for a prefix of up to log_38(2^128) = 24 characters of a domain.*
  3. Build the PGM-index on the integers returned by the mapping above applied to the strings in A truncated to 24 characters.

Then, at query time, you would:

  1. Use PGM-index to get the starting position P into A of the domains that share a prefix (of up to 24 characters) with the given domain.
  2. Scan A (or better, use an exponential search) starting from the position P to find the domain you are looking for.

But this is could be a contrived solution for your problem. For long strings, you may want to use a hash table or a trie instead, depending on the kind of searches you want to do. E.g. If they are exact matches then use a hash table (e.g. Python’s set or C++’s std::unordered_set).

Hope this helps.

*I’m assuming standard domains (and not IDNs) that use ASCII letters, digits, hyphens and dots, so an alphabet of 38 characters.

mySYSMON commented 3 years ago

great answer, thanks!

gvinciguerra commented 3 years ago

My pleasure 😊