universal-automata / liblevenshtein

Various utilities regarding Levenshtein transducers.
https://github.com/universal-automata/liblevenshtein
MIT License
67 stars 13 forks source link

loading very large dictionaries #31

Closed remkoboschker closed 7 years ago

remkoboschker commented 8 years ago

Hi,

I would like to use this module to find spelling alternates for strings. My dictionary has about 4 million words in it. Building the dawg takes a long time and I would like to store it in something like redis. Could you provide any advice? Does the whole graph need to be loaded in memory or can I retrieve nodes or part of the graph as alternates are accepted?

I'm using the javascript library. Perhaps any Java functionality needs porting to facilitate it?

dylon commented 7 years ago

Hi @remkoboschker,

  1. "Building the dawg takes a long time"
    • You are probably building the DAWG from an unsorted list, or are not specifying that the list has already been sorted. As described by the usage documentation, unless you specify otherwise the dictionary is assumed unsorted and will be sorted before constructing the DAWG.
    • If you pre-sort the list (lexicographically, case-sensitive, in ascending order) then you should pass true as the second parameter to liblevenshtein.Builder#dictionary(Array, Boolean).
  2. "I would like to store it in something like redis"

    • The library does not support this directly, but you could use the low-level constructor for building a DAWG, pass the DAWG as the first parameter to liblevenshtein.Builder#dictionary(Array, Boolean), and store the DAWG in Redis for reuse. When you need the transducer again, you may retrieve the DAWG from Redis and pass it directly to liblevenshtein.Builder#dictionary(Array, Boolean).

      • Before you construct the DAWG, be sure you sort the list lexicographically in case-sensitive, ascending order.
      • Assuming your sorted list of terms is called list, pass list as the singleton parameter to the constructor liblevenshtein.Dawg(Array), like so:
      var dawg = new liblevenshtein.Dawg(list); // or retrieve it from Redis
      var builder = new levenshtein.Builder().dictionary(dawg);
      var transducer = builder.transducer();
  3. "can I retrieve nodes or part of the graph as alternates are accepted?"
    • This case is not supported right now. You wouldn't want to use this approach unless you are dealing with larger-than-memory dictionaries, as it will seriously affect your performance (network latency + network hops + server overhead + database retrieval time, etc.)
    • I will support this eventually.
remkoboschker commented 7 years ago

Thank you for your replies. I'll get the list sorted, store the dawg and pass it to the constructor for faster setup. Does the java lib support updating the dawg incrementally as mentioned in the #13 and #14 3.0 milestone enhancements?

dylon commented 7 years ago

Not yet, but that will be one of the next features I add.