Closed. ndvbd closed this issue 5 years ago.
Hi. The data structures storing the weights as counts support the lookup operation that, given an n-gram, returns its associated weight. The data structures storing the weights as prob/backoff support, instead, the score operation. See details in the README (Benchmark section) and in the relevant paper. Currently, top-k queries are not supported and there is no Python binding.
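For illustration, a minimal, self-contained sketch of the two query types described above. It uses plain `std::map` stand-ins, not the tongrams data structures, so the names and values here are made up:

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

int main() {
    // Count-based model: supports lookup(ngram) -> count.
    std::map<std::string, uint64_t> counts = {
        {"the cat", 42}, {"the dog", 17}
    };

    // Prob/backoff model: supports score(ngram) -> log10 probability
    // (plus a backoff weight, omitted here for brevity).
    std::map<std::string, double> logprobs = {
        {"the cat", -1.2}, {"the dog", -1.8}
    };

    std::cout << "lookup(\"the cat\") = " << counts["the cat"] << "\n";
    std::cout << "score(\"the cat\")  = " << logprobs["the cat"] << "\n";
    return 0;
}
```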
Thanks @jermp, how difficult would it be to modify things to support some top-k? The k doesn't have to be unbounded. For example, if I have an n-gram prefix, finding the best 5 or 10 words right after the prefix would be highly useful.
It is not difficult, but currently the library does not support it yet. If you don't care about the efficiency of the operation, one option is to proceed by scan: retrieve all children of a given prefix and just select the top-k. For most queries, this will probably be efficient too, because only a few children need to be examined. A more complicated solution could use additional RMQ data structures.
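A sketch of the "scan then select" idea, in C++. Note that `children()` is a hypothetical enumeration of the continuations of a prefix (the library does not expose such a method yet); it is stubbed here with fixed data so the example runs:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

using continuation = std::pair<std::string, uint64_t>; // (next word, count)

// Placeholder for the (not yet implemented) children enumeration.
std::vector<continuation> children(const std::string& /*prefix*/) {
    return {{"cat", 42}, {"dog", 17}, {"mat", 5}, {"hat", 33}};
}

// Scan all continuations of `prefix` and keep the k heaviest ones.
std::vector<continuation> top_k_continuations(const std::string& prefix,
                                              std::size_t k) {
    std::vector<continuation> cands = children(prefix);
    k = std::min(k, cands.size());
    // Move the k largest counts to the front, in decreasing order of count.
    std::partial_sort(cands.begin(), cands.begin() + k, cands.end(),
                      [](const continuation& a, const continuation& b) {
                          return a.second > b.second;
                      });
    cands.resize(k);
    return cands;
}

int main() {
    for (const auto& [word, count] : top_k_continuations("the", 2))
        std::cout << word << " " << count << "\n";
}
```

The cost is linear in the number of children of the prefix, which is usually small, rather than O(k) as it would be with weight-sorted children or an RMQ structure.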
Thanks @jermp, I see the lookup() operation in trie_count_lm.h, but what's the right way to scan all children of a given prefix?
Currently, there is no method that allows you to do this, but I can implement one.
That would be fantastic.
Hi! Can you list what operations this data structure provides? Is the data structure internally sorted by the weights, so we can easily, in O(k), find the top-k next tokens of an n-gram prefix? Can this data structure be used from Python?