jermp / tongrams

A C++ library providing fast language model queries in compressed space.
MIT License
128 stars 20 forks source link

Using Tongrams #1

Closed ndvbd closed 5 years ago

ndvbd commented 5 years ago

Hi! Can you list what operation this data structure provide? Is the data structure internally sorted by the weights, so we can easily, in O(k), find the k-top next tokens of an ngram prefix? Can this data structure be used from python?

jermp commented 5 years ago

Hi. The data structures storing the weights as counts support the lookup operation that, given a ngram, returns its associated weight. The data structures storing the weights as prob/backoff support, instead, the score operation. See details in the README (Benchmark section) and in the relevant paper. Currently, top-k queries are not supported and there is no python binding.

ndvbd commented 5 years ago

Thanks @jermp, how difficult would it be to modify things to support some top-k? The k doesn't have to be unlimited. For example if I have an ngram-prefix, finding the best 5 or 10 words right after the prefix would be highly useful.

jermp commented 5 years ago

It is not difficult, but currently the library does not support it yet. If you don't care about the efficiency of the operation, one option is to proceed by scan: retrieve all children of a given prefix and just select the top-k. For most queries, it will probably be efficiency too because of the few children to be examined. A more complicated solution could use additional RMQ data structures.

ndvbd commented 5 years ago

Thanks @jermp, I see the lookup() operation in trie_count_lm.h, but what's the right way to scan all children of a given prefix?

jermp commented 5 years ago

Currently, there is no method that allows you to do this but I can implement that.

ndvbd commented 5 years ago

That would be fantastic.