renepickhardt / generalized-language-modeling-toolkit

Generalized Language Modeling toolkit
http://glm.rene-pickhardt.de

indexing the ngrams #6

Open renepickhardt opened 10 years ago

renepickhardt commented 10 years ago

It might be interesting to index the n-grams in this toolkit already, using FSTs or trie-based solutions. We should discuss this, since it seems like a rather big step, but it would increase performance, reduce storage needs, and make it easier to build applications out of the box, since fast querying would become possible.
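To make the trie-based option concrete, here is a minimal sketch of an in-memory n-gram trie. It is only an illustration under stated assumptions: the names `TrieNode`, `NGramTrie`, `add`, `count`, and `continuations` are hypothetical and not part of the toolkit, and it assumes n-grams arrive as token sequences with integer counts.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Maps the next token to the child node for that token.
    children: dict = field(default_factory=dict)
    # Count of the n-gram that ends exactly at this node (0 if none stored).
    count: int = 0


class NGramTrie:
    """Hypothetical n-gram index: one trie level per token position."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, ngram, count=1):
        """Insert an n-gram (sequence of tokens) or increase its count."""
        node = self.root
        for token in ngram:
            node = node.children.setdefault(token, TrieNode())
        node.count += count

    def _find(self, tokens):
        """Walk the trie along `tokens`; return the node or None."""
        node = self.root
        for token in tokens:
            node = node.children.get(token)
            if node is None:
                return None
        return node

    def count(self, ngram):
        """Return the stored count of `ngram`, or 0 if it was never added."""
        node = self._find(ngram)
        return node.count if node is not None else 0

    def continuations(self, prefix):
        """Return {token: count} for stored n-grams of the form prefix + token."""
        node = self._find(prefix)
        if node is None:
            return {}
        return {tok: child.count
                for tok, child in node.children.items()
                if child.count > 0}


trie = NGramTrie()
trie.add(("the", "cat", "sat"), 2)
trie.add(("the", "cat", "ran"))
trie.add(("the", "dog", "sat"))
print(trie.count(("the", "cat", "sat")))
print(trie.continuations(("the", "cat")))
```

Shared prefixes such as `("the", "cat")` are stored once, which is where the storage saving would come from, and a lookup costs one dictionary access per token. An FST would additionally share suffixes, which this sketch does not attempt.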

lschmelzeisen commented 10 years ago

If I understand this correctly, these are purely performance optimizations, so we are doing neither at the moment, and we will have to choose a data format in the future if we want to optimize?

renepickhardt commented 10 years ago

Right, we don't do that yet, but I want to leave the issue open as an enhancement.

lschmelzeisen commented 10 years ago

Potential bachelor thesis topic of mine: how to index and compress (skipped) n-grams.
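As a rough illustration of why skipped n-grams make indexing and compression worthwhile, the sketch below enumerates skip patterns for an n-gram and counts them. Everything here is an assumption made for the example: the `SKIP` marker `*`, the choice to allow skips at every position except the last token, and the helper names `skip_patterns` and `count_patterns` are hypothetical and do not describe the toolkit's actual format.

```python
from collections import Counter
from itertools import product

SKIP = "*"  # hypothetical wildcard marker, not the toolkit's own symbol


def skip_patterns(ngram):
    """Yield every variant of a non-empty `ngram` in which any subset of
    the context tokens (all tokens but the last) is replaced by SKIP."""
    *context, last = ngram
    for mask in product((False, True), repeat=len(context)):
        pattern = tuple(SKIP if skipped else token
                        for token, skipped in zip(context, mask))
        yield pattern + (last,)


def count_patterns(ngrams):
    """Count how often each skip pattern occurs over a list of n-grams."""
    counts = Counter()
    for ngram in ngrams:
        counts.update(skip_patterns(ngram))
    return counts


counts = count_patterns([("a", "b", "c"), ("a", "x", "c")])
for pattern, n in sorted(counts.items()):
    print(pattern, n)
```

Under these assumptions an n-gram of length n produces 2^(n-1) patterns, so the number of stored entries grows quickly with n, and many patterns share long common prefixes or suffixes that an index could store once.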