mckoss / dawg

Directed Acyclic Word Graph
MIT License
41 stars 2 forks source link

Thoughts on how to architect a DAWG-like system for Turkish (an agglutinative language with possibly infinite number of words)? #6

Open lancejpollard opened 10 months ago

lancejpollard commented 10 months ago

You're probably familiar with agglutinative languages, like Turkish, which can have "sentence words", basically an infinite number of combination of base + n suffixes. Navajo is another example language, or Finnish, Inuktitut, etc.. Turkish words can be at least up to 70 characters, but you never probably want to serialize this word into a trie/dictionary because there would be way too many trie nodes I would imagine.

Sorry to bother you, but you seem like you are a master at this so I thought I'd ask you directly :). How would you architect such a DAWG-like system for Turkish, assuming let's say 10,000 base words, and 1000 suffixes, which can be concatenated infinitely.

Seems like you would not have to serialize/realize the actual TrieNodes, but make it virtualized somehow, and I'm having a tough time imagining how that might work/look. Maybe you have some thoughts or have done this sort of thing before. But it's like, all 11000 fragments (words + suffixes) would be in the central trie. The words would then have as leaves a marker that it was a word itself, plus link to the suffix nodes, so forming a loop. The suffixes would link to suffixes as well. How would that look? Is that something possible even, given the added thing of the "loop"? If it's possible, it doesn't seem straightforward to serialize anymore because of the loops, is it still possible to pack and unpack?

Thanks in advance for the help.

Best, Lance