mckoss / dawg

Directed Acyclic Word Graph
MIT License
41 stars 2 forks source link

How does the PTrie format work for compressing the Trie? #5

Open lancejpollard opened 9 months ago

lancejpollard commented 9 months ago

Wondering how you arrived at this compression format? Is it based on a research paper or anything, or how much does it compress the basic JSON structure of the Trie class?

How about adding support for arbitrary other scripts, like Chinese text (70,000+ possible characters used by words), or Amharic (~200 or so characters), or other scripts with fewer characters than that. Would it still work out of the box or would the PTrie compression format/algorithm need to be modified?

When you say "Large dictionaries can be further compressed", does that feature come included in the library, or what do you need to do to make that happen?

Also a stretch question, wondering if you could explain more the compression format, for a complex/larger example. I still am not grasping it fully after reading the readme, I think I will have to play around with some data to test it out and learn from experience perhaps, not sure.

Thanks for your time in advance.

P.S. The PTrie algorithm seems so clever, that optimization seems necessary way further down the road from where I'm at in my journey, thanks so much for sharing!

lancejpollard commented 9 months ago

Here is my play example:

/* eslint-disable @typescript-eslint/no-unsafe-member-access */
import { Trie } from 'dawg-lookup'

const trie = new Trie(
  `
hello
world
apply
apple
appendix
apprehension
apprehend
appreciate
dog
cat
bird
birding
birth
rant
random
rabbit
rack
rancid
bit
bite
dream
clean
cart
code
fish
fist
first
grow
glow
growl
stench
step
stair
tree
treat
trench
`
    .trim()
    .replace(/\n+/g, ' '),
)

const packed = trie.pack()
console.log(packed.split(';').join('\n'))
appHbiDcBdAfi8g6hello,ra4st1tre0world
at,e,n2
air,e0
n0p
ch
bbit,ck,n0
cid,dom,t
low,row0
!l
rst,s0
h,t
og,ream
a0lean,ode
rt,t
r1t0
!e
d0th
!ing
endix,l2re0
ciate,hen0
d,sion
e,y

It looks like things are base 36 encoded toAlphaCode, but that it will grow to arbitrary size? So it should work with Chinese I'm guessing, not sure yet.