semanticize / semanticizest

Standalone Semanticizer
Apache License 2.0

Need better n-gram counting (count-min sketch?) #9

Open larsmans opened 10 years ago

larsmans commented 10 years ago

Exact n-gram counting is too expensive in terms of storage: a few tens of thousands of articles already take gigabytes, and we need to process millions. I think we can work around this by using two count-min sketches, one for tf (term frequency) and one for df (document frequency).
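
To make the proposal concrete, here is a minimal sketch of what two count-min sketches for tf and df might look like. It is not code from semanticizest: the class `CountMinSketch`, the helpers `ngrams` and `count_articles`, the whitespace tokenization, the BLAKE2b-based hashing and the default `width`/`depth` values are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate counter using fixed memory: depth rows of width cells.

    Estimates never undercount; hash collisions can only inflate them.
    """

    def __init__(self, width=2**16, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One salted hash per row stands in for independent hash functions.
        data = key.encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.rows[row][col] += count

    def estimate(self, key):
        # The least-collided row gives the tightest upper bound.
        return min(self.rows[row][col] for row, col in self._cells(key))


def ngrams(tokens, n_max):
    """Yield every n-gram of length 1..n_max as a space-joined string."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def count_articles(articles, n_max=2, width=2**16, depth=4):
    """Fill one sketch with term frequencies and one with document frequencies."""
    tf = CountMinSketch(width, depth)
    df = CountMinSketch(width, depth)
    for text in articles:
        grams = list(ngrams(text.split(), n_max))  # naive tokenization (assumption)
        for gram in grams:
            tf.add(gram)       # every occurrence counts towards tf
        for gram in set(grams):
            df.add(gram)       # each article counts at most once towards df
    return tf, df


tf, df = count_articles(["the cat sat on the mat", "the dog sat"])
print(tf.estimate("the"), df.estimate("the"))
```

Storage is fixed at `width * depth` counters per sketch no matter how many distinct n-grams are seen, which is the point of the trade-off: counts become upper bounds rather than exact values. A real implementation would presumably keep the rows in a compact integer array rather than Python lists.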