Open larsmans opened 10 years ago
Exact n-gram counting is too expensive in terms of storage: a few tens of thousands of articles already take gigabytes, and we need to process millions. I think we can work around this by using two count-min sketches, one for term frequency (tf) and one for document frequency (df).
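To make the proposal concrete, here is a minimal sketch of the two-sketch idea in plain Python. Everything in it is an assumption for illustration rather than existing code from this project: the names `CountMinSketch`, `ngrams`, and `count_documents`, the default `width` and `depth`, and the choice of salted BLAKE2b as the hash family. It assumes documents arrive already tokenized as lists of strings.

```python
import hashlib
from array import array


class CountMinSketch:
    """Fixed-size approximate counter; estimates never undercount."""

    def __init__(self, width=2**16, depth=4):
        self.width = width
        self.depth = depth
        # depth rows of width unsigned 64-bit counters: 8 * width * depth bytes.
        self.rows = [array("Q", [0]) * width for _ in range(depth)]

    def _indices(self, key):
        data = key.encode("utf-8")
        for row in range(self.depth):
            # A different salt per row gives one hash function per row.
            salt = row.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row, col in self._indices(key):
            self.rows[row][col] += count

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.rows[row][col] for row, col in self._indices(key))


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_documents(docs, n=2, width=2**16, depth=4):
    """Feed tokenized documents into a tf sketch and a df sketch."""
    tf = CountMinSketch(width, depth)
    df = CountMinSketch(width, depth)
    for tokens in docs:
        grams = ngrams(tokens, n)
        for gram in grams:
            tf.add(gram)       # every occurrence counts towards tf
        for gram in set(grams):
            df.add(gram)       # each document counts at most once towards df
    return tf, df


docs = [["a", "b", "a", "b"], ["a", "b", "c"]]
tf, df = count_documents(docs)
print(tf.estimate("a b"), df.estimate("a b"))
```

Memory use is fixed by `width` and `depth` regardless of how many distinct n-grams are seen; the price is that estimates can be too high (never too low) when n-grams collide, so both parameters would need tuning against the real corpus.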