k2-fsa / text_search

Some fast-ish algorithms for batch text search in moderate-sized collections, intended for data cleanup
https://k2-fsa.github.io/text_search/

Add renumbering for computing suffix arrays #25

Closed csukuangfj closed 1 year ago

csukuangfj commented 1 year ago

I propose that we revisit whether it is necessary to use int64_t when computing the suffix array in the underlying C++ implementation.

I think uint32_t or even uint16_t is enough. Maybe uint8_t is also enough if we split the text into smaller pieces.
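
To make the idea concrete, here is a minimal sketch of what renumbering could look like. It is not taken from the text_search code base: `Renumbered`, `Renumber`, and `BytesNeeded` are hypothetical names used only for illustration. The sketch assumes the input symbols arrive as int64_t and maps each one to its rank among the distinct symbols, so the relative order is preserved and the ids fall in 0..k-1. Note that two separate ranges matter when picking a width: the symbol ids (bounded by the number of distinct symbols) and the suffix array entries themselves (positions, bounded by the text length). Splitting the text into smaller pieces shrinks the second range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper types and functions, for illustration only.
struct Renumbered {
  std::vector<int32_t> ids;  // dense ids in 0..num_distinct-1, one per input symbol
  int64_t num_distinct;      // number of distinct symbols in the input
};

// Maps each symbol to its rank among the distinct symbols.
// Order is preserved: a < b implies id(a) < id(b).
Renumbered Renumber(const std::vector<int64_t> &symbols) {
  std::vector<int64_t> sorted(symbols);
  std::sort(sorted.begin(), sorted.end());
  sorted.erase(std::unique(sorted.begin(), sorted.end()), sorted.end());

  // Assumption of this sketch: the distinct symbols fit in int32_t ids.
  assert(sorted.size() <= static_cast<std::size_t>(INT32_MAX));

  Renumbered out;
  out.num_distinct = static_cast<int64_t>(sorted.size());
  out.ids.reserve(symbols.size());
  for (int64_t s : symbols) {
    // Binary search for the symbol's position in the sorted distinct list.
    auto it = std::lower_bound(sorted.begin(), sorted.end(), s);
    out.ids.push_back(static_cast<int32_t>(it - sorted.begin()));
  }
  return out;
}

// Smallest unsigned integer width, in bytes, that can hold 0..max_value.
int BytesNeeded(uint64_t max_value) {
  if (max_value <= UINT8_MAX) return 1;
  if (max_value <= UINT16_MAX) return 2;
  if (max_value <= UINT32_MAX) return 4;
  return 8;
}
```

With something like this, `BytesNeeded(num_distinct - 1)` would bound the width needed for the symbols, and `BytesNeeded(text_length)` the width needed for the suffix array entries, so the narrower types only apply when both the alphabet and the piece length are small enough.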