simongog / sdsl-lite

Succinct Data Structure Library 2.0

Feature Request: Build FM index for a set of strings #346

Open lutteropp opened 7 years ago

lutteropp commented 7 years ago

Hello,

I want to use the csa_wt index to count DNA substrings of variable length in a file of DNA sequencing reads. The reads are separated by newline characters, so the file looks like this:

ACCGTATTTAGCACTGATCGATCGATC
AAGGTCGATCGATCGATCACT
AAACTACGATCGATCGTACATGCA

Is there a way to tell csa_wt that suffixes spanning a newline character should be ignored in order to speed up the lookup and further reduce the size of the FM index?
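For context, one observation that may be relevant: as long as the reads are joined by a character outside the DNA alphabet (here the newline), a query made up only of A, C, G and T cannot match across a read boundary, so the counts are already per-read without any special handling. The sketch below illustrates only that point. It uses the standard library alone, with a plain sorted suffix array standing in for csa_wt, and the names `build_suffix_array` and `count_occurrences` are hypothetical helpers, not part of sdsl-lite. It does not address the index-size part of the request.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Build a plain suffix array by sorting suffix start positions.
// Quadratic-ish in the worst case: fine for a sketch, not for real read sets.
// (Hypothetical helper; a stand-in for an FM index, not sdsl-lite code.)
std::vector<std::size_t> build_suffix_array(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Count occurrences of `pattern` in `text` with two binary searches
// over the suffix array: all suffixes starting with `pattern` form
// one contiguous range.
std::size_t count_occurrences(const std::string& text,
                              const std::vector<std::size_t>& sa,
                              const std::string& pattern) {
    // First suffix whose leading |pattern| characters are >= pattern.
    auto lo = std::lower_bound(sa.begin(), sa.end(), pattern,
        [&](std::size_t pos, const std::string& p) {
            return text.compare(pos, p.size(), p) < 0;
        });
    // First suffix whose leading |pattern| characters are > pattern.
    auto hi = std::upper_bound(sa.begin(), sa.end(), pattern,
        [&](const std::string& p, std::size_t pos) {
            return text.compare(pos, p.size(), p) > 0;
        });
    return static_cast<std::size_t>(hi - lo);
}
```

With the three example reads joined by '\n', a query such as "ATCAAG" (the end of the first read followed by the start of the second) would find nothing, because the newline sits between the two halves in the indexed text. The suffixes that start at or span a newline are still stored in the index, which is the overhead the request is about.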

ekg commented 6 years ago

I also think this would be very helpful :+1: