Open joergi-w opened 2 years ago
Some ideas, without looking much at the code:
- [...] `text_begin`), we also have select and rank support for this vector. The number of texts would then be `rank(text_begin, size()) // +1 ??`. This should be constant time.
- [...] `size_t`. Should be faster than rank, but will change the index serialisation.
- [...] `nseq == 1`/`nseq == 2` check.

Question: Do we also need the sizes of individual texts in the collection?
- [...] `text_begin` as well as rank/select, we could determine the text size (`text_size(x) == select(x + 1, text_begin) - select(x, text_begin); // probably off by one`). This should also be constant time.

(cc @SGSSGene)
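To make the rank/select idea above concrete, here is a minimal sketch. It assumes that `text_begin` is a bit vector over the concatenated text with a set bit at the first position of each text, and that every text is followed by exactly one sentinel; the real layout inside the index may differ. `rank1`, `select1`, `text_count` and `text_size` are hypothetical names, and the rank/select functions are naive linear-time stand-ins for a proper rank/select support structure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive stand-in for a rank query: number of set bits in bits[0, end).
inline std::size_t rank1(std::vector<bool> const & bits, std::size_t end)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < end && i < bits.size(); ++i)
        if (bits[i])
            ++count;
    return count;
}

// Naive stand-in for a select query: position of the k-th set bit (k is 1-based).
// Returns bits.size() if there are fewer than k set bits.
inline std::size_t select1(std::vector<bool> const & bits, std::size_t k)
{
    assert(k > 0);
    for (std::size_t i = 0; i < bits.size(); ++i)
        if (bits[i] && --k == 0)
            return i;
    return bits.size();
}

// Number of texts = number of set bits in the whole vector (assumed layout).
inline std::size_t text_count(std::vector<bool> const & text_begin)
{
    return rank1(text_begin, text_begin.size());
}

// Size of the x-th text (0-based), excluding its trailing sentinel (assumed layout).
inline std::size_t text_size(std::vector<bool> const & text_begin, std::size_t x)
{
    assert(x < text_count(text_begin));
    std::size_t const begin = select1(text_begin, x + 1);
    std::size_t const next = select1(text_begin, x + 2); // bits.size() for the last text
    return next - begin - 1;                             // drop the sentinel
}
```

For the two texts `ACG` and `TT`, stored as `ACG$TT$`, the assumed bit vector would be `1000100`, giving two texts of sizes 3 and 2 under this layout.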
For calculating e-values or other purposes it is often necessary to query the text size of a (bi) FM index.
I have experimented with the `size()` function of `seqan3::bi_fm_index<dna4, text_layout::collection>` in order to calculate it myself. According to the documentation, the value of `size()` includes sentinels. Assume that I have stored a list of sequence names somewhere, so I know the value nseq = number of indexed sequences. Then for nseq > 1, I can compute the text size as `nchar = index.size() - nseq`. For nseq == 1 we have a special case with `nchar = index.size() - 2` (because a single sequence has 2 sentinels).

I suggest providing a function `get_text_size()` for the index that performs these calculations. An issue is that we have to keep track of the number of sequences stored in the index (which I could solve with the length of the names list).
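As a rough sketch of the calculation described above, the helper below takes the value reported by `index.size()` and the number of sequences as plain integers. `text_size_without_sentinels` is a hypothetical name and not part of SeqAn3; the sentinel counts are the ones stated in this issue, not taken from the library source.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a SeqAn3 function): number of indexed characters,
// given the value of index.size() and the number of indexed sequences.
// Sentinel counts as described in this issue: one per sequence for nseq > 1,
// two for a single sequence.
inline std::size_t text_size_without_sentinels(std::size_t index_size, std::size_t nseq)
{
    assert(nseq > 0);
    std::size_t const sentinels = (nseq == 1) ? 2 : nseq;
    assert(index_size >= sentinels);
    return index_size - sentinels;
}
```

With a names list at hand, the call would look like `text_size_without_sentinels(index.size(), names.size())`, where `names` is the assumed container of sequence names.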