seqan / seqan3

The modern C++ library for sequence analysis. Contains version 3 of the library and API docs.
https://www.seqan.de

[Search] Query the text size #2934

Open joergi-w opened 2 years ago

joergi-w commented 2 years ago

For calculating e-values and for other purposes, it is often necessary to query the text size of a (bi) FM index.

I have experimented with the size() function of seqan3::bi_fm_index<dna4, text_layout::collection> in order to calculate the text size myself. According to the documentation, the value of size() includes sentinels. Assume that I have stored a list of sequence names somewhere, so I know nseq, the number of indexed sequences. Then, for nseq > 1, I can compute the text size as nchar = index.size() - nseq. The case nseq == 1 is special: there, nchar = index.size() - 2, because a single sequence has 2 sentinels.

I suggest providing a function get_text_size() for the index that performs these calculations. One issue is that we have to keep track of the number of sequences stored in the index (on my side, I could solve this with the length of the names list).
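The arithmetic described in this comment could be sketched as a small free function. This is only an illustration under the assumptions stated above (size() counts one sentinel per sequence for nseq > 1, and two sentinels for a single sequence). The name text_size_from_index is hypothetical and not part of SeqAn3, and the function takes plain integers so that it does not depend on the library.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper (not a SeqAn3 API): derives the number of text
// characters from the value reported by index.size() and the number of
// indexed sequences, following the sentinel counts described in this issue.
inline std::size_t text_size_from_index(std::size_t index_size, std::size_t nseq)
{
    if (nseq == 0)
        throw std::invalid_argument{"nseq must be at least 1"};

    // Assumption from the issue: a single sequence carries 2 sentinels,
    // a collection of nseq > 1 sequences carries nseq sentinels.
    std::size_t const sentinels = (nseq == 1) ? 2 : nseq;

    if (index_size < sentinels)
        throw std::invalid_argument{"index_size is smaller than the sentinel count"};

    return index_size - sentinels;
}
```

With an index and a names list at hand, a call would look like text_size_from_index(index.size(), names.size()), where names is the caller's own list of sequence names.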

eseiler commented 2 years ago

Some ideas, without looking much at the code:

Question: Do we also need the sizes of individual texts in the collection?