Question: where sampled suffix array values are stored?

lh3 commented 1 month ago

I am very impressed by MONI. I have a few questions about its index.

I can see the prefix.idx file keeps the sequence names among other things. What does prefix.plain.slp keep? Does it keep RL-BWT or additional data? What about prefix.thrbv.ms? Based on the file name, I guess it is related to thresholds and match statistics. Where is sampled suffix array stored?
MONI can optionally generate "shaped" SLP. The default is "plain". Which one is generally preferred? I assume it is "plain"?
For highly repetitive sequence collections, what fraction does each component (e.g. RL-BWT, sampled SA, thresholds, auxiliary data structures of matching statistics, etc) take approximately in the index files?
What is the relationship between MONI and SPUMONI? My impression is that SPUMONI v1 is adapted from MONI and keeps the full BWT in the base space, but v2 starts to use minimizer digests. With v2, we may not be able reconstruct the input sequences if we choose k<w. Is my description about right?

Thanks!

maxrossi91 commented 1 month ago

Thank you very much. Let me answer to you point by point.

I can see the prefix.idx file keeps the sequence names among other things. What does prefix.plain.slp keep? Does it keep RL-BWT or additional data? What about prefix.thrbv.ms? Based on the file name, I guess it is related to thresholds and match statistics. Where is sampled suffix array stored?

Let me describe all the components:

prefix.idx stores the sequence names and the start position of each sequence in the concatenation of the contigs in the input FASTA. This is used to be able to retrieve the contig name of MEM matches.
prefix.plain.slp stores a straight-line program (SLP) representation of the concatenation of the contigs in the input FASTA. This is used to compute the lengths of the matching statistics in the two-pass algorithm presented in [1].
The prefix.thrbv.ms stores the RL-BWT, thresholds, and suffix array (SA) samples at the beginning and end or each RL-BWT run. This stores all the components of the r-index necessary to compute the matching statistics pointers.

MONI can optionally generate "shaped" SLP. The default is "plain". Which one is generally preferred? I assume it is "plain"?

Yes, we generally prefer to use the plain because it is faster to query (albeit ~2x larger) than the shaped version of the SLP. The shaped version was introduced in [1].

For highly repetitive sequence collections, what fraction does each component (e.g. RL-BWT, sampled SA, thresholds, auxiliary data structures of matching statistics, etc) take approximately in the index files?

Unfortunately, at the moment I do not have a further breakdown of the exact required space fraction for each component. However, in general the highest fraction is taken by the SA samples. All three components space is proportional to the number of runs of the BWT, however the SA samples requires log(n) bits of space per entry where n is the length of the concatenation of the contigs in the input FASTA. The RL-BWT is stored using a Wavelet tree for the run-heads and Elias-Fano compressed bitvectors to store the run lengths, while the thresholds are stored with one Elias-Fano compressed bitvector per character in the input alphabet for a total of O(r) values across all alphabet characters.

What is the relationship between MONI and SPUMONI? My impression is that SPUMONI v1 is adapted from MONI and keeps the full BWT in the base space, but v2 starts to use minimizer digests. With v2, we may not be able reconstruct the input sequences if we choose k<w. Is my description about right?

Yes, SPUMONI (v1) was branched from MONI. The main difference with MONI is that SPUMONI does not store the SLP that in MONI is used to compute the matching statistics lengths. This enables higher query speed when computing the Pseudo-matching lengths. Yes, SPUMONI v1 stores the BWT of the text while SPUMONI v2 stores minimizer digest, so in SPUMONI v2 we would not be able to reconstruct the input sequence if k<w.

Hope this clarifies. Please let me know if you have further questions.

Bests,

Max

[1] Bannai, Hideo, Travis Gagie, and Tomohiro I. "Refining the r-index." Theoretical Computer Science 812 (2020): 96-108. [2] Gagie, Travis, Tomohiro I, Giovanni Manzini, Gonzalo Navarro, Hiroshi Sakamoto, Louisa Seelbach Benkner, and Yoshimasa Takabatake. "Practical random access to SLP-compressed texts." In International Symposium on String Processing and Information Retrieval, pp. 221-231. Cham: Springer International Publishing, 2020.

lh3 commented 1 month ago

Thanks for the quick and detailed reply! I have a follow up question: how does MONI store suffix array samples? Are these kept in a bit-packed array? In that case, each run will take $2\lceil\log_2(n)\rceil$ bits in exact. Is that right?

For disclosure, I am thinking whether to implement an r-index and want to learn from MONI.

maxrossi91 commented 1 month ago

Yes, MONI stores the SA samples in a bit-packed array. I am using the int_vector<> implementation of SDSL (https://github.com/simongog/sdsl-lite/blob/master/include/sdsl/int_vector.hpp).

For disclosure, I am thinking whether to implement an r-index and want to learn from MONI.

This is nice. Let me leave you some notes here then, hoping they may help. So, the r-index construction implemented in the ms_pointers.hpp does not support a locate query, since it does not build the compressed $\varphi$ function ($\varphi(SA[i]) = SA[i-1]$) necessary for the locate of occurrences of query patterns. The implementation of $\varphi$ construction and query is, e.g., in (https://github.com/maxrossi91/moni-align/blob/408c8a03f5dcd92f5487a4e707a8272e3b602fd9/include/ms/moni.hpp#L186) and it is a predecessor data structure built on the SA samples, which requires additional $r\lceil\log_2(n)\rceil$ bits plus $r\log_2(n/r)$ for a Elias-Fano encoding of the predecessor search structure. Note that this allows to locate all occurrences when a whole pattern is matched, but it is not enough to locate all occurrences of a pattern substring like a MEM for which it is necessary to have also the $\varphi^{-1}$ function (where $\varphi^{-1}(SA[i]) = SA[i+1]$) implemented as a predecessor search data structure as well.

lh3 commented 1 month ago

Lots to digest! I am closing this issue for now and might come back to it when I have more technical questions. Thanks again for the explanation.

maxrossi91 / moni

Question: where sampled suffix array values are stored? #7