soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0
1.37k stars 191 forks source link

[Question] Bit score and E-value calculation - origin of Lambda and K constants #731

Open BartoszJanuszNA opened 1 year ago

BartoszJanuszNA commented 1 year ago

Hello! I am trying to understand how exactly are bit scores and E-values calculated. I found the formula in The Statistics of Sequence Similarity Scores and noticed that besides query and target lengths (n, m), it depends on Lambda and K constants, which are tied to the used substitution scores and gap penalties.

I found the function computeEvalue() https://github.com/soedinglab/MMseqs2/blob/f5f780acd64482cd59b46eae0a107f763cd17b4d/src/util/extractdomains.cpp#L46-L50

I can see those constants are hardcoded, which stands in contradiction to what is described in the document above. Where are those values coming from? I found they are present in Table 1 in Altschul, S. F., & Gish, W. (1996). [27] Local alignment statistics. Computer Methods for Macromolecular Sequence Analysis, 460–480. doi:10.1016/s0076-6879(96)66029-7 , but not in the same row.

For bit score calculation https://github.com/soedinglab/MMseqs2/blob/f5f780acd64482cd59b46eae0a107f763cd17b4d/lib/alp/sls_alignment_evaluer.hpp#L159-L162 those are variables indeed, but I had trouble determining how are they set are how they are changing. It it happening here? https://github.com/soedinglab/MMseqs2/blob/f5f780acd64482cd59b46eae0a107f763cd17b4d/src/alignment/EvalueComputation.h#L48

Thanks in advance for the help!

milot-mirdita commented 1 year ago

They are computed by the AFLP/FALP library (https://academic.oup.com/bioinformatics/article/32/2/304/1744607). The computation takes a bit for new substitution matrices, so we precompute them for commonly used matrices and hardcode the values.

martin-steinegger commented 1 year ago

Milot mentioned already ALP. You can find the relevant logic for the E-value computation in src/alignment/EvalueComputation.h. E.g. our precomputed values for BLOSUM62 are here. The first two values are lambda and k.


                {"blosum62.out", 11, 1, true,  {0.27359865037097330642, 0.044620920658722244834,
                                                       1.5938724404943873658, -19.959867650284412122,
                                                       1.5938724404943873658, -19.959867650284412122,
                                                       30.455610143099914211, -622.28684628915891608,
                                                       30.455610143099914211, -622.28684628915891608,
                                                       29.602444874818868215, -601.81087985041381216}},
BartoszJanuszNA commented 1 year ago

Thank you for quick response! ALP was indeed the missing piece in my puzzle. I am looking at src/alignment/EvalueComputation.h and I am a little confused about the default params for nucleotide alignment. From these lines I can see default gap open and extend penalties are 7 and 1: https://github.com/soedinglab/MMseqs2/blob/f5f780acd64482cd59b46eae0a107f763cd17b4d/src/alignment/EvalueComputation.h#L56-L62

But the CLI help for search command shows 5 and 2:

--gap-open TWIN                  Gap open cost [nucl:5,aa:11]
--gap-extend TWIN                Gap extension cost [nucl:2,aa:1]

Is it a bug or are those two unrelated?

BartoszJanuszNA commented 1 year ago

Also, I wonder on what dataset have you computed those background nucleotide frequencies? And are those substitution scores used for nucleotide alignment? https://github.com/soedinglab/MMseqs2/blob/f5f780acd64482cd59b46eae0a107f763cd17b4d/data/nucleotide.out#L1-L9

gailrosen commented 5 months ago

I would love to know what is the substitution matrix configuration for nucleotides too. Which one is correct?