Open BartoszJanuszNA opened 1 year ago
They are computed by the AFLP/FALP library (https://academic.oup.com/bioinformatics/article/32/2/304/1744607). The computation takes a bit for new substitution matrices, so we precompute them for commonly used matrices and hardcode the values.
Milot mentioned already ALP. You can find the relevant logic for the E-value computation in src/alignment/EvalueComputation.h
. E.g. our precomputed values for BLOSUM62 are here.
The first two values are lambda and k.
{"blosum62.out", 11, 1, true, {0.27359865037097330642, 0.044620920658722244834,
1.5938724404943873658, -19.959867650284412122,
1.5938724404943873658, -19.959867650284412122,
30.455610143099914211, -622.28684628915891608,
30.455610143099914211, -622.28684628915891608,
29.602444874818868215, -601.81087985041381216}},
Thank you for quick response! ALP was indeed the missing piece in my puzzle. I am looking at src/alignment/EvalueComputation.h
and I am a little confused about the default params for nucleotide alignment. From these lines I can see default gap open and extend penalties are 7 and 1:
https://github.com/soedinglab/MMseqs2/blob/f5f780acd64482cd59b46eae0a107f763cd17b4d/src/alignment/EvalueComputation.h#L56-L62
But the CLI help for search
command shows 5 and 2:
--gap-open TWIN Gap open cost [nucl:5,aa:11]
--gap-extend TWIN Gap extension cost [nucl:2,aa:1]
Is it a bug or are those two unrelated?
Also, I wonder on what dataset have you computed those background nucleotide frequencies? And are those substitution scores used for nucleotide alignment? https://github.com/soedinglab/MMseqs2/blob/f5f780acd64482cd59b46eae0a107f763cd17b4d/data/nucleotide.out#L1-L9
I would love to know what is the substitution matrix configuration for nucleotides too. Which one is correct?
Hello! I am trying to understand how exactly are bit scores and E-values calculated. I found the formula in The Statistics of Sequence Similarity Scores and noticed that besides query and target lengths (n, m), it depends on Lambda and K constants, which are tied to the used substitution scores and gap penalties.
I found the function
computeEvalue()
https://github.com/soedinglab/MMseqs2/blob/f5f780acd64482cd59b46eae0a107f763cd17b4d/src/util/extractdomains.cpp#L46-L50I can see those constants are hardcoded, which stands in contradiction to what is described in the document above. Where are those values coming from? I found they are present in Table 1 in Altschul, S. F., & Gish, W. (1996). [27] Local alignment statistics. Computer Methods for Macromolecular Sequence Analysis, 460–480. doi:10.1016/s0076-6879(96)66029-7 , but not in the same row.
For bit score calculation https://github.com/soedinglab/MMseqs2/blob/f5f780acd64482cd59b46eae0a107f763cd17b4d/lib/alp/sls_alignment_evaluer.hpp#L159-L162 those are variables indeed, but I had trouble determining how are they set are how they are changing. It it happening here? https://github.com/soedinglab/MMseqs2/blob/f5f780acd64482cd59b46eae0a107f763cd17b4d/src/alignment/EvalueComputation.h#L48
Thanks in advance for the help!