Closed by jbarnoud 9 years ago
Ultimately, the way we build the substitution matrix may not be appropriate, as it scrambles the information given by the diagonal of the substitution matrix. I would be interested in how such distance matrices are built in the literature.
The substitution matrix (SM) is computed in a very classical way, as done originally by Dayhoff (1978) and more recently for the BLOSUM, PAM, Gonnet, etc. SMs for amino acids. The computation could not be simpler and was presented in our first papers in Proteins 2006 and 2008, and more recently for the new SM in Biochimie 2011. The fact that ZZdddddZZ aligned with itself does not give the same value as ZZeeeeeZZ or ZZeefeeZZ is totally logical, as the letter d (PB d) is seen quite often while e and f are rarer. You would get the same result when aligning the amino acid string WWWWWWWW with itself and AAAAAA with itself: the first one will be much higher. It is totally logical.
Nonetheless, the point is very interesting. cc @pierrepo @alexdb27
Sorry, I meant that the distance matrix is not computed in an appropriate way.
Sorry @jbarnoud, I read it a little bit too fast. In fact, if I remember well, most of the time the diagonal is not computed at all; it is set directly at ... zero ...
@jbarnoud the clustering part of PBxplore has never been used in any paper so far. No need to keep a legacy (wrong?) function.
@alexdb27 could you give us a clear answer on the two points below?
The normalization function must be:

distance_mat = 1 - (distance_mat - mini)/(maxi - mini)

instead of:

distance_mat = 1 - (distance_mat + abs(mini))/(maxi - mini)
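To illustrate the difference between the two formulas, here is a small sketch with made-up score values (not PBxplore code). The two expressions coincide when the minimum is negative (then abs(mini) = -mini), but with a positive minimum the old formula no longer maps the best score to a distance of 0:

```python
# Toy comparison of the two normalizations; mini/maxi are made-up scores.
mini, maxi = 100.0, 500.0   # a score range whose minimum is positive

def dist_correct(score):
    # correct normalization: best score -> 0, worst score -> 1
    return 1 - (score - mini) / (maxi - mini)

def dist_buggy(score):
    # old formula: only equivalent when mini is negative
    return 1 - (score + abs(mini)) / (maxi - mini)

print(dist_correct(maxi))   # 0.0
print(dist_buggy(maxi))     # -0.5, a negative "distance"
```

With the negative minima seen in the thread's matrices, both formulas give the same result, which is probably why the bug went unnoticed.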
I would like to mention that for hierarchical clustering in R, the value of the diagonal doesn't matter (either the min or 0), while in scipy/scikit-learn it has to be set to 0.
I re-open this thread because I'm trying to optimize the computation of the distance matrix.
So far so good: my method is between 6 and 8 times faster than the previous one (thanks to the pdist method). I compute only the condensed matrix (i.e. the upper triangle of the matrix) and, with the squareform method, I get the whole matrix.
Unfortunately, I have trouble with the normalization. Previously, the diagonal was set to its minimum value and then the minimum value of the whole matrix was searched. With my calculation the diagonal is set to 0, so I search for the min value in my condensed matrix, but this min value could then come from the diagonal... I cannot get the same distance matrix as before.
Since the diagonal value is pointless, I think normalizing by this minimum value is not correct. The minimum value should be searched in the upper triangle.
What do you think?
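For reference, a minimal sketch of the pdist/squareform round trip described above (the feature vectors are made up for illustration; PBxplore uses its own per-sequence data):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# three made-up feature vectors, one row per sequence
X = np.array([[0.0, 0.0],
              [3.0, 0.0],
              [0.0, 4.0]])

condensed = pdist(X)          # condensed form: n*(n-1)/2 upper-triangle values
full = squareform(condensed)  # full symmetric matrix, diagonal forced to 0

print(condensed)              # [3. 4. 5.]
print(np.diag(full))          # [0. 0. 0.]
```

Note that squareform always writes 0 on the diagonal, which is exactly the behaviour discussed in this thread.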
So far, the matrix is normalized using the minimum distance, which must be on the diagonal. This, of course, assumes that the distances are computed for the diagonal. Doing so, the values on the diagonal should be close to 0 after normalization: at least one cell of the diagonal will be exactly 0, and the other ones are set to 0 afterward.
You cannot use a value from outside the diagonal as the minimum for the normalization. By doing so, you would get null distances for sequences that are different from each other.
How long would it take to compute the non-normalized distances on the diagonal? If you have them, you can get the minimum to use for the normalization. Since all the values on the diagonal will be set to 0 anyway, you do not have to change your use of pdist; you just have to get the non-normalized diagonal values alongside what you currently do.
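One way to follow this suggestion (a sketch under assumptions: `score` is a stand-in similarity function and `normalization_minimum` a hypothetical helper, neither is actual PBxplore code):

```python
def score(a, b):
    # stand-in similarity: +1 per matching position, -1 per mismatch
    return sum(1 if x == y else -1 for x, y in zip(a, b))

def normalization_minimum(sequences):
    n = len(sequences)
    # off-diagonal scores, in the same condensed order as pdist would produce
    condensed = [score(sequences[i], sequences[j])
                 for i in range(n) for j in range(i + 1, n)]
    # self-scores: the diagonal, computed on the side at O(n) extra cost
    diag = [score(s, s) for s in sequences]
    # minimum over the whole similarity matrix, diagonal included
    return min(min(condensed), min(diag))

print(normalization_minimum(["aa", "ab", "bb"]))   # -2
```

The diagonal only costs n extra score evaluations, versus n*(n-1)/2 for the condensed part, so it should be cheap.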
Okay, I see your points. Indeed, the normalization should be done using the minimum value of the diagonal; it makes sense. I'll see how I can compute the diagonal.
Guys, I am puzzled about something.
From the substitution matrix, we have:

score(d/d) = 221
score(j/j) = 768
score(k/k) = 568
score(j/k) = 196
score(j/d) = -173
score(d/k) = -585
Say we have three sequences: ZZdddZZ, ZZjjjZZ and ZZjjkZZ.
Scores between sequences are:

score(ZZdddZZ/ZZdddZZ) = 663
score(ZZjjjZZ/ZZjjjZZ) = 2304
score(ZZjjkZZ/ZZjjkZZ) = 2104
score(ZZjjjZZ/ZZjjkZZ) = 1721
score(ZZjjjZZ/ZZdddZZ) = -519
score(ZZjjkZZ/ZZdddZZ) = -931
The corresponding matrix is:

            ZZdddZZ  ZZjjjZZ  ZZjjkZZ
ZZdddZZ         663     -519     -931
ZZjjjZZ                 2304     1721
ZZjjkZZ                          2104
If we set the diagonal to the minimum value of the diagonal we get:
            ZZdddZZ  ZZjjjZZ  ZZjjkZZ
ZZdddZZ         663     -519     -931
ZZjjjZZ                  663     1721
ZZjjkZZ                           663
We will then have an issue when computing the distances.
We should instead set the diagonal to the maximum value of the diagonal:
            ZZdddZZ  ZZjjjZZ  ZZjjkZZ
ZZdddZZ        2304     -519     -931
ZZjjjZZ                 2304     1721
ZZjjkZZ                          2304
Don't you think?
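With the numbers above, setting the diagonal to the maximum of the diagonal and then applying the corrected formula does give a zero diagonal (a numerical sketch, not the PBxplore implementation):

```python
import numpy as np

# symmetric score matrix from the example above
S = np.array([[ 663.0, -519.0, -931.0],
              [-519.0, 2304.0, 1721.0],
              [-931.0, 1721.0, 2104.0]])

np.fill_diagonal(S, np.diag(S).max())   # every diagonal cell becomes 2304
mini, maxi = S.min(), S.max()           # -931 and 2304
D = 1 - (S - mini) / (maxi - mini)      # corrected normalization

print(np.diag(D))       # [0. 0. 0.]
print(D[0, 2])          # 1.0, the most dissimilar pair
```

The diagonal is exactly 0 only because the global maximum ends up on the diagonal, which is the question raised just below.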
Yes, it seems we need to set it to the maximum, hence distance = 1 - (score - mini)/(maxi - mini) would equal 0 on the diagonal.
Plus, in the code, the formula is still: distance = 1 - (score + abs(min))/(maxi - min)
But, what happens if the maximum is not on the diagonal? The distance on the diagonal will not be set to 0...
I guess the maximum of the score matrix is necessarily on the diagonal. @alexdb27 could you comment on this?
I am aware the formula to compute the distance is wrong in the code. I would like to correct this today.
Before the normalization, we compute a similarity matrix. I cannot see a case where the maximum value is not on the diagonal. Do you?
@jbarnoud +1
I don't see one either, but could it be worth a test case/check in the code?
I'll add a very simple test. Maybe @alexdb27 can confirm that the maximum is always on the diagonal?
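Such a check could look like this (a sketch; `max_is_on_diagonal` is a hypothetical helper, not existing PBxplore code):

```python
import numpy as np

def max_is_on_diagonal(similarity):
    # True if the global maximum of the similarity matrix lies on its diagonal
    similarity = np.asarray(similarity, dtype=float)
    return np.diag(similarity).max() == similarity.max()

S = [[ 663.0, -519.0, -931.0],
     [-519.0, 2304.0, 1721.0],
     [-931.0, 1721.0, 2104.0]]
print(max_is_on_diagonal(S))   # True
```

Wired into the normalization as an assertion, it would fail loudly on any matrix where the assumption breaks instead of silently producing a non-zero diagonal.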
A distance matrix is a matrix of distances. So, the diagonal must be zero.
I propose a little game:
first step, the data:

 663  -519  -931
 ---  2304  1721
 ---   ---  2104

second step:

 663  -519  -931
-519  2304  1721
-931  1721  2104

third step:

1594   412     0
 412  3235  2652
   0  2652  3035

fourth step:

   0  1182  1594
2823     0   583
3035   383     0

fifth and last:

   0  4005  4629
 ---     0   966
 ---   ---     0
Sincerely
PS: you write too fast @pierrepo @jbarnoud @HubLot
@alexdb27 I did not really understand the little game ;-( I agree the diagonal of a distance matrix must be 0, but we are not sure how to compute the distances in the other parts of the matrix.
@HubLot I have a test version where I check with assertions that the diagonal is 0 (because I do not set it explicitly), that there are no negative values, and that the maximum value is 1.
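The three assertions could be sketched as follows (`check_distance_matrix` is a hypothetical helper written for illustration, not the actual test code):

```python
import numpy as np

def check_distance_matrix(D):
    # the three sanity checks described above
    D = np.asarray(D, dtype=float)
    assert np.allclose(np.diag(D), 0.0), "diagonal must be 0"
    assert (D >= 0).all(), "no negative values allowed"
    assert np.isclose(D.max(), 1.0), "maximum value must be 1"

# a valid normalized distance matrix passes silently
check_distance_matrix([[0.0, 1.0],
                       [1.0, 0.0]])
```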
@alexdb27 @pierrepo Careful not to confuse the similarity matrix we get from the substitution matrix (where the maximum value should be on the diagonal) with the distance matrix that we get after normalization (where the diagonal must be 0).
The little game is the principle for normalizing it in a simple way:

1- the data:

 663  -519  -931
 ---  2304  1721
 ---   ---  2104

second step, simply complete the symmetric matrix:

 663  -519  -931
-519  2304  1721
-931  1721  2104

third step, with regard to the diagonal, you make the matrix "positive" (it is a question whether it must be the whole matrix or each column; I have some questions for myself):

1594   412     0
 412  3235  2652
   0  2652  3035

fourth step, the diagonal is set to zero by row (or column), with a simple difference with the original value of the diagonal:

   0  1182  1594
2823     0   583
3035   383     0

fifth and last, back to the half matrix by adding (j,i) to (i,j):

   0  4005  4629
 ---     0   966
 ---   ---     0
@pierrepo I hope this explains how I see the question.
@jbarnoud @HubLot it is too funny!!
@alexdb27 Sorry I did not understand steps 4 and 5...
For the 4th step, we take the first line:

1594 412 0 => (a) subtracting the max gives 0 -1182 -1594, (b) then the absolute value gives 0 1182 1594

For the 5th step, you have a non-symmetric matrix, so you add each cell to its symmetric counterpart in the upper half.

We have:

   0  1182  1594
2823     0   583
3035   383     0

and we want the half matrix:

(0,0)  (0,1)  (0,2)
 ---   (1,1)  (1,2)
 ---    ---   (2,2)

(0,1) is the sum of (0,1) and (1,0): 1182 + 2823 = 4005
(0,2) is the sum of (0,2) and (2,0): 1594 + 3035 = 4629
(1,2) is the sum of (1,2) and (2,1): 583 + 383 = 966
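Putting the five steps of the little game together (a sketch; `little_game_distance` is a name invented for this illustration, and `S` is the symmetric score matrix from step 2):

```python
import numpy as np

def little_game_distance(S):
    S = np.asarray(S, dtype=float)
    # step 3: shift so the smallest score becomes 0
    P = S - S.min()
    # step 4: in each row, subtract the diagonal value and take the absolute value
    A = np.abs(P - np.diag(P)[:, None])
    # step 5: fold back to a symmetric matrix by adding (j, i) to (i, j)
    return A + A.T

S = [[ 663.0, -519.0, -931.0],
     [-519.0, 2304.0, 1721.0],
     [-931.0, 1721.0, 2104.0]]
print(little_game_distance(S))
# [[   0. 4005. 4629.]
#  [4005.    0.  966.]
#  [4629.  966.    0.]]
```

Note that the shift in step 3 cancels out in step 4 (P[i,j] - P[i,i] = S[i,j] - S[i,i]), so the result only depends on the scores relative to each row's diagonal.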
This is my point of view; other approaches can be used.
Line 1 is: 0 1182 1594
This issue emerges from comments on the #56 pull request.
The pull request moves the code that builds the distance matrix into a PBlib.distance_matrix function. I commented about this function, and @pierrepo replied and highlighted another issue in the same function: the same normalization is used in the code that converts a similarity-score-expressed substitution matrix to a single-digit distance-expressed substitution matrix, in PBlib.matrix_to_single_digit.

So, this issue exposes 3 problems:
- PBlib.distance_matrix
- PBlib.matrix_to_single_digit
Ultimately, the way we build the distance matrix may not be appropriate, as it scrambles the information given by the diagonal of the substitution matrix. I would be interested in how such distance matrices are built in the literature.
Anyway, this issue should be fixed separately from pull request #56, as it may change the output completely. The current way of building the matrix should be kept as PBlib.matrix_distance_legacy if it was used in already published papers.

cc @pierrepo @alexdb27