Distance martix calculation may be wrong

jbarnoud commented 9 years ago

This issue emerges from comments on the #56 pull request.

The pull request moves the code that build the distance matrix into a PBlib.distance_matrix function. I commented about this function:

At this point of the function, we are building a similarity matrix, we only convert it to a distance matrix in the next lines. Therefore, should not the diagonal be the maximum value, so the diagonal will correspond to a distance of 0 once the matrix normalized and converted? In most cases, it should not matter as the minimum value in the diagonal is most likely greater than all values out of the diagonal. Yet, it is not impossible to have cases were it is not true. For instance, ZZdddddZZ as a similarity score of 1105 whith itself, while the similarity score between ZZeeeeeZZ and ZZeefeeZZ is 2265.

@pierrepo replied and highlighted an other issue n the same function:

agree with you @jbarnoud. However, I believe I made an error in the normalization formula. It should be distance_mat = 1 - (distance_mat - mini)/(maxi - mini) instead of distance_mat = 1 - (distance_mat + abs(mini))/(maxi - mini)

The same normalization in used in the code that convert a similarity-score-expressed substitution matrix to a single digit distance-expressed substitution matrix in PBlib.matrix_to_single_digit.

So, this issue exposes 3 problems:

[ ] the diagonal is set to an inappropriate value
[ ] the normalization is wrong in PBlib.distance_matrix
[ ] the normalization is wrong in PBlib.matrix_to_single_digit

Ultimately, the way we build the distance matrix may not be appropriate as it scramble the information given by the diagonal of the substitution matrix. I would be interested by how such distance matrix are built in the literature.

Anyway, this issue should be fixed separately from the pull request #56 as it may change the output completely. The current way of building the matrix should be kept as PBlib.matrix_distance_legacy if it was used in already published papers.

cc @pierrepo @alexdb27

jbarnoud commented 9 years ago

See http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/distmat.html

alexdb27 commented 9 years ago

Ultimately, the way we build the substitution matrix may not be appropriate as it scramble the information given by the diagonal of the substitution matrix. I would be interested by how such distance matrix are built in the literature.

The substitution matrix (SM) is computed in a very classical way as done originally by Dayhoff (1978) or more recently for the Blosum, PAM, Gonnet, etc. SM for amino acids. The computation cannot be more simple and was presented in our first paper in Proteins 2006, 2008 or more recently the new SM in Biochimie 2011. The fact that ZZdddddZZ aligned with itself as not the same values that ZZeeeeeZZ and ZZeefeeZZ is totally logical as the letter d (PB d) is seen quite often and e and f less. You have the same results with the alignement of the amino acid string WWWWWWWW with itself and AAAAAA with itslef, the first one will be also largely higher. It is totally logical.

alexdb27 commented 9 years ago

Nonetheless, the point is very intersting cc @pierrepo @alexdb27.

jbarnoud commented 9 years ago

On 02/05/15 11:17, Alexandre G. de Brevern wrote:

Ultimately, the way we build the substitution matrix may not be
appropriate as it scramble the information given by the diagonal
of the substitution matrix. I would be interested by how such
distance matrix are built in the literature.
The substitution matrix (SM) is computed in a very classical way as done originally by Dayhoff (1978) or more recently for the Blosum, PAM, Gonnet, etc. SM for amino acids. The computation cannot be more simple and was presented in our first paper in Proteins 2006, 2008 or more recently the new SM in Biochimie 2011. The fact that ZZdddddZZ aligned with itself as not the same values that ZZeeeeeZZ and ZZeefeeZZ is totally logical as the letter d (PB d) is seen quite often and e and f less. You have the same results with the alignement of the amino acid string WWWWWWWW with itself and AAAAAA with itslef, the first one will be also largely higher. It is totally logical.

— Reply to this email directly or view it on GitHub https://github.com/pierrepo/PBxplore/issues/62#issuecomment-98337967.

Sorry. I meant that the distance matrix is not computed in a appropriate way.

alexdb27 commented 9 years ago

Sorry, @jbarnoud , i've read it a little bit fast. In fact if i remember well, most of the time, diagonal is not computed at all, it is set directly at ... zero ...

pierrepo commented 9 years ago

@jbarnoud the clustering part of PBxplore has never been used in any paper so far. No need to keep a legacy (wrong?) function.

@alexdb27 could you give us a clear answer on the two points below?

Normalization function must be

distance_mat = 1 - (distance_mat - mini)/(maxi - mini) instead of distance_mat = 1 - (distance_mat + abs(mini))/(maxi - mini)
Which value to we put in the diagonal ? max() ? min() ? 0 ?

HubLot commented 9 years ago

I would like to mention that for the hierarchical clustering in R, the value of the diagonal doesn't matter (either min or 0) and in scipy/scikit-learn, it has to be set at 0.

HubLot commented 9 years ago

I re-open this thread because I'm trying to optimize the computation of the distance matrix.

So far so good, my method is between 6-8 times faster to the previous one (thanks to pdist method). I compute only the condensed matrix (i.e upper triangle of the matrix) and with squareform method, I get the whole matrix.

Unfortunately, I have trouble with the normalization. Currently, the diagonal was set to its minimum value and then the minimum value of the whole matrix was searched. With my calculation my diagonal is set to 0, so I search the min value in my condensed matrix but this min value could be the diagonal value... I cannot have the same matrix distance as before.

Since the diagonal value is pointless, I think normalize by this minimum value is not correct. The minimum value should be search in the upper triangle.

What do you think?

jbarnoud commented 9 years ago

So far, the matrix is normalized using the minimum distance that must be on the diagonal. This, of course, assume that the distances are computed for the diagonal. Doing so, the values on the diagonal should be close to 0 on the diagonal, after normalization. Some cells from the diagonal (at least 1) should be 0 after the normalization, the other ones are set to 0 afterward.

You cannot use a value from outside the diagonal as minimum for the normalization. By doing so, you will have nul distances for sequences that are different from each other.

How long would it take to compute the non normalized distances on the diagonal? If you have them, you can get the minimum to use for the normalization. Since, all the values from the diagonal will be set to 0 anyway, you do not have to change your use of pdist, you just have to get the non-normalized values for the diagonal on the side of what you currently do.

On 07/07/15 11:18, Hub wrote:

I re-open this thread because I'm trying to optimize the computation of the distance matrix.

So far so good, my method is between 6-8 times faster to the previous one (thanks to pdist method http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.pdist.html). I compute only the condensed matrix (i.e upper triangle of the matrix) and with squareform method http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.spatial.distance.squareform.html, I get the whole matrix.

Unfortunately, I have trouble with the normalization. Currently, the diagonal was set to its minimum value and then the minimum value of the whole matrix was searched. With my calculation my diagonal is set to 0, so I search the min value in my condensed matrix but this min value could be the diagonal value... I cannot have the same matrix distance as before.

Since the diagonal value is pointless, I think normalize by this minimum value is not correct. The minimum value should be search in the upper triangle.

What do you think?

— Reply to this email directly or view it on GitHub https://github.com/pierrepo/PBxplore/issues/62#issuecomment-119134428.

HubLot commented 9 years ago

Okay, I see your points. Indeed, the normalization should be done through the minimum value of the diagonal, it makes sense. I'll see how i can compute the diagonal.

pierrepo commented 9 years ago

Guys, I am puzzled about something.

From the substitution matrix, we have: score(d/d) = 221 score(j/j) = 768 score(k/k) = 568 score(j/k) = 196 score(j/d) = -173 score(d/k) = -585

Say we have three sequences: ZZdddZZ, ZZjjjZZZ and ZZZjjkZZZ.

Scores between sequences are: score(ZZdddZZ/ZZdddZZ) = 663 score(ZZjjjZZ/ZZjjjZZ) = 2304 score(ZZjjkZZ/ZZjjkZZ) = 2104

score(ZZjjjZZ/ZZjjkZZ) = 1721 score(ZZjjjZZ/ZZdddZZ) = -519 score(ZZjjkZZ/ZZdddZZ) = -931

The corresponding matrix is:

           ZZdddZZ  ZZjjjZZ  ZZjjkZZ
ZZdddZZ      663     -519     -931
ZZjjjZZ              2304     1721
ZZjjkZZ                       2104

If we set the diagonal to the minimum value of the diagonal we get:

           ZZdddZZ  ZZjjjZZ  ZZjjkZZ
ZZdddZZ      663     -519     -931
ZZjjjZZ              663      1721
ZZjjkZZ                        663

We then will have an issue while computing the distance.

We should instead set the diagonal to the maximum value of the diagonal :

           ZZdddZZ  ZZjjjZZ  ZZjjkZZ
ZZdddZZ     2304     -519     -931
ZZjjjZZ              2304     1721
ZZjjkZZ                       2304

Don't you think?

HubLot commented 9 years ago

Yes, it seems we need to set to the maximum, hence distance = 1 - (score -mini)/(maxi - mini) would equal 0. Plus in the code, the formula is still: distance = 1 - (score + abs(min))/(maxi - min) But, what happens if the maximum is not on the diagonal? The distance on the diagonal will not be set to 0...

pierrepo commented 9 years ago

I guess the maximum of the score matrix is necessary in the diagonal. @alexdb27 could you comment on this?

I am aware the formula to compute the distance is wrong in the code. I would like to correct this today.

jbarnoud commented 9 years ago

On 07/07/15 13:01, Hub wrote:

Yes, it seems we need to set to the maximum, hence |distance = 1 - (score -mini)/(maxi - mini)| would equal 0. Plus in the code, the formula is still: |distance = 1 - (score + abs(min))/(maxi - min)| But, what happens if the maximum is not on the diagonal? The distance on the diagonal will not be set to 0...

— Reply to this email directly or view it on GitHub https://github.com/pierrepo/PBxplore/issues/62#issuecomment-119168979.

Before the normalization, we compute a similarity matrix. I cannot see a case where the maximum value is not on the diagonal. Do you ?

pierrepo commented 9 years ago

@jbarnoud +1

HubLot commented 9 years ago

I don't see either but could it worth a test case/check in the code?

pierrepo commented 9 years ago

I'll add a very simple test. Maybe @alexdb27 can confirme to us that the maximum is always on the diagonal?

alexdb27 commented 9 years ago

A distance matrix is a matrix with distance. So, the diagonal must be at zero.

I propose a little game:

663 -519 -931 ----- 2304 1721 ----- ------- 2104

second step: 663 -519 -931 -519 2304 1721 -931 1721 2104

third step: 1594 412 0 412 3235 2652 0 2652 3035

fourth step: 0 1182 1594 2823 0 583 3035 383 0

fifth and last: 0 4005 4629 ---- 0 966 ---- ---- 0

Sincerely

alexdb27 commented 9 years ago

PS: you write too fast @pierrepo @jbarnoud @HubLot

pierrepo commented 9 years ago

@alexdb27 I did not really understand the little game ;-( I agree the diagonal of a distance matrix must be 0 but we are not sure how to compute the distance of other parts of the matrix.

jbarnoud commented 9 years ago

On 07/07/15 13:28, Hub wrote:

I don't see either but could it worth a test case/check in the code?

— Reply to this email directly or view it on GitHub https://github.com/pierrepo/PBxplore/issues/62#issuecomment-119176900.

@HubLot I have a test version were I check with assertions that the diagonal is 0 (because I do not set it explicitly), that there is no negative values, and that that the maximum value is 1.

@alexdb27 @pierrpo Careful not to confuse the similarity matrix we get from the substitution matrix (where the maximum value should be on the diagonal) with the distance matrix that we get after normalization (where the diagonal must be 0).

alexdb27 commented 9 years ago

Thhe little game is the principle to normalize it in a simple way:

1- the data :

663 -519 -931 ----- 2304 1721 ----- ------- 2104

second step, simply do the matrix 663 -519 -931 -519 2304 1721 -931 1721 2104

third step, in regards to the diagonal, you "positive" the matrix (it is a question , must it be the matrix or the column. I've some personal question to myself) 1594 412 0 412 3235 2652 0 2652 3035

fourth step, diagonal is at zero by row (or column), with a simple difference with the original value of the diagonal:

0 1182 1594 2823 0 583 3035 383 0

fifth and last : back to the half-matrix by adding (j,i) to (i,j) 0 4005 4629 ---- 0 966 ---- ---- 0

alexdb27 commented 9 years ago

@pierrepo I hope it explain you how i see the question

@jbarnoud @HubLot it is too funny !!

pierrepo commented 9 years ago

@alexdb27 Sorry I did not understand steps 4 and 5...

alexdb27 commented 9 years ago

what !!

https://www.youtube.com/watch?v=iCrargw1rrM

alexdb27 commented 9 years ago

4th, we take the first line:

1594 412 0 => (a) minus the max give 0 -1182 -1594 (b) then abs => 0 1182 1594

5th step, you have a non-symetric matrix so you add the symetrical cell to the one upper.

we have 0 1182 1594 2823 0 583 3035 383 0

so we want a half matrix (0,0) (0,1) (0,2) ----- (1,1) (1,2) ----- ------ (2,2)

(0,1) is the sum of (0,1) and (1,0) 1182+2823=4005 (0,2) is the sum of (0,2) and (2,0) 1594+3035=4629 (1,2) is the sum of (1,2) and (2,1) 583+383 = 966

It is my point of view others can be done.

line 1 is 0 1182 1594

pierrepo / PBxplore

Distance martix calculation may be wrong #62