liamrevell / phytools

GNU General Public License v3.0
207 stars 56 forks source link

phyl.pca outputs all imaginary coordinates #85

Closed amsesk closed 3 years ago

amsesk commented 3 years ago

Hey phytools! I've been trying to use the phyl.pca function to generate ordinations. I have never had a problem with this function before, but this one particular tree and count matrix input to the function generates imaginary coordinates for every row and every PC. I have no idea what the issue is, but it seems tied to this dataset.

I'm calling the function like so:

pca = phyl.pca(tree = my_tree, Y = scaled_count_matrix)

Then when I look at the coordinates I see this (snippet only):

PC1              PC2              PC3
Acholeplasma axanthum                          0.0128105618+0i -0.0171490317+0i -0.0090788698+0i
Acholeplasma brassicae                         0.0192223153+0i -0.0072171295+0i  0.0059417746+0i
Acholeplasma hippikon                          0.0073414286+0i -0.0069543587+0i  0.0488289432+0i
Acholeplasma laidlawii                         0.0032912407+0i -0.0127211158+0i  0.0096224232+0i
Acholeplasma oculi                             0.0091531242+0i  0.0018697068+0i  0.0081105079+0i
Acholeplasma palmae                            0.0101501269+0i -0.0025255678+0i  0.0066665859+0i

...and onwards to PC 148

I've used both hellinger and log scale with no luck. If I shrink the matrix down sufficiently, I can sometimes get around the problem. However, I want all those values included. The count matrix is a dataframe with dim = (149, 3781). The matrix has a lot of zeros, but I'm scaling it so I'm not sure what's going on.

prcomp runs the count matrix with no issue, but I obviously want the phylogenetic scaling.

What kind of matrix/tree characteristics would cause this to happen?

Thank you. Totally awesome package by the way - I use it all the time!

liamrevell commented 3 years ago

Hi @amsesk. 0.0128105618+0i is not an imaginary number since the imaginary part is 0i. I'm guessing that one of your eigenvalues is negative. This will cause some of columns of your score matrix to be imaginary. Since this is a matrix, all the columns then show up in the imaginary syntax; however, any score that is +0i is not actually imaginary!

This could be because some of the variables are perfect linear combinations of others. (Normally these are automatically cut off, but it might have slipped through somehow.) Can you print your eigenvalues & see?

amsesk commented 3 years ago

@liamrevell Oh! That's good to know. Okay, so there are imaginary numbers in there somewhere that are setting the type across everything? Would they be denoted by +1i or something like that?

I apologize that I'm not super familiar with what happens under the hood here. Are the eigenvalues in the diagonal matrix at pca$Eval? Then I can make sure I'm looking in the right place.

EDIT: I'm also happy to share any parts of this object if it'd be easier for you to look at the data. I can upload a tsv somewhere.

liamrevell commented 3 years ago

Whatever the imaginary part is, it would be indicated by +(imaginary part)i. E.g., +5.2i or whatever.

I apologize that I'm not super familiar with what happens under the hood here. Are the eigenvalues in the diagonal matrix at pca$Eval? Then I can make sure I'm looking in the right place.

This is correct.

amsesk commented 3 years ago

@liamrevell Got it thank you.

Looks like there are no negative eigenvalues:

> which(Re(pca$Eval)<0)
integer(0)

Should I just convert the PC coordinates to real and go with it? Or is there something else going on?

liamrevell commented 3 years ago

Should I just convert the PC coordinates to real and go with it? Or is there something else going on?

If you don't have any imaginary values in your scores, then you haven't actually done any converting. You can actually treat these scores as real numbers. To remove the +0i from each of them, you can use the function Re -- e.g., Re(0.0128105618+0i).

If you do have imaginary parts to your scores, then this is probably worth investigating further!

Looks like there are no negative eigenvalues:

Because you have more columns than rows in your data frame, some eigenvalues were definitely zero & have been left out. (That's why you end up with fewer PCs than variables in your data matrix.) Did your analysis retain any zero eigenvalues?

amsesk commented 3 years ago

If you don't have any imaginary values in your scores, then you haven't actually done any converting.

I can test for this pretty easily I suppose. I'll have to look up the function for pulling the imaginary coefficent, but as long as I make sure they're all 0 then I should be okay, right? I'm still a little confused why this is happening in this dataset and not others - since both have more columns than rows.

Because you have more columns than rows in your data frame, some eigenvalues were definitely zero & have been left out. (That's why you end up with fewer PCs than variables in your data matrix.)

My dataframe is PFAM domains on the columns (3,781 columns) by the tree (149 tips) on the rows. So yeah, definitely more columns than rows. But I don't think this posed a problem in the other dataset, which is 54 tips by ~5k PFAMs.

Did your analysis retain any zero eigenvalues?

Do you mean by this that the diagonal ran into the right side of the pca$Eval matrix? Yes it did.

amsesk commented 3 years ago

Yeah, so I guess I'm fine:

which(Im(pca$S) != 0)
integer(0)

Still not sure why this phyl.pca acted any different than any of the other ones, but I guess I can collapse all these scores to real numbers without future regret.

Thanks for your help and quick responses!