Closed adriaaula closed 10 months ago
Hi @adriaaula,
thanks for reporting this. This occurs in the PCA that is done on the edge imbalances, but I am not sure what is causing it, probably some specific eigenvalues, as a consequence of either of your plausible explanations. It might be solvable by just running that particular algorithm for more iterations, I'll have to see.
If you feel experimental: In the gappa directory, file gappa/libs/genesis/lib/genesis/utils/math/pca.hpp
, change line 157 from
size_t max_iterations = 1000
to some higher value, say, 100000, re-compile gappa, and try again.
Otherwise, could you maybe prepare a minimal example (some jplace files, and the exact command for gappa) to reproduce this, so that I can have a look?
Thanks and so long Lucas
Hi there!
I have changed the number of iterations but it didn't work either. My command is the following:
gappa analyze edgepca --jplace-path ODPB/*.jplace.gz --file-prefix odpb --components 10
And I attach the specific example, the ODPB gene. I have added the gene information.
To give some context: its an extraction of an specific clade (Amoebozoa) form the whole eukaryotic tree using gappa prepare extract
previously. I just realized this could be the key to the problem. In fact, I thought that that command 'extracted' that part of the tree (cutting the tree) but I am seeing that the tree is kept and only the placements are lost. Calculating the edgePCA in an almost empty tree would explain the lack of convergence haha.
That would also explain why the KR distance didn't had any problem to calculate it between the multiple samples. I guess for any analysis with just a subclade I should default to the KRD?
Lemme know what you think, and thanks!
Dear @adriaaula,
thanks for your patience! Yes indeed, I think that the mostly empty tree might be the issue here. Using gappa examine heat-tree
, your example looks like this:
Internally, gappa is using eigenvalue decomposition via the Implicit QL Algorithm, which according to some quick internet search can fail to converge on sparse data - but I'm not an expert on that. Maybe at some point in the distant future I might find time to implement a PCA algorithm that can deal with sparse data, or an option in the extract command to remove branches in the tree... But for now, a solution might be to actually cut your tree down to only the relevant part and run placement on that only - see for instance the chunkify command and the associated pipeline (as described in the paper linked from that commend).
As for the KR distance: Yep, that one just computes the earth movers distance, but there is no PCA or any other numerical method involved, so that doesn't have any issues with sparse data. If that distance measure also helps you, sure, that is also a way forward.
Cheers, hope that helps Lucas
Hi Lucas,
Yes, given that I subset the dataset this creates the sparsity. I am sticking with KRD thanks a ton :)
Hi Lucas and Pierre,
When I run
gappa analyze edgepca
on multiplejplaces
for an specific clade sometimes I obtain the following error:No convergence in tridiagonal_ql_algorithm()
I tried to debug or at least understand what was happening under the hood, and the two most plausible options could be:
The first option doesn't seem to be the case because there are multiple genes only presenting one 0 and even so throwing the same error. The second option makes more sense to me. Maybe its related to the positioning of these reads.
Any ideas?