lczech / gappa

A toolkit for analyzing and visualizing phylogenetic (placement) data
GNU General Public License v3.0
56 stars 7 forks source link

No convergence in tridiagonal_ql_algorithm() #26

Closed adriaaula closed 10 months ago

adriaaula commented 11 months ago

Hi Lucas and Pierre,

When I run gappa analyze edgepca on multiple jplaces for an specific clade sometimes I obtain the following error:

No convergence in tridiagonal_ql_algorithm()

I tried to debug or at least understand what was happening under the hood, and the two most plausible options could be:

  1. There are multiple samples presenting 0 reads and when it tries to compute the imbalances its not possible.
  2. There are multiple samples with the same amount of reads, a small amount, located in similar locations, and the imbalance obtained is not different from others.

The first option doesn't seem to be the case because there are multiple genes only presenting one 0 and even so throwing the same error. The second option makes more sense to me. Maybe its related to the positioning of these reads.

Any ideas?

lczech commented 11 months ago

Hi @adriaaula,

thanks for reporting this. This occurs in the PCA that is done on the edge imbalances, but I am not sure what is causing it, probably some specific eigenvalues, as a consequence of either of your plausible explanations. It might be solvable by just running that particular algorithm for more iterations, I'll have to see.

If you feel experimental: In the gappa directory, file gappa/libs/genesis/lib/genesis/utils/math/pca.hpp, change line 157 from

size_t max_iterations = 1000

to some higher value, say, 100000, re-compile gappa, and try again.

Otherwise, could you maybe prepare a minimal example (some jplace files, and the exact command for gappa) to reproduce this, so that I can have a look?

Thanks and so long Lucas

adriaaula commented 11 months ago

Hi there!

I have changed the number of iterations but it didn't work either. My command is the following:

gappa analyze edgepca --jplace-path ODPB/*.jplace.gz --file-prefix odpb --components 10

And I attach the specific example, the ODPB gene. I have added the gene information. To give some context: its an extraction of an specific clade (Amoebozoa) form the whole eukaryotic tree using gappa prepare extract previously. I just realized this could be the key to the problem. In fact, I thought that that command 'extracted' that part of the tree (cutting the tree) but I am seeing that the tree is kept and only the placements are lost. Calculating the edgePCA in an almost empty tree would explain the lack of convergence haha.

That would also explain why the KR distance didn't had any problem to calculate it between the multiple samples. I guess for any analysis with just a subclade I should default to the KRD?

Lemme know what you think, and thanks!

ODPB.zip

lczech commented 10 months ago

Dear @adriaaula,

thanks for your patience! Yes indeed, I think that the mostly empty tree might be the issue here. Using gappa examine heat-tree, your example looks like this:

image

Internally, gappa is using eigenvalue decomposition via the Implicit QL Algorithm, which according to some quick internet search can fail to converge on sparse data - but I'm not an expert on that. Maybe at some point in the distant future I might find time to implement a PCA algorithm that can deal with sparse data, or an option in the extract command to remove branches in the tree... But for now, a solution might be to actually cut your tree down to only the relevant part and run placement on that only - see for instance the chunkify command and the associated pipeline (as described in the paper linked from that commend).

As for the KR distance: Yep, that one just computes the earth movers distance, but there is no PCA or any other numerical method involved, so that doesn't have any issues with sparse data. If that distance measure also helps you, sure, that is also a way forward.

Cheers, hope that helps Lucas

adriaaula commented 10 months ago

Hi Lucas,

Yes, given that I subset the dataset this creates the sparsity. I am sticking with KRD thanks a ton :)