Limit of abundance for each sequence

NikaAb commented 5 years ago

Dear Mr. DeWitt, I want to ask you about the limit on the abundance of each sequence. I am having difficulty running GCtree on my dataset, which contains few sequences (10) with a wide range of abundances (from 30 up to 1000). Thanks!

wsdewitt commented 5 years ago

1000 is quite a high cellular abundance for a single sequence, so it's possible the likelihood takes a while to compute. Is the problem that gctree is taking too long, or is it issuing an error?

NikaAb commented 5 years ago

Yes, it is a high cellular abundance, but it is common in the dominant clone of some patients with chronic lymphocytic leukemia.

I get this error message:

FloatingPointError: underflow encountered in double_scalars
scons: *** [absolut_count/gctree.inference.parsimony_forest.p] Error 1
scons: building terminated because of errors.

I wonder if we can use the proportion of each sequence in the population (relative count) instead of their absolute count.
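For context on the error above: an underflow occurs when a floating-point result is too small to represent as a double, which can happen when many small probabilities are multiplied together. The sketch below is not gctree's actual likelihood code; the function names and the values `p` and `n` are hypothetical and only illustrate the general effect, together with the common workaround of accumulating in log space. Plain Python floats silently underflow to 0.0, whereas NumPy can be configured to raise a FloatingPointError instead.

```python
import math


def likelihood_product(p, n):
    """Multiply a per-cell probability p together n times (naive approach)."""
    result = 1.0
    for _ in range(n):
        result *= p
    return result


def log_likelihood_sum(p, n):
    """Accumulate the same quantity in log space, which stays finite."""
    return n * math.log(p)


# Hypothetical values: 1000 cells, each contributing a factor of 1e-3.
naive = likelihood_product(1e-3, 1000)   # true value 1e-3000 is below double range
stable = log_likelihood_sum(1e-3, 1000)  # finite log-likelihood

print(naive)   # the product has underflowed
print(stable)
```

The log-space value can still be compared between trees, since the logarithm preserves ordering.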

wsdewitt commented 5 years ago

Can you post the contents of absolut_count/gctree.inference.log? Relative abundances won't work because the underlying branching process models integer cell counts. I could also try to debug if you want to send along your input files.

NikaAb commented 5 years ago

These are the contents of the log file:

number of trees with integer branch lengths: 2
2 trees exhibit unobserved unifurcation from root. Adding psuedocounts to these roots

I am sending you the input file as well: V37401_J502.txt

I also have other cases with much higher absolute counts (10000 sequences). I tried dividing all the abundances by 100 and it worked, but I'm not sure whether the result is reliable. Moreover, it is not always possible to make this simplification. Any thoughts on my naive solution? Thanks for your time!

wsdewitt commented 5 years ago

I've pushed a few major changes and a new release of GCtree motivated by this challenging case. This input data now runs without error (although it takes several minutes). If you git pull from master you will get the most up-to-date code. You will have to build a new Conda environment because the dependencies have changed (see the updated README.md).

Additional notes:

Phylip's dnapars produced only two nearly identical parsimony trees for these data, and both trees have the same GCtree likelihood. So in the end GCtree isn't really helping rank trees for this case.
Trees with 10,000 sequences (which you say you have) should work in theory, but may take quite a long time.
The releases gctree-classic and the new gctree-liaton can be found here.
Your idea of dividing all the abundances will make GCtree run faster, but it is not super well motivated. Perhaps a more sound alternative would be to downsample the counts (simulating imperfect sampling of the lineage). Note that you may have fewer unique sequences after doing so.
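The downsampling suggestion in the last note could be sketched as follows. This is a minimal illustration, not part of gctree: the function `downsample_abundances`, the sequence IDs, and the counts are all hypothetical. It draws a fixed number of cells without replacement from the pooled population, so the resulting abundances remain integers, and sequences that receive no sampled cells drop out.

```python
import random
from collections import Counter


def downsample_abundances(abundances, n_cells, seed=None):
    """Sample n_cells cells without replacement and recount per sequence.

    abundances: dict mapping sequence ID -> integer cell count.
    Returns a dict of downsampled integer counts; sequences with no
    sampled cells are omitted.
    """
    total = sum(abundances.values())
    if not 0 < n_cells <= total:
        raise ValueError("n_cells must be between 1 and the total cell count")
    rng = random.Random(seed)
    # Expand to one entry per cell, labelled by its sequence ID.
    cells = [seq_id for seq_id, count in abundances.items() for _ in range(count)]
    # Sampling without replacement mimics imperfect sampling of the lineage.
    return dict(Counter(rng.sample(cells, n_cells)))


# Hypothetical abundances spanning the range discussed in this thread.
abundances = {"seq1": 1000, "seq2": 250, "seq3": 30}
down = downsample_abundances(abundances, n_cells=100, seed=0)
print(down)
```

Unlike dividing every count by a constant, this keeps the counts as integers, which the branching process model requires, and low-abundance sequences can disappear entirely.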

matsengrp / gctree

Limit of abundance for each sequence #39