matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees
https://matsengrp.github.io/gctree
GNU General Public License v3.0

Limit of abundance for each sequence #39

Closed. NikaAb closed this issue 5 years ago

NikaAb commented 5 years ago

Dear Mr. DeWitt, I would like to ask whether there is a limit on the abundance of each sequence. I am having difficulty running GCtree on a dataset containing few sequences (10) with a wide range of abundances (from 30 up to 1000). Thanks!

wsdewitt commented 5 years ago

1000 is quite a high cellular abundance for a single sequence, so it's possible the likelihood takes a while to compute. Is the problem that gctree is taking too long, or is it raising an error?

NikaAb commented 5 years ago

Yes, it is a high cellular abundance, but it is common in the dominant clone of some patients with chronic lymphocytic leukemia.

I get this error message:

FloatingPointError: underflow encountered in double_scalars
scons: *** [absolut_count/gctree.inference.parsimony_forest.p] Error 1
scons: building terminated because of errors.

I wonder if we can use the proportion of each sequence in the population (relative count) instead of their absolute count.
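For context, the underflow quoted above is the kind of failure that appears when many small probabilities are multiplied directly. NumPy raises `FloatingPointError` for this only when its error handling is set to `raise`; whether gctree configures NumPy that way is an assumption here, not something stated in this thread. The sketch below uses only the Python standard library, where the same product silently becomes `0.0`, and contrasts it with summing logarithms. The function names and the probability values are hypothetical and are not taken from gctree.

```python
import math


def product_of_probabilities(probs):
    """Multiply probabilities directly; prone to underflow for long products."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def sum_of_log_probabilities(probs):
    """Accumulate the same quantity in log space, which stays finite."""
    return sum(math.log(p) for p in probs)


# Hypothetical values: 200 factors of 0.001, so the true product is 1e-600,
# far below the smallest positive double (about 5e-324).
probs = [1e-3] * 200

direct = product_of_probabilities(probs)
log_space = sum_of_log_probabilities(probs)

print(direct)      # underflows to 0.0
print(log_space)   # finite, equal to 200 * log(0.001)
```

Working in log space is a common way to avoid this class of error, although the thread does not say which fix was applied in gctree.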

wsdewitt commented 5 years ago

Can you post the contents of absolut_count/gctree.inference.log? Relative abundances won't work because the underlying branching process models integer cell counts. I could also try to debug if you want to send along your input files.

NikaAb commented 5 years ago

These are the contents of the log file:

number of trees with integer branch lengths: 2
2 trees exhibit unobserved unifurcation from root. Adding psuedocounts to these roots

I am sending you the input file as well: V37401_J502.txt

I also have other cases with much higher absolute counts (10000 sequences). I tried dividing all the abundances by 100 and it worked, but I'm not sure the result is reliable. Moreover, this simplification is not always possible. Any thoughts on my naive solution? Thanks for your time!
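To make the concern about this workaround concrete, here is a minimal sketch of dividing integer abundances by a fixed factor. It is only an illustration of the approach described above, not a gctree feature. The sequence names, the counts, and the choice to floor at 1 (so that no observed sequence is dropped) are all assumptions made for the example.

```python
def downscale_abundances(abundances, factor):
    """Integer-divide each abundance by `factor`, keeping every count >= 1.

    Counts must remain integers because the branching process models
    integer cell counts.
    """
    return {name: max(1, count // factor) for name, count in abundances.items()}


# Hypothetical abundances spanning the range mentioned in this thread.
abundances = {"seq_a": 1000, "seq_b": 260, "seq_c": 30}

scaled = downscale_abundances(abundances, 100)
print(scaled)

# Ratio between the most and least abundant sequence, before and after.
ratio_before = abundances["seq_a"] / abundances["seq_c"]
ratio_after = scaled["seq_a"] / scaled["seq_c"]
print(ratio_before, ratio_after)
```

In this example the ratio between the largest and smallest clone shrinks from about 33 to 10, because small counts are clamped to 1. Distortion of this kind is one reason the downscaled result may not match an analysis of the original counts.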

wsdewitt commented 5 years ago

I've pushed a few major changes and a new release of GCtree motivated by this challenging case. This input data now runs without error (although it takes several minutes). If you git pull from master you will get the most up-to-date code. You will have to build a new Conda environment because the dependencies have changed (see the updated README.md).

Additional notes: