Closed NikaAb closed 5 years ago
1000 is quite a high cellular abundance for a single sequence, so it's possible the likelihood takes a while to compute. Is the problem that gctree is taking too long, or is it issuing an error?
Yes, it is a high cellular abundance, but it is common in the dominant clone of some patients with chronic lymphocytic leukemia.
I get this error message:
FloatingPointError: underflow encountered in double_scalars
scons: *** [absolut_count/gctree.inference.parsimony_forest.p] Error 1
scons: building terminated because of errors.
I wonder if we can use the proportion of each sequence in the population (relative count) instead of their absolute count.
Can you post the contents of absolut_count/gctree.inference.log? Relative abundances won't work because the underlying branching process models integer cell counts. I could also try to debug if you want to send along your input files.
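For readers unfamiliar with this kind of error: a floating-point underflow typically arises when many small probabilities are multiplied together, and NumPy raises FloatingPointError for it when configured with np.seterr(under='raise'). The sketch below is not GCtree's code; it is a generic illustration, with hypothetical function names and made-up probabilities, of how a direct product collapses to zero while the equivalent sum of logarithms stays finite.

```python
import math

def likelihood_product(probs):
    """Multiply probabilities directly; prone to underflow for long products."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_likelihood(probs):
    """Sum log-probabilities instead; the result stays in representable range."""
    return sum(math.log(p) for p in probs)

# Hypothetical input: 1000 independent factors, each with probability 0.1.
# The true product, 1e-1000, is far below the smallest positive double.
probs = [0.1] * 1000

direct = likelihood_product(probs)  # silently underflows to 0.0 in plain Python
logged = log_likelihood(probs)      # finite, roughly 1000 * log(0.1)

print(direct)
print(logged)
```

Working in log space is a common remedy for this class of problem; whether it applies to the computation that failed here depends on where in the likelihood calculation the underflow occurred.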
These are the contents of the log file:
number of trees with integer branch lengths: 2
2 trees exhibit unobserved unifurcation from root. Adding psuedocounts to these roots
I am sending you the input file as well: V37401_J502.txt
I also have other cases with a much higher absolute count (10000 sequences). I tried dividing all the abundances by 100 and it worked, but I'm not sure the result is reliable. Moreover, this simplification is not always possible. Any thoughts on my naive solution? Thanks for your time!
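One possible reason the division trick is not always applicable, sketched here under assumptions: the model needs integer cell counts, so scaled abundances have to be rounded, and sequences with low abundance may round to zero. The sequence names and counts below are invented for illustration, and scale_counts is a hypothetical helper, not part of GCtree.

```python
def scale_counts(counts, factor):
    """Divide integer abundances by factor and round to the nearest integer."""
    return {seq: round(n / factor) for seq, n in counts.items()}

# Hypothetical abundances spanning several orders of magnitude.
counts = {"seq_A": 10000, "seq_B": 300, "seq_C": 30, "seq_D": 1}

scaled = scale_counts(counts, 100)

# Sequences whose scaled abundance rounds to zero no longer count as observed.
lost = [seq for seq, n in scaled.items() if n == 0]

print(scaled)
print(lost)
```

In this made-up example the two rarest sequences vanish after scaling, which changes the set of observed genotypes and not just their weights.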
I've pushed a few major changes and a new release of GCtree motivated by this challenging case. This input data now runs without error (although it takes several minutes). If you git pull from master you will get the most up-to-date code. You will have to build a new Conda environment because dependencies have changed (see the updated README.md).
Additional notes:
dnapars resulted in only two nearly identical parsimony trees for these data, and both trees have the same GCtree likelihood. So in the end GCtree isn't really helping rank trees for this case.
Dear Mr. DeWitt, I want to ask you about the limit on the abundance of each sequence. I am having difficulty running GCtree on my dataset, which contains few sequences (10) with a wide range of abundances (from 30 up to 1000). Thanks!