matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees
https://matsengrp.github.io/gctree
GNU General Public License v3.0
16 stars 2 forks source link

ValueError: Zero likelihood event #105

Closed chaoszhang closed 5 months ago

chaoszhang commented 2 years ago

Hi,

I received the following error when running gctree: Traceback (most recent call last): File "/home/chaos/.local/bin/gctree", line 8, in sys.exit(main()) File "/home/chaos/.local/lib/python3.8/site-packages/gctree/cli.py", line 734, in main args.func(args) File "/home/chaos/.local/lib/python3.8/site-packages/gctree/cli.py", line 176, in infer forest.mle(marginal=True) File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1095, in mle self.parameters = _mle_helper(self.ll, kwargs) File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1600, in _mle_helper grad_check = sco.check_grad(lambda x: f(x)[0], lambda x: f(x)[1], x_0) File "/home/chaos/.local/lib/python3.8/site-packages/scipy/optimize/_optimize.py", line 1050, in check_grad analytical_grad = grad(x0, args) File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1600, in grad_check = sco.check_grad(lambda x: f(x)[0], lambda x: f(x)[1], x_0) File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1598, in f return tuple(-y for y in ll(x, kwargs)) File "/usr/lib/python3.8/contextlib.py", line 75, in inner return func(*args, **kwds) File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1048, in ll terms = [ File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1049, in [_lltree(cmcounts, p, q), count] for cmcounts, count in self._cm_countlist File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1636, in _lltree logf_data = [CollapsedTree._ll_genotype(c, m, p, q) for c, m in cm_list] File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1636, in logf_data = [CollapsedTree._ll_genotype(c, m, p, q) for c, m in cm_list] File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 308, in _ll_genotype raise ValueError("Zero likelihood event") ValueError: Zero likelihood event

Do you have some clues about what I did wrong to cause this? I installed gctree and phylip using miniconda. By contrast, when I install gctree using pip on a server, I got: Traceback (most recent call last): File "/home/chz069/.local/bin/gctree", line 8, in sys.exit(main()) File "/home/chz069/.local/lib/python3.8/site-packages/gctree/cli.py", line 734, in main args.func(args) File "/home/chz069/.local/lib/python3.8/site-packages/gctree/cli.py", line 169, in infer pp.parse_outfile(args.infiles[0], args.infiles[1], args.root) File "/home/chz069/.local/lib/python3.8/site-packages/gctree/phylip_parse.py", line 131, in parse_outfile trees.append(build_tree(sequences, parents, counts, root)) File "/home/chz069/.local/lib/python3.8/site-packages/gctree/phylip_parse.py", line 259, in build_tree root_parent.children[0].dist = gctree.utils.hamming_distance( File "/home/chz069/.local/lib/python3.8/site-packages/gctree/utils.py", line 19, in new_distance raise ValueError( ValueError: sequences must have equal length, got 738 and 369

For reproducibility, I have the input, code, and temperary files attached. gctree.zip

willdumm commented 2 years ago

Thanks for letting us know about this! I can reproduce the error, but I'm not sure what's causing it yet.

willdumm commented 2 years ago

@chaoszhang I took a closer look, and I understand what was going on here now. I see you were using your own script to write the file abundances.csv, and the sample IDs in this file didn't match the sample IDs in deduplicated.phylip. While this could have worked in theory, there's no reason to write your own abundances.csv file. This is intended to be done with the deduplicate utility.

The idea is that the input alignment (gctree.fasta) should have raw data, with possibly duplicated sequences. deduplicate then identifies the set of unique sequences in the input alignment, assigns these unique sequences unique names like seq1, and optionally writes an abundance file like abundances.csv mapping new unique sequence names to the number of times that sequence showed up in the input alignment.

Your issue can be fixed by changing your call to deduplicate in runGctree.sh. In addition to specifying an --abundance_file argument, I've also specified a --idmapfile argument, which writes a mapping of new unique sequence IDs to old sequence IDs in the input alignment.

deduplicate gctree.fasta --root root --abundance_file abundances.csv --idmapfile idmap.txt > deduplicated.phylip
mkconfig deduplicated.phylip dnapars > dnapars.cfg
dnapars < dnapars.cfg > dnapars.log
export QT_QPA_PLATFORM=offscreen
export XDG_RUNTIME_DIR=/tmp/runtime-runner
export MPLBACKEND=agg
gctree infer outfile abundances.csv 2>&1 | tee gctree.log

This is described in the Quickstart, but I'll make a note to be more clear about where abundances.csv comes from in future versions of the docs, and/or add a more informative error/warning when sequence IDs don't match what's written to abundances.csv.

You may want to consider passing the --verbose flag to gctree to see more information about ranking. It looks like the trees found by dnapars admit many possible maximally parsimonious choices of ancestral sequences, so while only 14 tree topologies are found, there are over 30k labeled trees. Using branching process likelihood to rank these trees results in 20k top-ranked trees with 10 different topologies. This is not unexpected, since all the sequences in your input alignment have an abundance of 1, and the likelihood used by gctree uses unequal observed sequence abundances to choose between trees. Without more informative abundance information, gctree is essentially just reporting (almost) all the trees found by PHYLIP's dnapars.

Again, thanks for reaching out about this. Let me know if you have any more questions. Also, would you mind describing your use-case?

chaoszhang commented 2 years ago

Cool, I will try this script!

chaoszhang commented 2 years ago

Sorry for unable to produce a minimum example this time, but I failed to run gctree infer on 2 out of 50 inputs with 100 sequences. Here are the necessary files to reproduce this issue incuding outfiles: https://drive.google.com/file/d/1Ubyy9YobNwP397WjJyOKw5800dxo48r_/view?usp=sharing

willdumm commented 2 years ago

I was able to reproduce what I assume was your issue, I saw errors while parsing the outfile produced by dnapars in both cases.

Looking at those outfiles I noticed that in both cases dnapars appears to have stopped writing to the files before it was done describing the last tree. I have no idea why this would happen, it could be a bug with dnapars or an issue with your system. Both files contain more than 9000 trees, so the files are about 16 million lines each. I'm sure that has something to do with it.

I was able to fix this by manually editing the outfiles to remove the last (badly-formed) tree. Luckily the outfile format is (kind of) human readable. For example the outfile in the directory labeled 3 looks like this at the end:

  10   seq14        yes    GAGGTGCGAC TGGTGGAGTC TGGGCCTGAG GTTAGAAAGC 
   3     11         no     GAGGTGCAAC TGGTGGAGTC TGGGCCTGAG GTGAAAAAGC 
  11   seq20        no     GAGGTGCAAC TGGTGGAGTC TGGGCCTGAG GTGAAAAAGC 
  11   seq16        no     GAGGTGCAAC TGGTGGAGTC TGGGCCTGAG GTGAAAAAGC 
  11      9         no     GAGGTGCAAC TGGTGGAGTC TGGGCCTGAG GTGAAAAAGC 
   9   seq13        no     GAGGTGCAAC TGGTGGAGTC TGGGCCTGAG GTG

But that last line should be the same length as all the others, since all the sequences have the same length.

The outfile is broken into sections, one for each tree. The description of each tree starts with an ascii drawing like this:

     +-seq98     
  +-68  
  |  |  +seq97     
  |  +-67  
  |     |  +seq95     
  |     +-65  
  |        |  +seq63     
  |        +-41  
  |           +seq62     
  |  
  |  +-seq96     
  |  |  
  |  |     +-seq87     
  |  |  +-58  
  1-66  |  |  +seq88     
  |  |  |  +-59  
  |  |  |     +seq86

This is followed by the description of the tree structure, including internal sequences, which starts with lines that look like this (following the end of the ascii art tree):

  |                                                                 |  +seq2      
  |                                                                 +--2  
  |                                                                    +seq1      
  |  
  +root      

requires a total of    568.000

  between      and       length
  -------      ---       ------
     1          68       0.005420
    68      seq98        0.025745
    68          67       0.010840
    67      seq97        0.017615
    67          65       0.021680

You can fix these outfiles by finding the last tree section in the file (e.g. by searching for requires a total and finding the last match), and deleting everything after the beginning of the last ascii art tree.

After modifying and saving both outfiles, gctree inference ran correctly.

Let me know if you have any questions.

If you'd like, I can send you the modified outfiles that I made by email.

willdumm commented 1 year ago

Marking this as a to-do for the next release -- have gctree automatically truncate malformed outfiles as needed, or produce an error.

willdumm commented 5 months ago

Fixed in #113