Closed chaoszhang closed 5 months ago
Thanks for letting us know about this! I can reproduce the error, but I'm not sure what's causing it yet.
@chaoszhang I took a closer look, and I understand what was going on here now. I see you were using your own script to write the file abundances.csv
, and the sample IDs in this file didn't match the sample IDs in deduplicated.phylip
. While this could have worked in theory, there's no reason to write your own abundances.csv
file. This is intended to be done with the deduplicate
utility.
The idea is that the input alignment (gctree.fasta
) should have raw data, with possibly duplicated sequences. deduplicate
then identifies the set of unique sequences in the input alignment, assigns these unique sequences unique names like seq1
, and optionally writes an abundance file like abundances.csv
mapping new unique sequence names to the number of times that sequence showed up in the input alignment.
Your issue can be fixed by changing your call to deduplicate
in runGctree.sh
. In addition to specifying an --abundance_file
argument, I've also specified a --idmapfile
argument, which writes a mapping of new unique sequence IDs to old sequence IDs in the input alignment.
deduplicate gctree.fasta --root root --abundance_file abundances.csv --idmapfile idmap.txt > deduplicated.phylip
mkconfig deduplicated.phylip dnapars > dnapars.cfg
dnapars < dnapars.cfg > dnapars.log
export QT_QPA_PLATFORM=offscreen
export XDG_RUNTIME_DIR=/tmp/runtime-runner
export MPLBACKEND=agg
gctree infer outfile abundances.csv 2>&1 | tee gctree.log
This is described in the Quickstart, but I'll make a note to be more clear about where abundances.csv
comes from in future versions of the docs, and/or add a more informative error/warning when sequence IDs don't match what's written to abundances.csv
.
You may want to consider passing the --verbose
flag to gctree
to see more information about ranking. It looks like the trees found by dnapars
admit many possible maximally parsimonious choices of ancestral sequences, so while only 14 tree topologies are found, there are over 30k labeled trees. Using branching process likelihood to rank these trees results in 20k top-ranked trees with 10 different topologies. This is not unexpected, since all the sequences in your input alignment have an abundance of 1, and the likelihood used by gctree uses unequal observed sequence abundances to choose between trees. Without more informative abundance information, gctree is essentially just reporting (almost) all the trees found by PHYLIP's dnapars
.
Again, thanks for reaching out about this. Let me know if you have any more questions. Also, would you mind describing your use-case?
Cool, I will try this script!
Sorry for unable to produce a minimum example this time, but I failed to run gctree infer on 2 out of 50 inputs with 100 sequences. Here are the necessary files to reproduce this issue incuding outfiles: https://drive.google.com/file/d/1Ubyy9YobNwP397WjJyOKw5800dxo48r_/view?usp=sharing
I was able to reproduce what I assume was your issue, I saw errors while parsing the outfile
produced by dnapars in both cases.
Looking at those outfiles I noticed that in both cases dnapars appears to have stopped writing to the files before it was done describing the last tree. I have no idea why this would happen, it could be a bug with dnapars or an issue with your system. Both files contain more than 9000 trees, so the files are about 16 million lines each. I'm sure that has something to do with it.
I was able to fix this by manually editing the outfiles to remove the last (badly-formed) tree. Luckily the outfile format is (kind of) human readable. For example the outfile in the directory labeled 3
looks like this at the end:
10 seq14 yes GAGGTGCGAC TGGTGGAGTC TGGGCCTGAG GTTAGAAAGC
3 11 no GAGGTGCAAC TGGTGGAGTC TGGGCCTGAG GTGAAAAAGC
11 seq20 no GAGGTGCAAC TGGTGGAGTC TGGGCCTGAG GTGAAAAAGC
11 seq16 no GAGGTGCAAC TGGTGGAGTC TGGGCCTGAG GTGAAAAAGC
11 9 no GAGGTGCAAC TGGTGGAGTC TGGGCCTGAG GTGAAAAAGC
9 seq13 no GAGGTGCAAC TGGTGGAGTC TGGGCCTGAG GTG
But that last line should be the same length as all the others, since all the sequences have the same length.
The outfile is broken into sections, one for each tree. The description of each tree starts with an ascii drawing like this:
+-seq98
+-68
| | +seq97
| +-67
| | +seq95
| +-65
| | +seq63
| +-41
| +seq62
|
| +-seq96
| |
| | +-seq87
| | +-58
1-66 | | +seq88
| | | +-59
| | | +seq86
This is followed by the description of the tree structure, including internal sequences, which starts with lines that look like this (following the end of the ascii art tree):
| | +seq2
| +--2
| +seq1
|
+root
requires a total of 568.000
between and length
------- --- ------
1 68 0.005420
68 seq98 0.025745
68 67 0.010840
67 seq97 0.017615
67 65 0.021680
You can fix these outfiles by finding the last tree section in the file (e.g. by searching for requires a total
and finding the last match), and deleting everything after the beginning of the last ascii art tree.
After modifying and saving both outfiles, gctree inference ran correctly.
Let me know if you have any questions.
If you'd like, I can send you the modified outfiles that I made by email.
Marking this as a to-do for the next release -- have gctree automatically truncate malformed outfiles as needed, or produce an error.
Fixed in #113
Hi,
I received the following error when running gctree: Traceback (most recent call last): File "/home/chaos/.local/bin/gctree", line 8, in
sys.exit(main())
File "/home/chaos/.local/lib/python3.8/site-packages/gctree/cli.py", line 734, in main
args.func(args)
File "/home/chaos/.local/lib/python3.8/site-packages/gctree/cli.py", line 176, in infer
forest.mle(marginal=True)
File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1095, in mle
self.parameters = _mle_helper(self.ll, kwargs)
File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1600, in _mle_helper
grad_check = sco.check_grad(lambda x: f(x)[0], lambda x: f(x)[1], x_0)
File "/home/chaos/.local/lib/python3.8/site-packages/scipy/optimize/_optimize.py", line 1050, in check_grad
analytical_grad = grad(x0, args)
File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1600, in
grad_check = sco.check_grad(lambda x: f(x)[0], lambda x: f(x)[1], x_0)
File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1598, in f
return tuple(-y for y in ll( x, kwargs))
File "/usr/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1048, in ll
terms = [
File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1049, in
[_lltree(cmcounts, p, q), count] for cmcounts, count in self._cm_countlist
File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1636, in _lltree
logf_data = [CollapsedTree._ll_genotype(c, m, p, q) for c, m in cm_list]
File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 1636, in
logf_data = [CollapsedTree._ll_genotype(c, m, p, q) for c, m in cm_list]
File "/home/chaos/.local/lib/python3.8/site-packages/gctree/branching_processes.py", line 308, in _ll_genotype
raise ValueError("Zero likelihood event")
ValueError: Zero likelihood event
Do you have some clues about what I did wrong to cause this? I installed gctree and phylip using miniconda. By contrast, when I install gctree using pip on a server, I got: Traceback (most recent call last): File "/home/chz069/.local/bin/gctree", line 8, in
sys.exit(main())
File "/home/chz069/.local/lib/python3.8/site-packages/gctree/cli.py", line 734, in main
args.func(args)
File "/home/chz069/.local/lib/python3.8/site-packages/gctree/cli.py", line 169, in infer
pp.parse_outfile(args.infiles[0], args.infiles[1], args.root)
File "/home/chz069/.local/lib/python3.8/site-packages/gctree/phylip_parse.py", line 131, in parse_outfile
trees.append(build_tree(sequences, parents, counts, root))
File "/home/chz069/.local/lib/python3.8/site-packages/gctree/phylip_parse.py", line 259, in build_tree
root_parent.children[0].dist = gctree.utils.hamming_distance(
File "/home/chz069/.local/lib/python3.8/site-packages/gctree/utils.py", line 19, in new_distance
raise ValueError(
ValueError: sequences must have equal length, got 738 and 369
For reproducibility, I have the input, code, and temperary files attached. gctree.zip