Closed by fabio-t 9 months ago
Hi Fabio,
I tried running the following commands using your data:
mkconfig --quick f.phyi dnapars > dnapars.cfg
dnapars < dnapars.cfg > dnapars.log
gctree infer --root V1-26_J2 --frame 1 --idlabel outfile abund.csv
These completed in several seconds, producing the tree below (note that only one parsimony tree was found using the `--quick` option).
It is preferable to use `dnapars` over `dnaml`, because `gctree` leverages abundance data to rank otherwise equally parsimonious trees. I believe `dnaml` will always produce a single tree, so the ranking procedure doesn't have much to work with.
Thanks, I'm now trying on a different machine (still pretty powerful) with your commands. How long did it take exactly? It's been already five minutes, but I'll leave it for an hour or more on this machine to see if it stops.
What version of dnapars and gctree are you using @WSDeWitt?
I'm on phylip 3.697 (ubuntu repos) and gctree 4.1.1 (from pip).
I always get these warnings, just thought I'd mention it in case there's some dependency with a wrong version:
/home/fabio/.local/lib/python3.10/site-packages/gctree/branching_processes.py:1694: UserWarning: Some observed sequences are ambiguous. A disambiguation consistent with each dnapars tree will be chosen arbitrarily. Many alternative disambiguated leaf sequences may be possible.
warnings.warn(
/home/fabio/.local/lib/python3.10/site-packages/historydag/utils.py:991: UserWarning: `utils.sequence_resolutions_count` deprecated. Use the `get_sequence_resolution_count_func` method from an appropriate `parsimony_utils.AmbiguityMap` object`
warn(message)
/home/fabio/.local/lib/python3.10/site-packages/gctree/branching_processes.py:1791: UserWarning: Parsimony trees have too many ambiguities for disambiguation in all possible ways. Disambiguating trees individually. Gctree may find fewer parsimony trees.
warnings.warn(
/home/fabio/.local/lib/python3.10/site-packages/historydag/utils.py:991: UserWarning: `utils.sequence_resolutions` deprecated. Use the `get_sequence_resolution_func` method from an appropriate `parsimony_utils.AmbiguityMap` object`
warn(message)
On macOS with phylip 3.695 and gctree 4.1.1, the dnapars step took 6 seconds, and the gctree step took 45 seconds. Is your runtime issue with dnapars or with gctree?
The ambiguity warnings are probably due to the presence of `-` and `N` characters in your alignment. I'm not so sure about the historydag warning (but maybe @willdumm can say if this indicates a problem).
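To see how many `-` and `N` characters each sequence carries, a quick count can help. This is a minimal sketch, assuming the alignment has already been loaded into a dict of name to sequence; `count_ambiguous` is a hypothetical helper and not part of gctree or phylip.

```python
from collections import Counter


def count_ambiguous(seqs, chars="-N"):
    """Return {name: Counter} for every sequence containing any of `chars`."""
    report = {}
    for name, seq in seqs.items():
        counts = Counter(c for c in seq.upper() if c in chars)
        if counts:
            report[name] = counts
    return report


# Toy data for illustration only (not taken from the attached alignment).
toy = {"seq1": "ACGT--AC", "seq2": "ACGTNNAC", "seq3": "ACGTACAC"}
for name, counts in count_ambiguous(toy).items():
    print(name, dict(counts))
```

Sequences without any gap or `N` are left out of the report, so an empty result means the alignment is free of both characters.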
The issue is with gctree (infer command). It's now been running for almost 3 hours. These are both Linux machines.. I will try on a macbook.
This is quite weird and merits an investigation, because it's two different machines altogether.. anyone in your group can test it on Linux?
can you please check if the outtree/outfile (but also the cfg file) I got from dnapars are the same as those you got?
To understand if it's a phylip or a gctree issue, really
Here are some diffs indicating the dnapars results are the same:
`outfile`:
2c2
< DNA parsimony algorithm, version 3.695
---
> DNA parsimony algorithm, version 3.697
`outtree`: identical
`dnapars.cfg`: identical except file path
I agree, this merits some more digging. Thanks for bringing it to our attention.
Thank you for a fast response. Let me know if I can help in any way
I can reproduce this issue, using Will's commands, on our Linux server. This is a result of a mistake I made when deprecating the `sequence_resolutions` function in the `historydag` package, which changed its behavior. So incidentally it was related to the warning you were seeing, although I intended for that warning to be harmless. I suspect `gctree infer` completed on Will's machine because he was using an earlier version of the `historydag` package.
`-` characters were being treated as `N` characters, which is inconsistent with gctree's interpretation of `-` as a fifth character. This is fixed in #118. It could also be fixed by downgrading to an earlier version of historydag, although I think it's better to just install the new version once we merge the PR.
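A toy calculation may help show why the two interpretations of `-` differ so much. This is only an illustrative sketch, not historydag's implementation: it assumes the alphabet is limited to A/C/G/T, `N`, and `-`, and counts how many fully resolved sequences one ambiguous sequence can stand for when `-` is expanded like `N` versus kept as a single fifth character.

```python
from math import prod

# Number of concrete characters each symbol can stand for.
# Assumption: only A/C/G/T, N and '-' occur (no other IUPAC codes).
GAP_AS_N = {"A": 1, "C": 1, "G": 1, "T": 1, "N": 4, "-": 4}
GAP_AS_FIFTH = {"A": 1, "C": 1, "G": 1, "T": 1, "N": 4, "-": 1}


def resolution_count(seq, options):
    """Multiply the per-site option counts to get the number of resolutions."""
    return prod(options[c] for c in seq.upper())


# Toy sequence: a run of 10 gaps plus a single N.
seq = "ACGT" + "-" * 10 + "ACGN"
print(resolution_count(seq, GAP_AS_N))      # 4**11 = 4194304
print(resolution_count(seq, GAP_AS_FIFTH))  # 4
```

Under these assumptions, every gap site multiplies the count by four when gaps are read as `N`, so long gap runs can make exhaustive disambiguation impractical.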
Running the `gctree infer` command on Fabio's data still takes about a minute on our server, but that's because it takes a while to fit branching process parameters on this tree.
Also, as Will said, gctree isn't really doing anything (other than fitting branching process parameters) unless there are multiple trees to rank, so if possible it would be ideal to use dnapars without the `--quick` option, now that `gctree infer` should function as expected.
Thanks for the quick fix, @willdumm. I've merged #118 and cut release v4.1.2 (now on PyPI).
@fabio-t, can you upgrade and rerun on your end?
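Before rerunning, it can be worth confirming which versions are actually installed in the active environment. A minimal sketch using only the standard library follows; `installed_version` and `at_least` are hypothetical helpers, the distribution names `gctree` and `historydag` are assumed to match the PyPI names, and the comparison handles only plain dotted numeric versions such as `4.1.2`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def at_least(installed, minimum):
    """Compare dotted numeric versions (no pre-release or dev suffix handling)."""
    def as_tuple(v):
        return tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(minimum)


for pkg in ("gctree", "historydag"):
    print(pkg, installed_version(pkg))
```

For example, `at_least(installed_version("gctree"), "4.1.2")` would indicate whether the release mentioned above is in place, provided gctree is installed.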
BTW I noticed there are a couple runs of gap characters on almost all sequences due to apparent indels affecting only a couple sequences. It might be worth trying to remove these columns (and perhaps the rows with indels) to see if that improves `dnapars` runtime.
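One way to try the column removal is sketched below. It assumes the alignment is held as a dict of name to equal-length sequence; `drop_gappy_columns` and its `max_gap_fraction` threshold are hypothetical and not part of gctree or phylip.

```python
def drop_gappy_columns(seqs, max_gap_fraction=0.5):
    """Remove columns where more than `max_gap_fraction` of rows hold '-'."""
    names = list(seqs)
    rows = [seqs[name] for name in names]
    if not rows:
        return {}
    if len({len(row) for row in rows}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    n_rows = len(rows)
    # Keep a column only if its gap fraction is at or below the threshold.
    keep = [
        i for i in range(len(rows[0]))
        if sum(row[i] == "-" for row in rows) / n_rows <= max_gap_fraction
    ]
    return {name: "".join(row[i] for i in keep) for name, row in zip(names, rows)}


# Toy alignment: only s3 carries the insertion that forces gaps elsewhere.
toy = {"s1": "AC--GT", "s2": "AC--GA", "s3": "ACTTGT"}
print(drop_gappy_columns(toy))
```

Sequences can become identical once columns are dropped (here `s1` and `s3`), so the alignment would likely need to be deduplicated again so that abundances stay consistent.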
Thanks a lot @WSDeWitt and @willdumm! Seems to work fine, but will report once all the experiments are re-run (about half have completed today). It's not stuck anymore.
Will have to investigate ambiguity removal to see if I can reduce the time further (and so also maybe drop the `--quick` parameter).
Forgot to check back.. but yes, as you may have surmised, this worked perfectly fine. Thanks for the great support :pray:
Hi all,
I use gctree a lot for B cell clonal tree reconstruction, but have trouble with some of our trees as they take too long (weeks!). In the past, using `--quick` seemed to help, but I'm quite stuck on an urgent project and I'm seeking advice.
Attaching a tar.gz with the fasta files, the deduplicated `.phyi` file (I had to modify the deduplicate script a bit, because our data was already pre-aggregated.. I mentioned this in the past here https://github.com/matsengrp/gctree/pull/60).
Ignoring the deduplication step then, these are the commands I run:
As you can see, to reduce time I use the `--quick` option, and also dnaml, as dnapars seems to take ages with some of our clones. But I'm attaching here a case where even these options are taking too long.
What can I change further? Should I tweak the dnaml.cfg file manually? Is there some gctree option I'm not aware of?
Also, incidentally - do you think it's problematic to use dnaml instead of dnapars?
gctree_IGHV1-26_IGHJ2_42_729.1.tar.gz