Tree inference taking very long time - advice needed

fabio-t commented 1 year ago

Hi all,

I use gctree a lot for BCell clonal trees reconstruction, but have trouble with some of our trees as they take too long (weeks!). In the past using --quick seemed to help but I'm quite stuck on an urgent project and I'm seeking advice.

Attaching a tar.gz with the fasta files, the deduplicated .phyi file (I had to modify the deduplicate script a bit, because our data was already pre-aggregated.. I mentioned this in the past here https://github.com/matsengrp/gctree/pull/60).

Ignoring the deduplication step then, these are the commands I run:

mkconfig --quick f.phyi dnaml > dnaml.cfg
rm outtree outfile
phylip dnaml < dnaml.cfg 2>&1 | tee dnaml.log
gctree infer --verbose --root ${name:0:10} --frame 1 --idlabel outfile abund.csv

As you can see, to reduce time I use the --quick option, and also dnaml as dnapars seem to take ages with some of our clones. But I'm attaching here a case where even these options are taking too long.

What can I change further? Should I tweak the dnaml.cfg file manually? Some gctree option I'm not aware about?

Also, incidentally - do you think it's problematic to use dnaml instead of dnapars?

gctree_IGHV1-26_IGHJ2_42_729.1.tar.gz

wsdewitt commented 1 year ago

Hi Fabio,

I tried running the following commands using your data:

mkconfig --quick f.phyi dnapars > dnapars.cfg
dnapars < dnapars.cfg > dnapars.log
gctree infer --root V1-26_J2 --frame 1 --idlabel outfile abund.csv

These completed in several seconds, producing the tree below (note that only one parsimony tree was found using the --quick option). gctree out inference 1

It is preferable to use dnapars over dnaml, because gctree leverages abundance data to rank otherwise equally parsimonious trees. I believe dnaml will always produce a single tree, so the ranking procedure doesn't have much to work with.

fabio-t commented 1 year ago

Thanks, I'm now trying on a different machine (still pretty powerful) with your commands. How long did it take exactly? It's been already five minutes, but I'll leave it for an hour or more on this machine to see if it stops.

What version of dnapars and gctree are you using @WSDeWitt?

I'm on phylip 3.697 (ubuntu repos) and gctree 4.1.1 (from pip).

I always get these warnings, just thought I'd mention it in case there's some dependency with a wrong version:

/home/fabio/.local/lib/python3.10/site-packages/gctree/branching_processes.py:1694: UserWarning: Some observed sequences are ambiguous. A disambiguation consistent with each dnapars tree will be chosen arbitrarily. Many alternative disambiguated leaf sequences may be possible.
  warnings.warn(
/home/fabio/.local/lib/python3.10/site-packages/historydag/utils.py:991: UserWarning: `utils.sequence_resolutions_count` deprecated. Use the `get_sequence_resolution_count_func` method from an appropriate `parsimony_utils.AmbiguityMap` object`
  warn(message)
/home/fabio/.local/lib/python3.10/site-packages/gctree/branching_processes.py:1791: UserWarning: Parsimony trees have too many ambiguities for disambiguation in all possible ways. Disambiguating trees individually. Gctree may find fewer parsimony trees.
  warnings.warn(
/home/fabio/.local/lib/python3.10/site-packages/historydag/utils.py:991: UserWarning: `utils.sequence_resolutions` deprecated. Use the `get_sequence_resolution_func` method from an appropriate `parsimony_utils.AmbiguityMap` object`
  warn(message)

wsdewitt commented 1 year ago

On macos with phylip 3.695 and gctree 4.1.1, the dnapars step took 6 seconds, and the gctree step took 45 seconds. Is your runtime issue with dnapars or with gctree?

The ambiguity warnings are probably due to the presence of - and N characters in your alignment. I'm not so sure about the historydag warning (but maybe @willdumm can say if this indicates a problem).

fabio-t commented 1 year ago

The issue is with gctree (infer command). It's now been running for almost 3 hours. These are both Linux machines.. I will try on a macbook.

This is quite weird and merits an investigation, because it's two different machines altogether.. anyone in your group can test it on Linux?

fabio-t commented 1 year ago

output files.zip

can you please check if the outtree/outfile (but also the cfg file) I got from dnapars are the same as those you got?

To understand if it's a phylip or a gctree issue, really

wsdewitt commented 1 year ago

Here are some diffs indicating the dnapars results are the same:

outfile:

2c2
< DNA parsimony algorithm, version 3.695
---
> DNA parsimony algorithm, version 3.697

outtree: identical
dnapars.cfg: identical except file path

I agree, this merits some more digging. Thanks for bringing it to our attention.

fabio-t commented 1 year ago

Thank you for a fast response. Let me know if I can help in any way

willdumm commented 1 year ago

I can reproduce this issue, using Will's commands, on our linux server. This is a result of a mistake I made when deprecating the sequence_resolutions function in the historydag package which changed its behavior. So incidentally it was related to the warning you were seeing, although I intended for that warning to be harmless. I suspect gctree infer completed on Will's machine because he was using an earlier version of the historydag package.

- characters were being treated as N characters, which is inconsistent with gctree's interpretation of - as a fifth character. This is fixed in #118. It could also be fixed by downgrading to an earlier version of historydag, although I think it's better to just install the new version once we merge the PR.

Running the gctree infer command on Fabio's data still takes about a minute on our server, but that's because it takes awhile to fit branching process parameters on this tree.

Also, as Will said gctree isn't really doing anything (other than fitting branching process parameters) unless there are multiple trees to rank, so if possible it would be ideal to use dnapars without the --quick option, now that gctree infer should function as expected.

wsdewitt commented 1 year ago

Thanks for the quick fix, @willdumm. I've merged #118 and cut release v4.1.2 (now on PyPI).

@fabio-t, can you upgrade and rerun on your end?

BTW I noticed there are a couple runs of gap characters on almost all sequences due to apparent indels affecting only a couple sequences. It might be worth trying to remove these columns (and perhaps the rows with indels) to see if that improves dnapars runtime.

fabio-t commented 1 year ago

Thanks a lot @WSDeWitt and @willdumm! Seems to work fine, but will report once all the experiments are re-run (about half have completed today). It's not stuck anymore.

Will have to investigate ambiguity removal to see if I can reduce the time further (and so also maybe drop the --quick parameter).

fabio-t commented 9 months ago

Forgot to check back.. but yes, as you may have surmised, this worked perfectly fine. Thanks for the great support :pray:

matsengrp / gctree

Tree inference taking very long time - advice needed #117