matsengrp / cft

Clonal family tree
5 stars 3 forks source link

Add pruned FastTree graphic #250

Closed lauradoepker closed 5 years ago

lauradoepker commented 5 years ago

It's helpful to view this FastTree phylogeny that depicts which sequences were chosen for the prune.py step for lineage analyses. This would also be helpful for the minadcl pruning too. I've pasted @dunleavy005 's SCons code from Ecgtheow that achieves this - I hope I grabbed the right part of the code to show this (?). Example .png of what I'm talking about (red leaves were kept, black leaves were pruned away):

pruned_cluster_fasttree

    @w.add_target()
    def pruned_cluster_seqids(outdir, c):
        pruned_cluster_seqids = env.Command(
            path.join(outdir, "pruned_cluster_seqids.txt"),
            c["cluster_fasttree"],
            "lib/cft/bin/prune.py --naive naive --seed " + options["seed"] + " $SOURCE -n " + str(c["nprune"]["value"]) + " --strategy seed_lineage $TARGET")
        env.Depends(pruned_cluster_seqids, "lib/cft/bin/prune.py")
        return pruned_cluster_seqids

    @w.add_target()
    def pruned_cluster_seqs(outdir, c):
        return env.Command(
            path.join(outdir, "pruned_cluster_seqs.fasta"),
            [c["pruned_cluster_seqids"], c["cluster_seqs"]],
            "seqmagick convert --include-from-file $SOURCES $TARGET")

    @w.add_target()
    def input_seqs(outdir, c):
        return env.Command(
            path.join(outdir, "input_seqs.fasta"),
            c["pruned_cluster_seqs"],
            "awk \'/^[^>]/ {gsub(\"N\", \"-\", $0)} {print}\' < $SOURCE > temp.fasta;" + \
            "seqmagick mogrify --squeeze temp.fasta;" + \
            "awk \'/^[^>]/ {gsub(\"-\", \"N\", $0)} {print}\' < temp.fasta > $TARGET;" + \
            "rm temp.fasta")

    @w.add_target()
    def pruned_cluster_fasttree_png(outdir, c):
        pruned_cluster_fasttree_png = env.Command(
            path.join(outdir, "pruned_cluster_fasttree.png"),
            [c["cluster_fasttree"], c["pruned_cluster_seqids"]],
            "xvfb-run -a python/annotate_fasttree_tree.py $SOURCES --naive naive --seed " + options["seed"] + " --output-path $TARGET")
        env.Depends(pruned_cluster_fasttree_png, "python/annotate_fasttree_tree.py")
        return pruned_cluster_fasttree_png
lauradoepker commented 5 years ago

In Ecgtheow, I am tweaking the nprune value around manually and checking the FastTrees. In CFT, a variety of nprunes may also be helpful (not just always 100), but am I right in assuming that all the ML trees would have to be pre-computed with just a handful of nprune values?

metasoarous commented 5 years ago

Right now things are set up in CFT so that different reconstructions can have different prune counts, but there's no "upstream control mechanism" for which counts you might want to perform for a particular clonal family.

We could do it so that when you run the build you specify the prune count, and can therefrom create a separate dataset that has been run with alternate prune counts. But this isn't really optimal for comparisons, and is sort of clunky on the build side having to switch datasets to look at different prune counts.

So the better thing to do (and part of why I organized the way I did around "reconstructions" in the first place) would be to make it so that for a given clonal family and prune strategy, you can have multiple reconstructions. So maybe when you build at the command line, you can specify --prune-counts=100,300 to compare?

On the far end of the spectrum in terms of control, it might be possible to specify somehow in the input yaml file that cft consumes which clonal families should be run at what prune counts. Moving forward in this direction dovetails with getting us to the point where it would be easier for you to select clonal families from the viz and have ASR trees or ecgtheow analyses computed from them.

metasoarous commented 5 years ago

@lauradoepker Regarding this annotate_fasttree_tree.py script, we should be able to move that into the pipeline easily enough. Just out of curiosity though, how well does this script handle for big trees?

lauradoepker commented 5 years ago

@metasoarous I'm not sure. We haven't ever had problems with it. Some of my biggest clonal families have a couple thousand members.

metasoarous commented 5 years ago

Hey @eharkins; Do you want to help with this? Would basically mean plugging something along the lines of the above into the SConstruct.

eharkins commented 5 years ago

Happy to help, sorry for the delay here. Will read more closely and let you know if I have questions tomorrow.

metasoarous commented 5 years ago

@dunleavy005 @lauradoepker - @eharkins and I are taking a look at this and wondering what all went into the crazy awk step in your code snippet above. Is that something that is needed for your script to function properly or should we be able to safely ignore?

dunleavy005 commented 5 years ago

lol I don't quite remember, BUT either one of RevBayes/BEAST was trimming off columns with full N's so to make the two programs consistent, we removed sequence positions from the FASTA that had 100% N's in the column. Unfortunately, seqmagick --squeeze only trims off -'s, so the awk nonsense is just converting everything to dashes, trimming, then back to N's.

Basically not needed for you guys :)

P.S. my memory aint so bad!!! https://github.com/fhcrc/seqmagick/issues/72

eharkins commented 5 years ago

@lauradoepker closing this, it seems to work for cft now with a few changes. Let me know if you have any issues.

metasoarous commented 5 years ago

This script is choking on the bigger trees. I've tried to take steps to prevent it from becoming an issue by only having it run on the smaller trees, but for some reason this hasn't been working as I'd hoped. Hopefully I'm able to fix this, but if not, I may have to make it so that these targets wouldn't get built unless you specified a special flag (or some such).

lauradoepker commented 5 years ago

Yeah, it fails to render the image (or whatever) when the clonal family is 20,000 or 30,000 members big. Your fix sounds good @metasoarous. cc @dunleavy005

metasoarous commented 5 years ago

Not sure how I never came back around to close this issue, but it's now the case that if you run scons with the --fasttree-png flag, it should build out the fasttree images when it can.