Bayesian analysis integration

matsen commented 5 years ago

For some of the CFs we will want to do an ecgtheow and/or linearham analysis.

Do we want this for each seeded CF?
How will they be run?

lauradoepker commented 5 years ago

1) No we don't, we want to continue selecting by hand 2) We will continue running by hand for now

We would appreciate help with optimizing running things by hand using bash shell scripts to submit batch jobs. @eharkins

metasoarous commented 5 years ago

FWIW, A really simple target would be to add some notation to the info.yaml file CFT ingests which specifies which seeds/clusters you want to run the Bayesian stuff for. Obviously fine if you prefer to keep running things by hand for now.

eharkins commented 5 years ago

We would like to be able to

have the option to run seeded clonal families fastas through linearham automatically from CFT
have a heuristic automation in whatever code we use to run linearham (that being CFT in 1.) which considers the achieved ESS, desired ESS and increases iterations accordingly.

metasoarous commented 5 years ago

Sounds great! Exciting!
Just keep in mind that this logic probably won't want to live in the SConstruct itself (best always for the logic of the SConstruct itself never to depend on the value of an intermediate output file). So if you really want this logic in CFT, you'll need to put it in a script which wraps (calls out to) linearham, does some diagnostics, and then decides whether or not to run again.

eharkins commented 5 years ago

Makes sense. Thanks @metasoarous! As of https://github.com/matsengrp/linearham/commit/7971127e16086dadcdf7c1396f18687fab46092e , Linearham can run directly from partis output (without rerunning partis annotate), assuming the user knows the cluster index of the cluster they'd like to process in LH. This would certainly make it easy to run LH from the CFT context where we have the partis output file and cluster index on hand in the SCons process and could pass to a script.

However, I wonder if it wouldn't be simpler to separate out the running of LH into it's own scons which just takes partis output and information to determine the cluster index, and runs LH to desired ESS.

The reason it made sense to include running LH as a target of CFT to me was in part because LH previously needed a clonal family fasta as input, which we were taking manually from CFT anyway when we ran LH.

Now that LH is capable of extracting the clonal family fasta of interest (could make sense to reuse CFT script for this which allows things like filtering indels) I don't see a reason to include the running of LH in the CFT pipeline other than convenience of lumping all of our clonal family analyses into one master scons.

The other side of the argument is that if Linearham is going to end up doing lots of similar processing of the clonal sequences (filtering indels, pruning, etc) then maybe it makes sense to include running it in CFT. But I don't think this is a very strong case. @matsen what are your thoughts?

metasoarous commented 5 years ago

I mostly defer to others who might be using this on more of a day to day basis. If it makes sense to maintain a separate flow, that's fine. I do however see some advantages to going the way of a shared build pipeline:

Potentially easier to ingest Bayesian analyses into Olmsted at some point, for visualization there along with the rest of your data
Uniform pre-processing, as you suggest with filtering & pruning & such

It however comes at the cost/risk of:

Having to specify the clusters upon which you wish to apply this process
Meaning you have to run CFT if you want to run whatever code it is you're working on

Of course, there's a way out of the latter of these risks, in that you can write whatever code you are looking to write not as a separate SConstruct, but just some python script. This would let you then call it from the current SConstruct if that's what you also want to do (now or later).

...

PS Totally trolling you here ^

eharkins commented 5 years ago

To be clear, @metasoarous is only trolling on the last line "..." since he knows I don't appreciate the abuse of ellipses 💬

Based on recent history of running Linearham for @lauradoepker, we want to be able to run Linearham on any version of a cluster to the extent that we have defined a way of arriving at that version of the cluster (which generally happens via cft/bin/process_partis.py)
We would also like to preserve the annotations created with special options run by Duncan, and take our annotations directly from partis output files.

If either of 1. or 2. are not valid assumptions, the rest of this is invalid as well and we should discuss goals for running Linearham on partis output.

As is, Linearham, when not provided with --partis-yaml-file, will require running partis annotate with the option --run-partis.

When provided with --partis-yaml-file and a cluster index corresponding to a cluster annotation in that partis output file, it will take cluster sequences according to https://github.com/matsengrp/linearham/blob/master/scripts/parse_cluster_seqs.py#L30

I don't think this matches the full cluster / annotation processing extent of cft/bin/process.py. So if we want to automate the running of Linearham on any CFT processed cluster, I think we need a way to interface with the options of cft/bin/process.py when we run Linearham for given cluster.

I think regardless of integration Linearham runs into the CFT pipeline or not, this will mean either:

modifying Linearham to allow it to directly use cft/bin/process.py instead of taking just input_seqs or indel_reversed_seqs
modifying Linearham so that it will take a processed (e.g. by CFT) cluster fasta as is (along with its annotation) with out rerunning partis annotate or doing something like this

If this makes sense, I'm happy to do one or the other and other necessary work, and/or chat about it in person first @matsen

eharkins commented 5 years ago

As discussed today with @dunleavy005 and @psathyrella :

Plan is to extend restrict_to_iseqs to rewrite accurate linearham info for the subset to a new partition.yaml file when we are subsetting a cluster in cft/bin/process_partis.py.

For rewriting linearham info post-pruning (another way we subset the cluster sequences which happens separately from cft/bin/process_partis.py) , we will have to do the same process of rewriting the partis file with recomputed LH info, so I will think about the best way for this to happen so that we are not doing this twice if we don't have to.

Either way, this will hopefully mean no change necessary to Linearham; once we have recomputed LH info given an existing partition.yaml and the subset of sequences we care about, we can pass this to LH with --partis-yaml-file.

eharkins commented 4 years ago

As promised in my last comment we can now create a partis output yaml file containing the annotation for only the pruned set of sequences from cft. Linearham can be run on this partis output (without rerunning partis in any way) using the command we output to a text file. This hopefully makes life easier in terms of running Linearham on CFT processed (with default settings or otherwise) partis clusters.

Deferring to @lauradoepker and @matsen for whether to pursue further automation in the form of running Linearham from the CFT pipeline via some special behavior.

metasoarous commented 4 years ago

Just to chime in, the solution @eharkins has come up with here is really nice as far as not overcomplicating CFT with a bunch of logic for figuring out which clusters to run Linearham on. You simply specify a flag when you run scons that indicates that you want the pruned partis files created, and it helpfully generates the Linearham command for running on said subset, but doesn't yet actually run it for you. This is nice (for now at least) because it means that we don't have to choose between a) running Linearham for every possible seed/whatever cluster b) adding some configuration to CFT for specifying which clusters you want to run Linearham on. You can simply worm you way down to the specific CFT output directory (or directories) in question, and run the Linearham command(s) which has been written out there. There's potentially value in taking things further, but this is a really great first step.

eharkins commented 4 years ago

@matsen @lauradoepker should we move this into long-term category on the project unless we are seeking further automation in the near future?

matsengrp / cft

Bayesian analysis integration #276