Try data set variants - Githubissues

matsen commented 7 years ago

There are several processing steps that go into getting a data set ready for BEAST. These are automated in CFT, but given that we want to try variants we have to do them by hand.

Install dependencies

conda install pandas
conda install -c bioconda colorbrewer
pip install seqmagick

git clone git@github.com:matsen/pandis.git
git clone git@github.com:matsengrp/cft.git

(Below I assume that you have cloned these in a ~/re/ directory, but change below depending on your layout.)

Getting a clonal family

 cp /fh/fast/matsen_e/processed-data/partis/laura-mb/v9/seeds/BF520.1-igh/BF520-h-IgG/run-viterbi-best-plus-0.csv BF520.1-h-IgH.csv

note that light chain is next door.

Extracting clonal family sequences

Note that we should vary the definition of "healthy" to see if we can get more intermediate sequences on the naive-to-seed path.

~/re/pandis/transpose_family.py BF520.1-h-IgH.csv
~/re/pandis/healthy_to_fasta.py BF520.1-h-IgH.family_0.csv
~/re/pandis/get_naives.py BF520.1-h-IgH.csv >> BF520.1-h-IgH.family_0.healthy.fasta

Pruning the tree

FastTree -nt BF520.1-h-IgH.family_0.healthy.fasta > BF520.1-h-IgH.family_0.healthy.tre
~/re/cft/bin/prune.py --naive naive0 --seed BF520.1-igh BF520.1-h-IgH.family_0.healthy.tre -n 100 --strategy seed_lineage > BF520.1-h-IgH.family_0.healthy.seedpruned.100.ids
seqmagick convert --include-from-file BF520.1-h-IgH.family_0.healthy.seedpruned.100.ids BF520.1-h-IgH.family_0.healthy.fasta BF520.1-h-IgH.family_0.healthy.seedpruned.100.fasta

This will give you a fasta you can use to include into the BEAST xml file, or build trees in Geneious!

matsen commented 7 years ago

If we throw away sequences with stop codons, do we get fewer useful intermediate sequences?
If we subsample randomly, do we lose resolution?
How do different pruning strategies change the outcome?
If we use the original sequences rather than the indel reversed sequences, do we see things that look like real indels, or just sequencing noise? (This will require changing https://github.com/matsen/pandis/blob/master/healthy_to_fasta.py to use input_seqs rather than indel_reversed_seqs).

dunleavy005 commented 7 years ago

A little confused about the notion of "getting more lineage sequences". In the snippet above for prune.py, you write -n 100 so only 100 sequences will be outputted from pruning. Do you just mean "vary the healthy definition to see if we get more quality sequences for inference (determined by our posterior runs)?"

Also, there seems to be only 1 pruning strategy that involves a seed lineage. I'm assuming you want me to think of more heuristics to prune with seed lineages (at random, etc.), yes?

matsen commented 7 years ago

Sorry for the imprecise language. I replaced "more lineage sequences" with "more intermediate sequences on the naive-to-seed path". Is that clearer?

I'm assuming you want me to think of more heuristics to prune with seed lineages (at random, etc.), yes?

That wasn't my idea, but hey, any new ideas are welcome! The current seed pruning should be clear from looking at the comments in prune.py, though you could also look at the tail end of https://github.com/matsengrp/cft/issues/12.

If you want to read some discussion of what's current for CFT, see https://github.com/matsengrp/cft/issues/177. It would also be interesting to try random subsampling from the clonal family (re https://github.com/matsengrp/cft/issues/179) and see how that changes the number of intermediate sequences.

@metasoarous ideas are https://github.com/matsengrp/cft/issues/172 (though note that now that we have fixed https://github.com/matsengrp/cft/issues/12 it no longer has the bad behavior described at the beginning of this issue). There's also https://github.com/matsengrp/cft/issues/182 .

lauradoepker commented 7 years ago

I’m all up and running now. I have generated .fasta files for both n = 100 and n = 50 sequences and now I’d like to begin analyzing them… but @matsen you’re right: I’d need a tree-building program that can show me intermediate node sequences if I am to evaluate these ‘health’ cutoffs. I got PAUP* working on Geneious, but it still doesn’t give me the ancestral reconstruction.

Also, I looked at healthy_to_fasta.py and I don't see a cutoff for 'stops' in # Here is our definition of healthy. df['healthy'] = df['in_frames'] & (~ df['stops']) & (~ df['mutated_invariants'])

So it looks like I'll need some guidance for altering the health criteria.

matsen commented 7 years ago

@lauranoges nice work!

Note that healthy_to_fasta.py is my script, and it's throwing out anything with stops right now. That's what ~ df['stops'] does. You'll want to modify that if you want to allow stops. Perhaps you and @dunleavy005 could get together and work on that?

lauradoepker commented 7 years ago

When I run with original input_seqs rather than indel-reversed, I get stuck when the sequences aren't the right length (or alignment?).

"Wrong number of characters for BF520.1-igh: expected 438 but have 382 instead. This sequence may be truncated, or another sequence may be too long."

matsen commented 7 years ago

Right, this will require another multiple sequence alignment step. You can do this in geneious. Try translation align-- this is closest to what we are doing in CFT. If we go this route we will need to add a multiple alignment step to the core inference machinery.

dunleavy005 commented 7 years ago

no stop codons: bf520 1-igh family_0 healthy seedpruned 100 aa_lineage_graph

stop codons included: bf520 1-igh family_0 healthy seedpruned 100 aa_lineage_graph

They look very similar in the edge mutations and confidences. Also, note that, excluding the bottom portion of the graph, the mutations happen at positions 32, 57, 107.

matsen commented 7 years ago

Nice!

lauradoepker commented 7 years ago

By Geneious analysis, it looks like none of the indel sequences are functional, let alone informative, for the BF520.1 seed. They all looks like frame shifts, with plenty of stop codons. Since I did the deep sequencing on B cells from PBMCs from blood samples, I presume this is PCR error or non-functional heavy chain. Either way, we can forget the indel sequences for BF520.1. I could try to check another seed of interest.

matsen commented 7 years ago

OK, cool. To be clear, you are talking about the original sequences, not the indel-reversed sequences, right? Do the indel-reversed sequences also look broken?

lauradoepker commented 7 years ago

@matsen Yes, even the indel-reversed sequences look like crud. For this seed, I'm happy throwing them all away as garbage.

@dunleavy005 In order to tell if the stop codons "make any difference", I'd need to compare a list of intermediate sequences with stop codons allowed and the list of intermediate sequences with stop codons excluded. Do you have these fasta files?

Additionally, I'd want to look at the tree structure too so I can see how many leaves sprout off of which nodes.

matsen commented 7 years ago

@lauranoges this is exactly the sort of thing I was hoping you would sort out!

lauradoepker commented 7 years ago

I'm not sure how to comment on @dunleavy005 's code run_beast_asr.sh but here are my comments:

1) I had to change the paths for several of the lines. You had lib/pandis/ and lib/cft/and I have a different structure in my home directory, so I just replaced lib/ with my own path ~/re/ in the script I cloned.

2) I had to pip install jinja2 - should this be listed as a requirement in the README.md?

3) I ran into an error: "Unable to access jarfile /app/BEAST/1.8.2/bin/beast/lib/beast.jar. I'm running off comp sci's module BEAST which is v1.8.2. I noticed that my ${BEAST_DIR} which happens to be /app/BEAST/1.8.2/bin/beast doesn't have the .jar file in it. Instead, I needed this alternate path for the .jar file "~/.beast/2.4/BEAST/lib/beast.jar" However, when I manually changed to this path, it finds the .jar file but it says there's an unrecognized argument: -warnings. So it didn't work.

Perhaps I need to update my BEAST version? I tried to update and didn't get anyway.

dunleavy005 commented 7 years ago

1) did you git clone with the --recursive flag? That should automatically download cft and pandis and run it fine.

2) oops will add that.

3) EDIT I think you should manually install BEAST as linked below. Strange that the warnings flag doesn't get recognized though I can't explain that too well. If you rerun it with the proper directory , what happens?

dunleavy005 commented 7 years ago

Looking online, beast 2.4 is a newer version of beast called beast2 but we're using the original beast which is at 1.8.4, maybe check http://beast.community/install_on_mac

lauradoepker commented 7 years ago

I'll try the --recursive flag. The manual install of BEAST is tricky because I'm working off stoat which is linux. Their instructions are weird and I don't immediately understand them. The commands to unpack a .tgz failed when I tried them on my command line.

"Installing BEAST on Linux To add: Unpacking tarball. Moving directory to /usr/local/ or some such place. Describing the shell scripts in /bin. Creating symbolic links? Running directly using the .jar file (to allow JavaVM options). Running without command-line options to bring up GUI."

lauradoepker commented 7 years ago

Great. I just ran @dunleavy005 's code to generate fastas for 50, 100, and 150 pruned sequences. Now, I'm looking for ways to alter other parameters (maybe I'm using the wrong words there) like # of stop codons allowed.

We don't need to worry about indels or indelreversed.. let's leave it as-is using indelreversed. They don't matter anyway.

I also need to determine if using merged timepoints vs. single timepoints alters the reconstructed lineages.

Can @matsen think of any other health-related parameters to tinker with?

dunleavy005 commented 7 years ago

@lauranoges Here are the ranked FASTA files for "normal" aka no stop codons, and stop codon included FASTA files. The counts are displayed in the names "inferred_rank#_count#"

normal.txt stop_codons.txt

When I diff the two files, I get:

dunleavy005@amrit-vbox:~/ecgtheow$ python/fasta_diff.py runs/normal/*.fasta runs/stop_codons/*.fasta --filter 25
FILE1 & FILE2:

inferred_2_8302 <---> inferred_4_6586
naive0 <---> naive0
inferred_8_156 <---> inferred_5_4514
inferred_6_223 <---> inferred_12_30
inferred_3_7584 <---> inferred_3_6701
inferred_7_159 <---> inferred_11_53
inferred_5_1835 <---> inferred_2_8161
inferred_10_96 <---> inferred_7_605
BF520.1-igh <---> BF520.1-igh
inferred_4_6845 <---> inferred_6_915

FILE1 ONLY:

inferred_12_52
inferred_11_72
inferred_9_144
inferred_13_25

FILE2 ONLY:

inferred_13_28
inferred_8_347
inferred_10_62
inferred_9_63

The corresponding trees are shown here: https://github.com/matsengrp/ecgtheow/issues/4#issuecomment-318770340.

lauradoepker commented 7 years ago

Notes from meeting with @matsen and @dunleavy005 Current pipeline from deep sequence results to BEAST lineage output: 1) downsampling = to what degree? 2) partis with seed 3) healthy filtering = # of stop codons, in frame, further downsampling, indel reversing, 4) FastTree 5) pruning/trimming = # of desired leaves 6) BEAST 7) ecgtheow

lauradoepker commented 7 years ago

@dunleavy005 Command line flags for: version (i.e. v9, v13, latest) seed (i.e. BF520.1) chain (igh, igk, igl) isotype (G, M, K, L)

lauradoepker commented 7 years ago

@matsen Re: our conversation about indels being garbage, I re-pruned the .fasta that contained 100 alread-pruned sequences plus the ~40 indel-reversed indel sequences from the original .csv. Pruning it back down to 100 did not just prune out all the indel sequences, it left 18 of them in the re-pruned list and must have pruned out 18 of the other sequences that were originally included in the pruned list.

This leaves our conclusion the same: we should continue indel-reversing, try to mark the indel-reversed sequences with some flag in their name so we can identify them later on the trees, and prune on this healthy indel-reversed list so both the indels and non-indel seqs can be in the pruned list. If we become interested in the pruned lineage and we notice that leaves of interest are indels (because their name indicates such), we will investigate.

I want to reference an issue in cft here #157 but I'm not sure how: Indel flagging with asterisks #157

lauradoepker commented 7 years ago

@dunleavy005 can you help me understand --mcmc-iter --mcmc-thin and --mcmc-burnin? I'm changing all the options on the run_beast_asr.sh so it'd be nice to get an explanation sooner rather than later after I run them all and move on :)

dunleavy005 commented 7 years ago

BEAST runs a long chain to sample the trees, and --mcmc-iter records how long the chain is (i.e. how many potential samples you'll have).

BUT, because the tree at iteration "i" is directly dependent on the tree at iteration "i-1", there is some autocorrelation between the trees. Thus, --mcmc-thin tells me "how many samples in the chain should pass before transferring a sample from the chain to my dataset", it allows us to "thin" the correlation so we can get approximately independent samples.

Of course, in the beginning of the BEAST run, the sampler is getting its orientation and isn't quite sampling from the "true distribution" yet. So the --mcmc-burnin records how many beginning samples to throw away, so I can remove that noise from orienting the chain at the start.

lauradoepker commented 7 years ago

Hi @dunleavy005, Duncan (not in this channel) has organized his analyses from different partis runs differently, so the scp stoat:/fh/fast/matsen_e/processed-data/partis/laura-mb/${CFT_VERSION}/seeds/${SEED}-ig${CHAIN}/${SEED%%.*}-${CHAIN}-Ig${ISOTYPE}/run-viterbi-best-plus-0.csv data/${SEED}-ig${CHAIN}.csv step won't work for kate-qrs data for example. Shall we just change this so that we input the path to the run-viterbi-*.csv file so I can run on any/all data that I want? Duncan is rerunning everything we've got through partis so I'll be rerunning ecgtheow on everything soon.

dunleavy005 commented 7 years ago

Yeah, definitely, I'll do it now.

lauradoepker commented 7 years ago

Code-adjustment for @dunleavy005 and myself: To police the efficacy of the prune.py script, we need to Output a FastTree pdf (send the BF520.1-igh.family_0.healthy.tre to FigTree). However, we'd really like to automatically color the taxa so we can identify which ones were selected by the prune.py script (BF520.1-igh.family_0.healthy.seedpruned.100.ids). I have no idea how to do this. I manually changed the color for 25 taxa when I compared nprune 100 vs. 125.

If we can achieve this, we should have Will look it over.

This may also apply to Chris Small's CFT pipeline, since he also uses the prune.py command.

lauradoepker commented 7 years ago

^Maybe we can color it with this: https://github.com/mooreryan/color_tree

matsen commented 7 years ago

I'd suggest http://etetoolkit.org/ which we use for lots of other things 'round here.

dunleavy005 commented 6 years ago

^^Prune coloring is done here (https://github.com/matsengrp/ecgtheow/commit/eea0b4c2cd8867052989d5fdffe9e03dea3a2f46). It's a bit wonky b/c many families have tons of nodes, so its really hard to see, any suggestions on improvement? (edit: maybe seed and naive can be colored differently, to differentiate the prunes from seed/root)

bf520 1-igh family_0 healthy seedpruned 100 ids tre

lauradoepker commented 6 years ago

Hmm, I can make it all sorts of other shapes using FigTree, but I'm not sure that the coloring will come through with the Newick or Nexus file. We could try -- can you upload the tree file? @dunleavy005

dunleavy005 commented 6 years ago

Can you see the tree I posted above? Is that the tree you're asking for?

lauradoepker commented 6 years ago

Yes, that's the FastTree with nodes colored for the sequences that survived the pruning, right? Also -- are the red ones the ones that survived? Or the black? @dunleavy005

dunleavy005 commented 6 years ago

Red ones are the chosen prune nodes , black are all rest.

lauradoepker commented 6 years ago

Oh, I meant upload the Newick or Nexus file @dunleavy005

dunleavy005 commented 6 years ago

tree.txt

dunleavy005 commented 6 years ago

How about this? I think there aren't too many options really, b/c there are tons and tons of seqs. bf520 1-igh family_0 healthy seedpruned 100 ids tre

lauradoepker commented 6 years ago

That looks great @dunleavy005 ! However, I'm confused because I thought BF520.1-igh only had 151 sequences to begin with and our prune script was paring this list down to 100... but I see many more sequences here than 151.

Also, this pruning job makes me a bit nervous. I think you used "prune" (rather than "minADCL trim" so we should be getting sequences along the lineage... this graphic doesn't immediately look like this is the case here.

dunleavy005 commented 6 years ago

No, definitely not, it has 588 sequences including the naive and seed seqs. I used /fh/fast/matsen_e/processed-data/partis/laura-mb/v9/seeds/BF520.1-igh/BF520-h-IgG/run-viterbi-best-plus-0.csv for my dataset.

I've attached a new graphic with naive seq. outgroup. I've tabulated the prune counts for the 6 subtrees on the naive-seed lineage.

1st: 24 2nd: 2 3rd: 24 4th: 24 5th: 1 6th: 23

Definitely seems like pruning is working (at least on a count level).

bf520 1-igh family_0 healthy seedpruned 100 ids tre

lauradoepker commented 6 years ago

Just FYI @metasoarous has updated the prune.py script, so I just git pulled that into my own repository and am rerunning the BF520.1 ecgtheow runs. I'm doing this for the latest versions of laura-mb and laura-mb-2.... IgG and IgK for each.

matsengrp / ecgtheow

Try data set variants #4

Install dependencies

Getting a clonal family

Extracting clonal family sequences

Pruning the tree