compare inferences to previous ones

matsen commented 7 years ago

With https://github.com/matsengrp/ecgtheow/commit/542edad138bc7b417ae12597bf248b41d4c2e682 the pull-ignored.sh script should pull in some older inferences from @lauranoges and company. We would like to compare our existing inferences using ecgtheow with these previous inferences. Laura can give details about these files.

My proposal is to make a FASTA diffing tool. It should be simple.

Play around with Biopython's FASTA loading, loading in some FASTAs. Note that the sequences won't be sequences, but will be objects. If needed you can get raw sequence strings back from a seq_obj by using str(seq_obj).
Turn these into dictionaries that are keyed on the sequence and take values as ids. Not tested code, but this will look like seqs = {str(r.seq): r.id for r in SeqIO.parse(args.fasta_path, "fasta"))}
Load in two of these, then look for intersection and symmetric difference. The slickest way to do this is to use set operations after turning the keys into sets.
The desired output is something like the following:

in both:
naive0 <-> naive0
one_sequence_name <-> another_sequence_name

in filename1:
a_seq
another_seq

in filename2:
b_seq

if you want to get fancy you could come up with some serialization format.

Another fancy option would be to integrate this into the ecgtheow graph display, in which a specified set of sequences (with the same naive and seed sequences) were added to the graph as needed, and highlighted with special symbology (such as being boxes rather than circles. But that seems like a reasonable second step given the results of the first.

lauradoepker commented 7 years ago

@matsen do you mean that we're comparing the old clusters that were used for Lauren's project to the new clusters that are deployed now (with better pruning etc)? I got lost when you mentioned "old inferences."

matsen commented 7 years ago

Well, we are doing more than just comparing the clusters-- it takes the whole slew of tools, including tree building, to get the ancestral sequences. So I called them "old inferences".

Not just the ones that are deployed now, but also the ones that we will be developing through this project.

dunleavy005 commented 7 years ago

A little confused here, there don't seem to be sequence names in the file, only sequences (and counts?). This diff printout:

in both:
naive0 <-> naive0
one_sequence_name <-> another_sequence_name

in filename1:
a_seq
another_seq

in filename2:
b_seq

Am I comparing sequence names across 2 files that have equivalent seqs? (like one_sequence_name <-> another_sequence_name) Am I printing out seqs that are unique to each file too? or just names?

dunleavy005 commented 7 years ago

I may have inadvertently answered my own question: I've been wanting to compare the counts of intermediate sequences when I only use non-stop codon sequences vs. include all. This FASTA diff'ing idea would work well if I'm understanding this as follows:

Both: inferred_4_6845 <--> inferred_2_8161 (in my runs, both of these had the same AA seq, and I can quickly see how the counts compare).

Is this what this issue is about?

P.S. it seems the old inferences ONLY have counts attached to them, in fact I don't even know how many posterior samples there were total, so it's hard to compare counts from old<--->new analyses, I guess we can do ranking comparisons and compare relative ranks.

lauradoepker commented 7 years ago

I'm lost here. I hope @matsen can field this one.

matsen commented 7 years ago

Right.

But if inferred_9_333 was only seen in the first file, then we would see

in filename1:
inferred_9_333

This is like the unix diff tool, but where differences are investigated only on the level of the sequences but not the names.

dunleavy005 commented 7 years ago

Example FASTA diff check:

dunleavy005@amrit-vbox:~/ecgtheow$ python/fasta_diff.py runs/normal/*.fasta runs/stop_codons/*.fasta --filter 50
FILE1 & FILE2:

inferred_2_8302 <---> inferred_4_6586
inferred_8_156 <---> inferred_5_4514
inferred_3_7584 <---> inferred_3_6701
inferred_7_159 <---> inferred_11_53
inferred_5_1835 <---> inferred_2_8161
inferred_10_96 <---> inferred_7_605
inferred_4_6845 <---> inferred_6_915

FILE1 ONLY:

inferred_12_52
inferred_6_223
inferred_11_72
inferred_9_144

FILE2 ONLY:

inferred_8_347
inferred_10_62
inferred_9_63

lauradoepker commented 7 years ago

I can't upload .fasta files to GitHub and I don't want to change them to .txts, so the following are on Slack: fastas for all the old inferences, used by Lauren to create and test antibodies. The "fixed bases" means that Zak manually altered some of the node bases to better match what we expected (to fix the funky back mutations that were cropping up).

dunleavy005 commented 7 years ago

Can you put these on stoat at /home/matsengrp/working/matsen/ecgtheow/inferences/2017-07-10?

lauradoepker commented 7 years ago

Yes @dunleavy005

lauradoepker commented 7 years ago

@dunleavy005 I would love these small changes to your output file format to make my life easier:

1) Line 51 replace " <---> " with "and" or "_and_" so there are no spaces or weird characters (Geneious doesn't like that). 2) Line 51, 57, 63 please print a > sign in front of the sequence name so that it is in FASTA format.

Thank you!

lauradoepker commented 7 years ago

Here is the analytical approach I used: 1) Used Amrit’s output of comparing BEAST analysis lineages to lineage fastas from Lauren/Zak 2) Combine the shared sequences, the BEAST only, and the Lauren only sequences with their new names that allow me to tell which group they belonged to (shared or not shared). 3) Extract the sequences that were observed more than 50x 4) Manually manipulate these sequences into a logical chronological sequence and ignore the ones that seem nonsensical.

Here are my conclusions from comparing all these BF520.1 lineages: 1) Including sequences that contain stop codons does not seem to help us. Let's exclude them. 2) ML seems to work a little better than parsimony, as judged by my knowledge of B cell evolution and intuition. 3) These two conclusions stand for both heavy and light chains of BF520.1 antibody. 4) BEAST looks better than the old inferences. Let's us BEAST for lineages of great interest. I like that the sequence name includes the number of times BEAST saw that sequence in the lineage (higher is better). Some of Lauren's old inferences seem incorrect or misguided, but they overlap for the most part and we have a clear path forward: synthesize any new BEAST-identified intermediates and test them to finish out the story. Our methods section is super strong because we employed ML, parsimony, and BEAST and came up with basically the same story no matter how we looked at it.

dunleavy005 commented 7 years ago

@lauranoges changed!

lauradoepker commented 7 years ago

Also, I know you (@dunleavy005) did all that work to translate to AA sequence, but can we get this output in nucleotide sequence? This will help me make judgement calls about codon usage and whether a AA change required one or two base mutations.

dunleavy005 commented 7 years ago

Which output? The diff'ed files output, or each individual ranked list output? I ask b/c common sequences might not be common anymore on the DNA level.

lauradoepker commented 7 years ago

@dunleavy005 I meant the diff'ed files. However, I didn't know about the individual ranked list. (I should check those out... where are they?) I'm okay with the diff'd files being altered because the NT sequences cause some of the shared sequences to separate back out.

lauradoepker commented 7 years ago

Okay, @dunleavy005 and @matsen I made some changes so that I can run BEAST lineages on different version of partis with different nprune values. I committed all these changes to a new branch called laura. To begin with, I'm running "latest" version on IgG with nprune = 100 and it will be done later today. If there are no errors, I'll run all of the following: v9 vs. latest, IgG and IgK, and nprune = 50, 100, and 150. Changes I just made to the scripts (on branch laura): 1) Altered the README.md file so the seed example is correct 2) Altered the run_beast_asr.sh script to correctly cp the .csv from the right path given a flagged/specified version 3) Altered the run_beast_asr.sh to include the CFT_VERSION (which is actually a partis version, haha) in the file names of all the outputs (right after the seedpruned specification)

lauradoepker commented 7 years ago

@dunleavy005 I would like to generate a .png graphic of the trees with their nucleotide sequences in addition to the .pngs of the trees based on amino acid. Can you remind me where to go to change this? I manually diffed the fasta outputs before they were translated, but now I need to run something else to generate the graphic?

dunleavy005 commented 7 years ago

Yeah, I vaguely remember sending you some files last week doing this? If this is that same thing, I think it might make sense to modify trees_to_counted_ancestors() to accept a flag whether we want AA or DNA outputs.

lauradoepker commented 7 years ago

I just manually/visually compared all the BEAST runs for "latest" partis version using nprunes of 50, 75, 100, 125, and 150. I used lower filters to see if that helped... it didn't really. I still like nprune = 100 the best.

It turns out that, for this project, I'd like to order the antibody chains by ordering the biologically-relevant nucleotide sequence that's associated with the inferred translation that we were using in BEAST. (The easier option is to just back-translate the amino acid sequence... but that gives me a sequence that's nothing like the true biological ancestral sequence).

@dunleavy005 can you help me or point me in the right direction so I can match the translations (for example: "inferred_3_6757") to the set of nucleotide sequences that originally translated to that? This is exactly what we talked about last Thursday... I didn't realize I'd want it so soon.

To recount what we discussed: I think we want to change the trees_to_counted_ancestors.py script to output a file (perhaps dictionaries?) that tells us which nt sequences are matched with which aa sequences and also what that aa sequence got named (i.e. "inferred_3_6757")

dunleavy005 commented 7 years ago

Yes, we'd want to output a file that maps each "inferred_X_XX" name to a list of DNA sequences that associate with the AA seq. Alternatively, instead of a list, we output the most frequently occurring DNA sequence that "inferred_X_XX" had.

trees_to_counted_ancestors.py is definitely the file that would need editing for this, anywhere where there's a translate() command is where the DNA--->AA conversion is happening.

dunleavy005 commented 6 years ago

Re ^, a new commit (https://github.com/matsengrp/ecgtheow/commit/2ee087c02a98f15b1b5fd40a118fe5eb70bdc06f) addresses the DNA-AA sequence mapping issue. Now, we also output a file *.dnamap that provides the "inferred_X_XX" names with their associated (DNA sequence, count) pairs.

I've attached sample files for IGH without stop codons. Note that the DNA counts don't add up to the AA count b/c sometimes we observe different DNA sequences in the ASR that map to the same AA, so when we do AA counting the duplicate is ignored.

BF520.1-igh.family_0.healthy.seedpruned.100.aa_lineage_seqs.dnamap.txt BF520.1-igh.family_0.healthy.seedpruned.100.aa_lineage_seqs.fasta.txt

dunleavy005 commented 6 years ago

cc @metasoarous: Re: AA-DNA counting confusions

matsengrp / ecgtheow

compare inferences to previous ones #5