matsengrp / cft

Clonal family tree
5 stars 3 forks source link

Indel handling in QA255.157-Vk #277

Closed lauradoepker closed 5 years ago

lauradoepker commented 5 years ago

lauranoges [Feb 26th at 5:28 PM] @csmall @matsen I just discovered a pretty big issue today that may be my fault for not being attentive enough but: QA255.157-Vk seed is part of a clonal family that includes many members with a 3bp deletion. The seed also has the deletion. When partis’ clonal family was generated, the *** that represented the deletions was “reversed” in CFT and the clonal family was reported as full length (without any deletions) AND the seed sequence was modified to NOT include that deletion…. without my knowledge. I know we discussed indels at length and decided to “flag” them somewhere in CFT, but I never saw this flag. Where is are the flags supposed to reveal themselves to me? On web-CFT? This is my first clonal family with an indel, so I’ll have to pave the way here for this one, but the entire lineage analysis is certainly wrong given that we reversed all the deletions in the clonal family. (Please reply in a thread to this message). 28 replies

csmall [3 months ago] Do you mean ---? Maybe * gets used somewhere for deletions upstream of cft, but in most programs that's a stop codon. (sorry for evading the main point of your comment but just want to sanity check for starters)

csmall [3 months ago] CFT does not use indel reversed sequences as input, so I'm not sure how this would have happened.

csmall [3 months ago] Last time we talked about this, I thought we agreed that the indels would make themselves obvious in the sequence alignments. We could certainly add a column to the table or a symbol option to the scatterplot to make it easier to find these clusters. But for large clusters, there are frequently sequences with indels.

csmall [3 months ago] They just might not always be the ones included in the minadcl or seed-lineage-pruned trees.

csmall [3 months ago] Incidentally, are you still using CFTWeb!? We should switch you over to Olmsted!

psathyrella [3 months ago] @csmall she's referring to the partis ascii-art output where I use blue stars for indels, e.g. p.png

psathyrella [3 months ago] also, @lauranoges maybe you missed chris's replies? they weirdly didn't show up at all in my unreads, only saw cause I was looking for this to say something. Arg, which I've now forgotten. sigh

lauranoges [3 months ago] @csmall, yes just seeing these replies, sorry. No, I do not use CFT, that’s why this is a big issue because nothing was flagged for me. I only used partis output and sent that partis output to ecgtheow (@dunleavy005). The real issue is that the seed sequence itself got indel reversed sometime in this process without my knowledge, so when we sent the clonal family (which I think was indel reversed) to ecgtheow, all sequences appear without the deleted codon. We need to flag this if/when it happens. I cannot have my seed sequence changing without my knowledge. @dunleavy005 @psathyrella do we know where in the process this clonal family got indel reversed? The cluster_seqs.fasta file in my ecgtheow output already has the indel reversed seed sequence for 157.Vk.

lauranoges [3 months ago] A follow up issue that’s just as problematic but doesn’t have an obvious or easy solution is that we need to figure out how to deal with lineage reconstruction when we have a family with indels. @matsen and I originally agreed that this is way too hard for now… but now it’s a year+ later and the issue has resurfaced.

psathyrella [3 months ago] partis output has the input and indel-reversed sequence, along with indel info: https://github.com/psathyrella/partis/blob/master/docs/output-formats.md#description-of-keys. Not sure if that helps tracking it down, but that's all i know (edited)

lauranoges [3 months ago] When I view the partis annotations, the seed has the deletion (blue stars that we mentioned above) so why does the output indel reverse these sequences? I thought partis output would not have the clonal sequences + seed indel reversed?

lauranoges [3 months ago] This confuses me because @csmall uses partis output for CFT and we don’t reverse the indels… so that means that partis output isn’t reversed? @psathyrella

psathyrella [3 months ago] sorry, that ^ was supposed to indicate a list: input sequence, indel-reversed sequence, and indel info

lauranoges [3 months ago] Hmm, if it provides both, then maybe this is an ecgtheow problem where we grabbed the indel reversed sequence file instead of the original. @dunleavy005

dunleavy005 [3 months ago] we're definitely using indel-reversed seqs for ecgtheow https://github.com/matsengrp/ecgtheow/blob/master/python/parse_partis_data.py#L43 python/parse_partis_data.py:43 " --remove-mutated-invariants --indel-reversed-seqs") matsengrp/ecgtheowAdded by GitHub

lauranoges [3 months ago] Okay, I’d forgotten that. What’re your thoughts on lineage reconstruction in families with indels? Possible? Difficult? Large scale fix?

lauranoges [3 months ago] @dunleavy005 @matsen

dunleavy005 [3 months ago] definitely not possible using standard techniques and don't know of ways off the top of my head to deal with it, usually people just align seqs and fill in indels, maybe @matsen knows more

csmall [3 months ago] @lauranoges Do you want Bayesian/Beast style lineage reconstruction accounting for indel reversal? Or are you fine with an ML reconstruction (single tree/reconstruction)? We tried to get better handling of indels in the latter, but none of the (several) tools we looked at ended up working (better than dnaml) for ancestral reconstruction in general, and some were downright buggy, so we stopped tinkering on this front. We could potentially look again at these software and see if any of them have improved, but I'm afraid it's a bit unlikely.

csmall [3 months ago]

No, I do not use CFT, that’s why this is a big issue because nothing was flagged for me. This does not jive with your initial comment. Do you mean you don't use CFTWeb but just look directly at the CFT output? I use the non indel reversed output of Partis, so in the CFT alignments, you should see indels when they're there. If not it would suggest upstream problems.

lauranoges [3 months ago] I see. Yes, I wanted Bayesian lineage reconstruction, but you bring up a good point: perhaps I CAN go back to CFTweb or Olmsted for this particular clonal family because it has an indel and use dnaml reconstruction with the indels still in consideration. It won’t be great, but we don’t have any other options for this family since ecgtheow won’t be reliable since it’s using indel-reversed sequences.

lauranoges [3 months ago] Yes @csmall, it’s upstream: partis to ecgtheow.

psathyrella [3 months ago] i could be misinterpreting, but it kind of seems like this is less of an issue with including indels in the phylogenetic reconstruction, and more of an issue with how we've set up the interface for viewing the results for each seed. On the face of it, it sounds like it'd be better to flag indels in the phylo output to make sure they're obvious. But I can think of other cases (e.g. that "inferred naive" from an intern over the summer that was just the consensus of two seeds sequences with no hits from the NGS) where it's really dangerous for us to not be looking at the partis output for every family to make sure things look sensible, and i'm not sure that putting some subset of the partis output (like indels) into the phylo output really addresses that.

lauranoges [3 months ago] right. I think one fix would be to explicitly label the filename (or sequence names?) to include “indel-rev” when there WERE indels to reverse within the clonal family. I understand that all of the families are sent through indel-reversal, but this means nothing for 13/14 of my families. For the 14th family, though, it mattered and I need to be alerted that my seed sequence changed. When I’m working with sequences, I now have QA255.157-Vk with both the original sequence and the reversed sequence… and they have the same name in the partis/ecgtheow output files. <-- that’s the problem.

dunleavy005 [3 months ago] I'm confused, how is CFT doing DNAML reconstruction using indels? @csmall dont you align the sequences, thus reverting the indels?

csmall [3 months ago] That was part of my point actually; The dnaml reconstruction itself doesn't really handle indels particularly well, creating weird inferences at indel sites for the internal nodes in the tree. However, the alignment of the tip sequences ends up preserving indels because we align together with the naive.

csmall [3 months ago] Relevant issues as far as cft is concerned: • https://github.com/matsengrp/cft/issues/131https://github.com/matsengrp/cft/issues/149https://github.com/matsengrp/cft/issues/170

matsen [3 months ago] @lauranoges sorry about the challenges here. It sounds like we'll get a lot of the way there with better labeling.

Phylogenetic inference using indels as informative sequence features is still a very hard problem after decades of work. That's part of what my new grant proposal is about. However, one can do sequence alignment and then the algorithm can treat indel-d areas as having missing data. This is how DNAML works and as we have seen it gives strange results for ASR in the indel-d regions.

lauradoepker commented 5 years ago

Hi there,

I’ve done some work with Duncan’s help to isolate the sequences from QA255.157-Vk’s clonal family that have the 3nt deletion that our seed sequence has. Here is the fasta. Unfortunately, it is an alignment, but I don’t think it’s the best alignment because, even though all seqs are the same length, it has dashes in a few places where I don’t think they should be. Please realign using whatever pipeline you normally use to do so.

I’ve renamed the naive sequence “naive with deletion” because I manually deleted the 3nts that are missing in the rest of these sequences (and our seed). I believe you need the naive to be the same length as the other sequences in the list in order to run ecgtheow and linearham, right Amrit?

Anyway, can you all help me get this smaller clonal family through our pipelines? CFT, ecgtheow, and linear ham? I’m interested to see the updated lineage paths with the deletion in mind.

For CFT, you may want to use the original inferred naive sequence that does NOT have the 3nt deletion… but that’s up to you I think.

Thank you, Laura (I will not attach referenced files now because I believe we're redoing this whole analysis, but if you want them, alert me and I'll send them).

lauradoepker commented 5 years ago

Amrit @dunleavy005 : Yes, ecgtheow will need aligned seqs but linearham doesn't b/c it runs partis annotate first.

Duncan @psathyrella : ah, wait, hold on. Sorry, I meant to ask if you wanted partis run with no indels, rather than just the non-indel-reversed sequences from the current run. That's easy to to do, and probably makes more sense than having the phylo program guess what the proper alignment is. It should be much more accurate to have the v/d/j-aware code in partis do the alignment.

Laura @lauradoepker : Sounds good to me, Duncan. Can you update us all when you’re finished? And let people know where they can get the new clonal family fasta?

psathyrella [6:56 PM] ok i pulled out the sequences from the seed cluster that had shm indel deletions of length 3, and re-annotated and partitioned them in two ways

  1. allowing indels (as in the original run)
  2. forbidding indels

the ascii-art annotation logs for the two are here

less -RS /fh/fast/matsen_e/dralph/partis/qa255.157-Vk-no-indels-study/indeld/*/*.txt

notes

i then wrote fastas with both the input_seqs and indel_reversed keys for both of those options, they're in each subdir ^ with the appropriate name (edited)

Laura @lauradoepker : I think we should use this fasta to rerun ecgtheow lineage analysis for this VK:
/fh/fast/matsen_e/dralph/partis/qa255.157-Vk-no-indels-tudy/indeld/allow_indels/indel_reversed_seqs.fa

lauradoepker commented 5 years ago

Then ensued lots of chatter about forcing this special case through ecgtheow, including conversations about altering ecgtheow to be able to input a fasta file of clonal family sequences along with their germline info that's usually pulled from partis' info.yaml file.

Then we realized that the above fasta is not right - it doesn't even contain the seed (QA255.157-Vk), it's just the largest cluster in the special partition Duncan created.

Eli @eharkins then edited ecgtheow to be able to search for the largest cluster of unique sequence ids that DOES contain the seed.

lauranoges [11:10 AM] Hi @psathyrella, I just analyzed @eharkins’s run of the QA255.157-Vk with indels allowed but the clonal family restricted to the length of sequences of the 157.Vk mature, which has a 3nt deletion. However. The mature in this cluster is the wrong one: it’s the indel reversed sequence that is artificially long (has 3 extra nts, presumably to match the naive). All ecgtheow inferred intermediates also have this length, along with the “indels allowed” inferred naive for this cluster. I want to check with you that this is indeed a cluster that has been filtered on the basis of the sequences allowed to join being the same 3-nt-deleted length as the biological mature sequence that also has this deletion? Or did something go wrong and this is not the right cluster? Can you remind me how you restricted the length of sequences that could be clonal here? @eharkins, in order to run EC, did you “indel reverse” all sequences in the cluster to match then length of the naive? I bet that’s what EC did. If so, I will just ignore all of the reversed nts (the whole inserted codon)

psathyrella [11:13 AM] I don't think i know what sequences ecgtheow pulled from my two output directories, and all I know at this point about how I made the two dirs is what I wrote at the time, so I'll go find that. I think it was on slack? (edited)

psathyrella [11:19 AM] ok found it, and pasted into /fh/fast/matsen_e/dralph/partis/qa255.157-Vk-no-indels-study/notes.txt

it sounds like the immediate issue is maybe just a matter of ecgtheow grabbing indel_reversed_seq instead of input_seq or vice versa? But honestly I feel like we've gone through so many iterations of confusion on this that we maybe should start over from scratch and define super carefully what we want, it seems like we're just getting the wrong thing over and over and it's kinda scary (edited)

lauranoges [11:20 AM] I’m now suspicious that the “inferred” naive for this cluster is likely not the true original naive. It’s probably the sequence in the phylogeny that first branched off after the deletion event — it has one SNP in the J gene and one SNP in the CDR3 compared to the original indel-rev naive that was calculated for the much larger cluster. Yes please. But wasn’t your method @psathyrella to writing the fastas flawed in that you took the largest cluster from whatever partition but these clusters did not contain the seed… so they should be deleted and forgotten about because they are irrelevant. @eharkins hacked around this and wrote a piece of script that chose the cluster that’s largest (counting only unique seq ids) and also contains the seed. If this is correct, let’s get these misleading fastas out of there.

psathyrella [11:24 AM] ok i think a meeting might be a good start... which incidentally erick has just sent us all an email about how we should maybe schedule regular meetings (edited) that sounds familiar. i just deleted them (edited)

eharkins [11:39 AM]

it sounds like the immediate issue is maybe just a matter of ecgtheow grabbing indel_reversed_seq instead of input_seq or vice versa?

This was my interpretation of this. I'm not sure which one of those keys @lauranoges wants for this indel specific study (input seqs since we do want to allow indels?). I'm happy to meet and discuss this, Laura's suggested time of 10am next wednesday works for me. Laura let me know how I can help in the mean time. To answer

in order to run EC, did you “indel reverse” all sequences in the cluster to match then length of the naive? I bet that’s what EC did. If so, I will just ignore all of the reversed nts (the whole inserted codon) Yes ecgtheow by default specifies using indel reversed sequences to the cft script it uses for processing partis data: https://github.com/matsengrp/ecgtheow/blob/master/python/parse_partis_data.py#L57 python/parse_partis_data.py:57 " --remove-mutated-invariants --indel-reversed-seqs") matsengrp/ecgtheowAdded by GitHub

lauranoges [11:43 AM] Great, Erick isn’t available on 5/29 anyway, so we’ll use the first meeting time to specifically discuss this indel lineage effort between the three of us. See you both next Wed at 10

psathyrella commented 5 years ago

At least from my understanding (correct me if I'm remembering wrong) a tldr would be:

In one important family, partis calls an shm indel within the v gene. Since this has big consequences for the ancestral antibody, we want to see what the family and lineage would look like if that isn't actually an shm indel. Rerunning partis with shm indels turned off, indeed things align reasonably sensible to a different v gene. Since the method that determines whether to call shm indels is just match/mismatch and gap open penalties (admittedly fairly optimized), as opposed to a fancy likelihood, it thus seems reasonable to push both with and without through linearham + olmsted to evaluate by eye.

But it turns out that in practice "with indels" and "without indels" can mean approximately sixteen different things in the context of this study, so we've spent several weeks confusing each other about what's what and what should be run on (and I had a bug in the original study output). So we called a meeting to reset and make sure we're all on the same page about exactly what the inputs and outputs are for each step.

psathyrella commented 5 years ago

Another thing we should discuss as regards this study is how to organize things so it's as reproducible as possible, given that (at least for my stuff) it involves adding a bunch of weird steps and options that can't really be incorporated into the usual push-button data workflow. For instance I have the output and extra scripts here /fh/fast/matsen_e/dralph/partis/qa255.157-Vk-no-indels-study/ along with some notes, so I can figure out what I did to make what's there, but it's not ideal. As far as cft and linearham, though, I don't know how it's set up.

lauradoepker commented 5 years ago

Duncan's @psathyrella summary is only mistaken in that indels cannot be turned off to yield a reasonably-aligned V gene. Indels must be allowed to infer the naive and the cluster, then I insist the logical thing to do thereafter is to subcluster the full length and the 3nt-deleted sequences to run two separate "before deletion" and "after deletion" lineage studies. I fear my NGS data won't have enough information to confidently piece together the order of mutations/events in the lineage evolution, but I figure that this is the right approach nonetheless. We will post a meeting summary next after we sit down in person.

psathyrella commented 5 years ago

with indels allowed:

b

with indels forbidden:

p

(both from the ascii txts in the previous dir ^)

But I guess sorting this kind of thing out is why we're having the meeting...

lauradoepker commented 5 years ago

Merging into #279

lauradoepker commented 5 years ago

And this #281