matsengrp / cft

Clonal family tree
5 stars 3 forks source link

Remove cft alignment step #274

Open psathyrella opened 5 years ago

psathyrella commented 5 years ago

I swear there was a bunch of discussion on this somewhere, but I guess it wasn't in an issue.

Anyway, it turns out that when originally written, cft didn't realize that partis outputs information on shm indels (probaby because the partis docs sucked more then), so things are already aligned. As I understand it this is a significant fraction of both the code base and run time, so it'd be worth removing.

metasoarous commented 5 years ago

Thanks for posting this issue @psathyrella.

Pinging @matsen here, as my recollection was that the last time we discussed this, Erick had reservations and made a case for sticking to doing the alignment ourselves. That conclusion may have had to do with issues (at the time) with the indel information. However, these have since been resolved. If it makes sense to do so, taking out all the alignment code would, as Duncan suggests, be a boon for both build time and code footprint.

Thoughts on this Erick?

psathyrella commented 5 years ago

I think we'll have to proceed without erick, as he's offline til may, at least as long as Eli still feels like working on this nowish.

I really don't think there's anything that a re-alignment step could be adding. The process of annotation is alignment, and I can't think of a way that re-doing that without any knowledge of the vdj rearrangement process would add information. It seems much more likely to me that there was just a miscommunication about the available information.

psathyrella commented 5 years ago

oh, and to be clear, at least initially, I think we'd definitely want to make it an option to still align, rather than just nuking the code completely. I mean if nothing else, that is mandatory in order to be able to compare before/after to make sure nothing is wrong.

eharkins commented 5 years ago

Happy to get started on turning alignment into an optional step as soon as I have a chance.

matsen commented 5 years ago

The key issue here is what to do with the indel information with partis.

If we don't do anything and just use the indel-reversed sequences, then the indel information is lost and @lauradoepker won't realize that there are indels at all in the sequences, which seems like the worst outcome. I agree that losing the alignment information from partis is bad, but turning the indel information into a MSA that we can pass to a phylogenetic program seems non-trivial. Let's discuss more.

psathyrella commented 5 years ago

How about we have partis do the one extra disambiguation step (overlapping indels) to output aligned seqs. There's already options to tell it to output alignments, they're just usually based on imgt-gapped v sequences, but this would be easy. And probably other people will want it eventually.

lauradoepker commented 5 years ago

I just created a new issue (https://github.com/matsengrp/cft/issues/277) documenting all the discussion we've had on slack and in emails about the QA255.157-Vk indel family and subsequent issues (problems).

eharkins commented 5 years ago

Does https://github.com/matsengrp/cft/pull/284/ affect this discussion of dealing with indel information from partis as it affects the alignment of sequences?

psathyrella commented 5 years ago

After chatting with Erick, and thinking about it a little bit, I think this should be added as a default processing step in partis when it reads its hmm output. Currently it handles shm indels individually for each sequence in the preliminary smith-waterman step. It then makes no modifications to this indel information when it's reading the new, multi-sequence annotation from the hmm. In retrospect, it makes a lot more sense to harmonize all of the indel information from each sequence together once we've decided which sequences to cluster together.

eharkins commented 5 years ago

This plan sounds ideal ^. In the mean time it seems like we can make some changes to the cft alignment step to make sure it only happens when necessary.

The presence of indels is an important determining factor of when aligning would be necessary as many people already said in this issue.

Currently, CFT always uses indel reversed sequences and always aligns, regardless of whether there are any indels. After talking with @psathyrella (correct me if I misrepresent what we talked about), we came up with this desired behavior as far as aligning / using indel reversed sequences:

We should have the ability to use the input_seqs key from partis as opposed to the default(?) behavior of using indel_reversed_seqs

If we are using indel_reversed_seqs or if there are no indels, we should not align.

If we are using non-indel-reversed sequences (input_seqs) and there are indels, we should align.

Does this make sense @lauradoepker @matsen ?

lauradoepker commented 5 years ago

Does "aligning" in this case add gap characters? @eharkins

eharkins commented 5 years ago

@lauradoepker yes, if that is the best alignment according to the program we are using (it can but not necessarily as I understand).

metasoarous commented 5 years ago

If there are gap characters to add (in particular if there are sequences of different length, as we'd expect), aligning will assign gaps. There are cases where you can imagine a set of sequences of the same length getting gaps, but it would require that there were at least two distinct deletion events, and that all sampled sequences have one or the other. Then again, if there's a deletion or insertion from naive, since we align the sampled sequences together with the naive, there would always end up being a gap char somewhere in our particular setting.

matsen commented 5 years ago

This sounds good, though the following:

If we are using indel_reversed_seqs or if there are no indels, we should not align.

makes me want to add some sanity checks in. For example, I'd want to make sure that the sequences didn't get trimmed or shifted in some funky way.

psathyrella commented 5 years ago

If the sequences are coming from partis, any length discrepancy like that would be a pretty bad bug, so I'm inclined to keep it as a crash, since it'd be a case we want to evaluate by hand. The sequences are coming through bcrham, where there's a straight up assertion that they're the same length. I also think that there are cases where aligning sequences that don't need alignment would be bad, since the cft alignment doesn't know anything about vdj rearrangement or the specifics of each sequence's shm indel information.