matsengrp / cft

Clonal family tree
5 stars 3 forks source link

How to filter sequences #177

Closed matsen closed 7 years ago

matsen commented 7 years ago

I have been playing with the BF520.1 sequences, as you know.

When looking at the original clonal family sequences (the alignment of which is shown in #175) I saw lots of indels, including one nutso one. But when I filtered by sequences that had the invariant codons there, were in-frame according to partis, and didn't have stop codons, all those went away.

So, @lauranoges , what sorts of filters do you want?

I think @metasoarous may need to check, but he indicated that we may be allowing sequences through that have one stop codon. Do you want that? I'd think we'd also want the invariant Cys/Tryp codons to be there too.

Thoughts?

metasoarous commented 7 years ago

We should also check in with @psathyrella on what he's doing now. Some of the hacky stuff I had to do was because at the time, some of his checks and such weren't set up quite right. I'll look up the details of what I've been running and update here.

lauradoepker commented 7 years ago

Sure I think one stop codon maximum is reasonable just because PCR error may occur and cause that sometimes. Invariant codons means landmark residues around CDR3? If so, then those are helpful (I'm not convinced that they're always there, but then again it may be a helpful filter). In frame is a must, yes.

psathyrella commented 7 years ago

I do no filtering based on functionality. Easy to change that, of course, if we want to.

psathyrella commented 7 years ago

linking to a thread on b-t.cr: http://b-t.cr/t/mutated-invariant-codons-and-functionality/306

tl;dr: most everyone agrees cyst and tryp/phen need to be in frame with respect to the start of V to be functional, but tryp/phen (and possibley cyst) could potentially be mutated to other similar amino acids. Oh, and yes, these are the codons that start and end CDR3, with imgt definitions.

The default of to keeping every sequence for which I can find a plausible BCR alignment is 'cause there's a fair number of studies where people are specifically looking at non-functional rearrangements, and I don't want to make them just disappear.

matsen commented 7 years ago

Thanks, everyone!

@metasoarous please check to make sure that we are applying an in-frame filter and add one if it's not there. Git grepping cft I don't see the in_frames appearing in the code, but I might be missing it somewhere.

Personally I would advocate for a stricter filter on stop codons. We know that there are lots of sequence errors in these sequences, and we are already using filters to cut out sequences, so why not throw out the bad ones?

lauradoepker commented 7 years ago

@matsen do you mean stricter than allowing one stop codon? i.e. not allowing any stop codons?

matsen commented 7 years ago

@lauranoges exactly.

matsen commented 7 years ago

@metasoarous I think this is fine to close now that you've implemented the frame filter, with the understanding that @lauranoges will try out various filters on her own.

metasoarous commented 7 years ago

Reposting from the other thread (#179), because this really belongs here...

I was just reviewing, and realized that what I've been saying about how I've been filtering out sequences isn't quite right.

What we're actually doing is this: https://github.com/matsengrp/cft/blob/41247dea8a8729750dde8364984512739f4e0bf4/bin/process_partis.py#L157-L167. As mentioned in the highlighted comment, we're removing sequences for which:

I think this was something @psathyrella and I settled on as a temporary solution at a point where we realized that the productivity information partis was sticking in the output was flawed (IIRC, it was not based on the indel reversed sequences or some such). Assuming we don't go the route of taking care of this filtering upstream of cft, would it make sense to switch to the updated productivity information coming out of partis?

matsen commented 7 years ago

I very much think that we need to make sure sequences are in frame. Unless @psathyrella says otherwise I have no reason to doubt partis' ability to evaluate that.

psathyrella commented 7 years ago

Chris is right that about six (er, +/- 3) months ago we realized that my functionality information wasn't accounting for shm indels properly (i.e. the functionality information was partly wrong for the ~1% of sequences with SHM indels). It's been fixed for a while though.

metasoarous commented 7 years ago

So do we think filtering by in_frame is enough? I seem to recall a discussion with either @psathyrella or @lauranoges where it was suggested that a sequence with a stop codon towards the end of the sequence might be biologically valid, which is why I didn't just check for presence of stops. Am I remembering/understanding this correctly? If in_frame isn't enough, how many stops should we allow then?

Regarding whether we should do filtering of such things upstream of cft, that would be my preference, all other things being equal. SCons build nests have to be statically/pre-build inferred from source input files. So if the set of build steps inferred for some cluster end up leading to a cluster with fewer than 3 sequences after filtering, but not before, we end up failing to build a tree, and we start having to get into weird error handling territory. I can continue to work around this as I've been doing, by ignoring such build errors, but I don't like doing this cause it could end up masking real errors cropping up in the future. I can look at filtering small clusters out in the pre-build phase (computation of the build steps), and this should be pretty easy, particularly if we're just checking for in_frame. But if folks have other reasons for which they'd rather filter such sequences out before cft sees them, it would be no sweat off my back.

lauradoepker commented 7 years ago

We agreed to allow up to one STOP codon and no more. We ( @matsen, @metasoarous @psathyrella ) also recently discussed that if there are many sequences in the partition, we could filter to zero STOP codons, but I'm worried about missing some group that had a PCR error (to a STOP codon) early in the process. Anyway, let's filter to =<1 stop.

psathyrella commented 7 years ago

I think the main reason for filtering on functionality after all the partis stuff is that that stuff requires so much more computing resources -- i.e., if we change our mind again about what to keep or not, that's another week or so of me babysitting jobs through the various bits of the cluster.

metasoarous commented 7 years ago

@psathyrella Yeah, I don't want you to have to do that now. I meant that if it's desired for your next run of the pipeline, I can hobble along on with my current setup for now.

@lauranoges stops =< 1 it is!

metasoarous commented 7 years ago

This is now implemented in terms of in_frame or stops =< 1.