revisit subsampling? - GitHub issues

matsen commented 7 years ago

At some point it would seem worthwhile to revisit the random subsampling used before building the clonal families in the case that we are interested in every last event that went into a seed lineage.

Perhaps a good thing to do would be for @lauranoges to pick a clonal family for which we would like better resolution on the path to maturity (without caring so much about the other sequences in the clonal family), and we can see what happens if @psathyrella just looks for sequences quite close to the seed.
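For concreteness, here is a minimal sketch of what "sequences quite close to the seed" could mean, assuming the sequences are already aligned to the same length and that closeness is measured by Hamming distance. The function names, the dict layout, and the `max_dist` threshold are hypothetical and are not part of partis or cft.

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def close_to_seed(seqs, seed, max_dist):
    """Return ids of sequences within max_dist mismatches of the seed.

    seqs: dict mapping sequence id -> aligned nucleotide string
          (a hypothetical layout chosen for this example).
    """
    return [sid for sid, s in seqs.items() if hamming(s, seed) <= max_dist]


# Toy data, not real sequences.
seqs = {"a": "ACGTAC", "b": "ACGTTT", "c": "TTTTTT"}
print(close_to_seed(seqs, "ACGTAC", max_dist=2))
```

A distance cutoff like this would replace the random draw with a deterministic selection around the seed; what threshold is sensible would have to be decided per clonal family.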

metasoarous commented 7 years ago

Another thing we brought up in discussion about subsampling is that if we thought about it as a clustering problem, we could aggregate the duplicity and timepoint metadata for each cluster. @lauranoges et al. would like to avoid (e.g.) a situation where a small number of sequences from one timepoint belonging to some clonal family got completely left out of the subsampling, making it look like there were only hits from one timepoint in the corresponding cftweb tree(s). Something like UCLUST might do the trick here.
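A rough sketch of the aggregation idea, assuming some clustering step (UCLUST or otherwise) has already assigned each sequence a cluster id. The record keys `cluster`, `duplicity`, and `timepoint` are hypothetical names for this example, not the actual cft schema.

```python
from collections import defaultdict


def aggregate_clusters(records):
    """Sum duplicity per cluster and per timepoint within each cluster.

    records: iterable of dicts with hypothetical keys
             'cluster', 'duplicity' and 'timepoint'.
    """
    summary = defaultdict(lambda: {"duplicity": 0, "timepoints": defaultdict(int)})
    for r in records:
        entry = summary[r["cluster"]]
        entry["duplicity"] += r["duplicity"]
        entry["timepoints"][r["timepoint"]] += r["duplicity"]
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {
        cid: {"duplicity": e["duplicity"], "timepoints": dict(e["timepoints"])}
        for cid, e in summary.items()
    }


# Toy data: cluster c1 has hits from two timepoints.
records = [
    {"cluster": "c1", "duplicity": 3, "timepoint": "d100"},
    {"cluster": "c1", "duplicity": 1, "timepoint": "d500"},
    {"cluster": "c2", "duplicity": 2, "timepoint": "d100"},
]
print(aggregate_clusters(records))
```

Keeping one representative per cluster together with this summary would mean a timepoint with only a few sequences still shows up in the metadata, even if none of those sequences is the representative.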

However, I think this heads in a little bit of a different direction than what you're talking about, Erick (looking for sequences close to the seeds), and kind of plays into the seedlineage vs minadcl split. Would it be crazy to run partis on two subsampling strategies?

psathyrella commented 7 years ago

I think before adding more to the docket of things that need to be run we should have a better idea of how it could actually change a biological conclusion that we want to make. To be clear, the effect of this is that, on some small fraction of the samples, we could double or triple the size of the final clusters. The vast majority of these clusters are either massive, or of trivial size (i.e. just the seed sequence). In the former case, if going from, say, 2k to 4k sequences changes our conclusions, something's badly wrong in our analysis. In the latter, we're not making any conclusions based on a single/few sequences, so that shouldn't change anything either.

matsen commented 7 years ago

Thanks, all. @psathyrella, yes, I definitely want to think this through clearly.

The case in which this could make a difference is, like Chris said, just when we are looking very closely at a seed lineage. Even in the big clusters there can be relatively few sequences that branch directly off of the root-to-seed path. If we can get even one more of those, that provides another intermediate that can be tested in the lab. If we double the size of the cluster, that doubles the potential for getting sequences close to the seed lineage path. The flip side is that we can be pretty strict in our clustering: if the inferred naive sequence is too far from the seed naive, we can toss it.

I was also thinking this weekend that if we are down-sampling anyway, we might as well go for a stricter quality control on the pre-processing side.

Thoughts, @lauranoges ?

lauradoepker commented 7 years ago

Yes @matsen and @psathyrella : We might as well downsample to a higher-quality sequence set (throw out stop codons). However, if the STOP codon arose early, then the entire population of a given sequence would get thrown out, which is biased and could thwart us later without our knowledge... which is scary.

QA255.105-Vh and -Vk are examples of lineages that we are really interested in. Also BF520.1-Vh and -Vk.

I didn't realize we were downsampling until it was mentioned last week. Are we downsampling randomly @psathyrella or are we downsampling by intentionally picking samples that are _____? (close to seed? good quality? or what?)

psathyrella commented 7 years ago

yeah randomly

metasoarous commented 7 years ago

I was just reviewing, and realized that what I've been saying about how I've been filtering out sequences isn't quite right.

What we're actually doing is this: https://github.com/matsengrp/cft/blob/41247dea8a8729750dde8364984512739f4e0bf4/bin/process_partis.py#L157-L167. As mentioned in the highlighted comment, we're removing sequences for which:

there are stop codons and
the length is not a multiple of three

I think this was something @psathyrella and I settled on as a temporary solution at a point where we realized that the productivity information partis was sticking in the output was flawed (IIRC, it was not based on the indel-reversed sequences or some such). Assuming we don't go the route of taking care of this filtering upstream of cft, would it make sense to switch to the updated productivity information coming out of partis?
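For anyone skimming without following the link, the filter described above can be sketched roughly as follows. This is not the code in `process_partis.py`: it assumes the reading frame starts at the first base, and it reads the two conditions as alternatives (a sequence is dropped if either holds), which is an assumption about the list above. All function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_stop_codon(seq):
    """True if any in-frame codon (frame starting at position 0) is a stop."""
    seq = seq.upper()
    return any(seq[i:i + 3] in STOP_CODONS for i in range(0, len(seq) - 2, 3))


def in_frame(seq):
    """True if the sequence length is a multiple of three."""
    return len(seq) % 3 == 0


def keep(seq):
    """Keep sequences that are in frame and contain no in-frame stop codon."""
    return in_frame(seq) and not has_stop_codon(seq)


# Toy sequences, not real data.
for s in ["ATGGCC", "ATGTAA", "ATGGC"]:
    print(s, keep(s))
```

The linked commit is the authority on what was actually filtered; the sketch only illustrates the kind of check being discussed.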

metasoarous commented 6 years ago

@lauranoges @psathyrella What's the current status of this issue? I seem to recall @psathyrella did some tinkering with the downsampling, so are we good to close here?

psathyrella commented 6 years ago

yeah, no longer downsampling.

metasoarous commented 6 years ago

Great; thanks! Closing!

matsengrp / cft

revisit subsampling? #179