lauradoepker closed this issue 7 years ago
I think the right tack here depends a little bit on what you're trying to get at. If you're interested in getting a better sense of the distribution of diversity, we could maybe size the nodes based on how many times a sequence has been seen? And in this case, if we do #125, we may also want to further aggregate sequence counts based on our tree pruning/downsampling prior to dnaml/dnapars.
If, however, what you're really interested in is how many times an exact sequence has been seen in order to decide whether or not it might be subject to sequencing error or some such (or for whatever other technical reasons), then we might not want to aggregate based on the downsampling, and perhaps there would be a better way of representing this information than with node size.
I don't know what the logic is to the `-*` suffixes; I've actually been curious about that. Also note that there's yet another level of deduplication when we merge timepoints that have overlapping sequences. So there's lots of fun to be had here.
Great points @metasoarous. I am attracted to both options you laid out -- I'd like to collapse similar/same nodes as you suggested, but when it comes down to singular leaf sequences that we may be picking, I'd still like to know how many times each individual sequence has been seen, to rule out suspicion of sequencing error. Are both possible? Do we need @psathyrella's opinion on the singular leaf seq frequency?
Incidentally, at first pass Brianna's project shows that our node choices yield better-behaving antibodies than our leaf choices, which I guess isn't surprising because the nodes are fairly safe bets regarding sequence (and they actually act as a form of consensus error correction).
Very interesting (regarding Brianna's findings)!
I was postponing answering to try and find where vlad sticks that info... but I didn't see it immediately, though I know it's there somewhere. Anyway, getting vlad's deduplication numbers is just a matter of me or chris pulling the info out of some file we're not using at the moment in /fh/fast/matsen_e/kate-qrs/. My deduplication numbers are already in the annotation csv, so you could probably use those for testing already.
I don't think I have an opinion on how the info gets presented, though.
@csmall ok, figured out the info for vlad's deduplication. The uids that we're using are just `n-m`, for `n` an arbitrary index (well, it's the rank when sorted by number of exemplars), and `m` the number of exemplars of that sequence, i.e. how many duplicates. E.g. `435-19` is the 435th most common sequence in the file, and it occurred 19 times.
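If that's the scheme, decoding these uids is straightforward. A minimal sketch, assuming the uids really are just `rank-count` strings (the function name here is mine, not anything in the pipeline):

```python
def parse_uid(uid):
    """Split a collapsed-fasta uid like '435-19' into (rank, count).

    rank:  the sequence's rank when sorted by abundance
    count: how many exemplars (duplicates) were collapsed into it
    """
    rank, count = uid.lstrip(">").split("-")
    return int(rank), int(count)

# parse_uid("435-19") -> (435, 19)
```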
Looking in, e.g., this dir: `/fh/fast/matsen_e/data/kate-qrs/processed_data/Hs-LN1-5RACE-IgG/03_fastx_out/`, which has a collapsed and non-collapsed fasta. We're using the collapsed fasta as input, and the first five uids are:
>1-1359
>2-739
>3-613
>4-603
>5-522
and if you grep for their corresponding nucleotide sequences in the non-collapsed file, you get the number on the right side of the dash.
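That grep check can also be scripted. A rough sketch, assuming plain fasta parsing and nothing project-specific; the collapsed/non-collapsed filenames below are guesses, not the actual names in that directory:

```python
from collections import Counter

def read_fasta(path):
    """Yield (uid, sequence) pairs from a fasta file."""
    uid, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if uid is not None:
                    yield uid, "".join(seq)
                uid, seq = line[1:], []
            else:
                seq.append(line)
    if uid is not None:
        yield uid, "".join(seq)

base = "/fh/fast/matsen_e/data/kate-qrs/processed_data/Hs-LN1-5RACE-IgG/03_fastx_out/"
# Filenames are hypothetical; substitute the actual collapsed/non-collapsed fastas.
collapsed = list(read_fasta(base + "collapsed.fasta"))[:5]
uncollapsed_counts = Counter(seq for _, seq in read_fasta(base + "uncollapsed.fasta"))

for uid, seq in collapsed:
    claimed = int(uid.split("-")[1])
    # the last two numbers printed should agree if the uid convention holds
    print(uid, claimed, uncollapsed_counts[seq])
```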
@psathyrella Heh; just now seeing this. Good find! And thanks!
So... to recap here...
Duplication needs to be accounted for by:
Thinking about this a little more, it seems that if we are taking duplicity into account in determining how much we "believe" in a sequence, we should maybe also take this into account in our pruning strategy (#125). Perhaps there we could run `rppr min_adcl_tree` but take as exemplar whichever sequence has the highest duplicity? I could imagine a case where the sequence that `min_adcl_tree` picks ends up with much lower duplicity than other sequences in the corresponding cluster. Certainly, min_adcl's pick is really what we want for representing the overall phylogenetic diversity; and additionally, `adcl` has been shown to somewhat effectively filter out sequences with sequencing errors. But it should still be possible to navigate to other sequences in the cluster based on duplicity? This means some more nuanced interface work, but it's probably the direction we need to head to really solve this issue.
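To make that concrete, here's a sketch of the exemplar-swapping I have in mind; `clusters` (mapping each sequence `min_adcl_tree` kept to the members of its cluster) and `duplicity` (mapping uid to duplicate count) are hypothetical inputs, not existing data structures:

```python
def reexemplar_by_duplicity(clusters, duplicity):
    """For each min_adcl cluster, swap the kept sequence for the member with
    the highest duplicity (ties broken in favor of the original pick)."""
    return {
        kept: max(members, key=lambda uid: (duplicity.get(uid, 0), uid == kept))
        for kept, members in clusters.items()
    }
```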
This is now merged; for minadcl downsampling ("thinning"), duplicities/timepoints for the thinned-out sequences are now aggregated onto whichever kept sequence matches most closely. As discussed above, this potentially conflates a couple of issues: we'd like to know exact duplicities to have a handle on potential sequencing error, but we also want a fuller sense of the distribution of diversity across the tree and across timepoints.
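For reference, the aggregation described here amounts to something like the following sketch (not the actual merged code; `kept_seqs`/`thinned_seqs` map uid to aligned sequence and `duplicity` maps uid to count, all hypothetical names):

```python
def hamming(a, b):
    """Count mismatched positions between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def aggregate_thinned_duplicities(kept_seqs, thinned_seqs, duplicity):
    """Add each thinned-out sequence's duplicity onto the closest kept sequence."""
    totals = {uid: duplicity.get(uid, 0) for uid in kept_seqs}
    for uid, seq in thinned_seqs.items():
        closest = min(kept_seqs, key=lambda k: hamming(kept_seqs[k], seq))
        totals[closest] += duplicity.get(uid, 0)
    return totals
```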
For now, it seems that including the minadcl-thinned duplicity is the most valuable thing, so I'm gonna stick with what we have and close this issue. But if there is concern about wanting to see exact sequence duplicities, I've set things up so that this data is preserved, and once we switch to Auspice for tree viewing, it should be possible to toggle between the two metrics.
At group meeting last week, we discussed the importance of knowing whether a leaf sequence was sequenced once, a few times, or many times. Vlad's deduplication collapses sequences that are the same sequence and the same length; @psathyrella's deduplication additionally collapses those that are the same sequence but a different length.
@psathyrella: Can we track the number of collapsed sequences represented by each 'leaf' sequence on the trees? Currently, most of the leaf sequences have a `-1` or `-2` or even `-5` at the end of their names, but this doesn't correspond to their frequency, since these individual sequences were chosen at random to represent their duplicate group.
What do you think will work?