Closed eharkins closed 4 years ago
Seems like a great first issue for @jgallowa07 , mentored by you, don't you think?
And let me know if i should change something either in the implementation or the docs. As I say ^, it needed to happen, since as it was it was inconsistent, but the change definitely breaks backward compatibility which of course makes me twitch.
Yes @matsen I thought the same after posting. @jgallowa07 let's chat about this next time we talk!
Very up for this - seems like as good a place to jump in as any!
This commit has most (all?) of the relevant changes, but if it were me I think the description above ^ would be more useful. But either way!
Thanks @psathyrella . Indeed it might be getting into the weeds for me to have requested the relevant commits but i thought they might be helpful alongside your description. The description makes sense now though (after reminding myself about the different multiplicity-related keys you describe) and I think tells me all I need to know after all.
@psathyrella the way we make these pie charts:
for leaves in Olmsted is according to the duplicates
multiplicity information; the total multiplicity for a given sequence from partis (a leaf in an Olmsted tree) is broken up and attributed (colored according) to each of the timepoints corresponding to the duplicates
for that sequence. Is this information no longer available / potentially inaccurate?
If so we'll need to consider how to address the biological question these pie charts were answering in a different way (or remove them altogether if it turns out they were not providing a valued analysis).
Nothing about the 'duplicates' key changed. I'm not sure which multiplicity you're saying it currently breaks up, because the problem with the previous version was that the 'multiplicity' that partis outputs was only the input meta info multiplicity, and didn't include any info from 'duplicates'. This was a) confusing since if you're only looking at the output file, you'd expect 'multiplicity' to include all multiplicity info, and b) necessarily wrong in some instances, since there was no way to keep track of duplicate sequences that had non-1 multiplicity from input info. So the new 'multiplicity' includes the sum of duplicates, and of the input meta info 'multiplicities' for each of those duplicates.
Currently, we do something like this:
x
in a partis cluster annotation's unique_ids
key:y
in [x + x['duplicates']]
(i.e. all the unique_ids
with that observed sequence):timepoint
, multiplicity
fory
in the per-sequence-meta-file
for the sample (which I believe comes from https://github.com/psathyrella/partis/blob/dev/docs/subcommands.md#input-meta-info - and of course this only happens when that per-sequence-meta-file
is specified in the CFT input)multiplicity
s for each y
by timepoint
as way of tracking x
's multiplicity by timepoints where we observed it; each of these timepoints corresponds to a section in x
's pie chart in Olmsted.It sounds like using
the new 'multiplicity' [which] includes the sum of duplicates, and of the input meta info 'multiplicities' for each of those duplicates.
is a good way to get an accurate / all-encompassing multiplicity value per-sequence.
Currently I don't think we use that key at all, but rather do the process I describe to get information about the timepoints (from duplicates) that contribute to the multiplicity
value. Given that, it doesn't sound your described changes necessitate changing this process (unless there is a better way to do it - which there very well might be)?
yeah sounds like you don't use the 'multiplicities' key, which is the only thing that changed.
After this ^ and some extra discussion with Duncan, it doesn't seem like anything needs to change in CFT. In summary, we basically do the same aggregation in CFT of duplicate sequences' multiplicity that is done in partis, except we preserve the information about how much each timepoint contributed to the total. Ideally we would just use the number from partis but because we still care about
the information about how much each timepoint contributed to the total
for the pie charts (see above), and because this information will likely not become available in partis output, we will continue to use our code instead of using the total multiplicities values from partis.
Before we run on more data "in production" (before we run samples from the next subject from the Overbaugh lab), it might be good to validate that our aggregation of multiplicities == that of partis. This will require Duncan rerunning a sample with input meta info and then me or @jgallowa07 running that partis output through CFT and checking the two values are the same.
I'm renaming the issue accordingly and leaving it open until we've done that.
Sweet! Duncan, let us know when the rerun is done and we'll have Jared run CFT with Eli's guidance.
Nevermind, sorry, I remembered wrong -- cft doesn't get the multiplicity info from partis, it takes it directly from the datascripts-copied file from vlad.
Closing this for now since validating as I was describing doesn't seem to make sense given the above (i.e. because partis doesn't normally do aggregation of multiplicity in a way that would make sense to compare with how CFT currently does it).
Just for the sake of posterity, it's not that partis aggregates the multiplicity information in one way or another (although the change that started this issue makes it so it now does almost the same thing as cft), it's that the way we have always run on laura's data doesn't involve partis interacting with the multiplicity information at all. We could tell partis to read the multiplicity info, and then the info would be in the partis output file, but that seems like a lot of work just to check that cft is correctly adding a few numbers together.
From @psathyrella:
This sounds like it will require a bunch of changes to this code https://github.com/matsengrp/cft/blob/master/bin/process_partis.py#L246 but will probably end up being simpler at the end of the day. Will take a more detailed look and try to make that happen when I get a chance.