matsengrp / cft

Clonal family tree
5 stars 3 forks source link

Validate parsing of partis multiplicity #310

Closed eharkins closed 4 years ago

eharkins commented 4 years ago

From @psathyrella:

I've changed the way partis handles the 'multiplicities' key, which I think affects cft/olmsted, can you lmk if this causes any problems? Previously the 'multiplicities' key held multiplicity info that was passed in with --input-metafname (which we used for laura's data), and the 'duplicates' key held the sequence names of sequences that partis collapsed after removing constant/non-coding regions. Having them completely separate like this wasn't consistent, becuase 1) there wasn't a reliable way to combine them, and 2) any 'multiplicities' info from collapsed duplicates was lost. So I changed it so when i collapse duplicates after trimming constant regions, they and their multiplicities get added to the non-collapsed sequence's multiplicity. The net result is that where previoulsy in partis output you would've have to add together 'multiplicities' and len 'duplicates' (and it could've been wrong), now you should just use 'multiplicities', and it'll include duplicate info https://github.com/psathyrella/partis/blob/dev/docs/subcommands.md#input-meta-info

This sounds like it will require a bunch of changes to this code https://github.com/matsengrp/cft/blob/master/bin/process_partis.py#L246 but will probably end up being simpler at the end of the day. Will take a more detailed look and try to make that happen when I get a chance.

matsen commented 4 years ago

Seems like a great first issue for @jgallowa07 , mentored by you, don't you think?

psathyrella commented 4 years ago

And let me know if i should change something either in the implementation or the docs. As I say ^, it needed to happen, since as it was it was inconsistent, but the change definitely breaks backward compatibility which of course makes me twitch.

eharkins commented 4 years ago

Yes @matsen I thought the same after posting. @jgallowa07 let's chat about this next time we talk!

jgallowa07 commented 4 years ago

Very up for this - seems like as good a place to jump in as any!

psathyrella commented 4 years ago

This commit has most (all?) of the relevant changes, but if it were me I think the description above ^ would be more useful. But either way!

eharkins commented 4 years ago

Thanks @psathyrella . Indeed it might be getting into the weeds for me to have requested the relevant commits but i thought they might be helpful alongside your description. The description makes sense now though (after reminding myself about the different multiplicity-related keys you describe) and I think tells me all I need to know after all.

eharkins commented 4 years ago

@psathyrella the way we make these pie charts: Screen Shot 2020-05-04 at 2 26 53 PM

for leaves in Olmsted is according to the duplicates multiplicity information; the total multiplicity for a given sequence from partis (a leaf in an Olmsted tree) is broken up and attributed (colored according) to each of the timepoints corresponding to the duplicates for that sequence. Is this information no longer available / potentially inaccurate?

If so we'll need to consider how to address the biological question these pie charts were answering in a different way (or remove them altogether if it turns out they were not providing a valued analysis).

psathyrella commented 4 years ago

Nothing about the 'duplicates' key changed. I'm not sure which multiplicity you're saying it currently breaks up, because the problem with the previous version was that the 'multiplicity' that partis outputs was only the input meta info multiplicity, and didn't include any info from 'duplicates'. This was a) confusing since if you're only looking at the output file, you'd expect 'multiplicity' to include all multiplicity info, and b) necessarily wrong in some instances, since there was no way to keep track of duplicate sequences that had non-1 multiplicity from input info. So the new 'multiplicity' includes the sum of duplicates, and of the input meta info 'multiplicities' for each of those duplicates.

eharkins commented 4 years ago

Currently, we do something like this:

  1. for a given sequence with unique_id x in a partis cluster annotation's unique_ids key:
  2. for each unique_id y in [x + x['duplicates']] (i.e. all the unique_ids with that observed sequence):
  3. look up the timepoint, multiplicity fory in the per-sequence-meta-file for the sample (which I believe comes from https://github.com/psathyrella/partis/blob/dev/docs/subcommands.md#input-meta-info - and of course this only happens when that per-sequence-meta-file is specified in the CFT input)
  4. sum up the multiplicitys for each y by timepoint as way of tracking x's multiplicity by timepoints where we observed it; each of these timepoints corresponds to a section in x's pie chart in Olmsted.

It sounds like using

the new 'multiplicity' [which] includes the sum of duplicates, and of the input meta info 'multiplicities' for each of those duplicates.

is a good way to get an accurate / all-encompassing multiplicity value per-sequence.

Currently I don't think we use that key at all, but rather do the process I describe to get information about the timepoints (from duplicates) that contribute to the multiplicity value. Given that, it doesn't sound your described changes necessitate changing this process (unless there is a better way to do it - which there very well might be)?

psathyrella commented 4 years ago

yeah sounds like you don't use the 'multiplicities' key, which is the only thing that changed.

eharkins commented 4 years ago

After this ^ and some extra discussion with Duncan, it doesn't seem like anything needs to change in CFT. In summary, we basically do the same aggregation in CFT of duplicate sequences' multiplicity that is done in partis, except we preserve the information about how much each timepoint contributed to the total. Ideally we would just use the number from partis but because we still care about

the information about how much each timepoint contributed to the total

for the pie charts (see above), and because this information will likely not become available in partis output, we will continue to use our code instead of using the total multiplicities values from partis.

Before we run on more data "in production" (before we run samples from the next subject from the Overbaugh lab), it might be good to validate that our aggregation of multiplicities == that of partis. This will require Duncan rerunning a sample with input meta info and then me or @jgallowa07 running that partis output through CFT and checking the two values are the same.

I'm renaming the issue accordingly and leaving it open until we've done that.

matsen commented 4 years ago

Sweet! Duncan, let us know when the rerun is done and we'll have Jared run CFT with Eli's guidance.

psathyrella commented 4 years ago

Nevermind, sorry, I remembered wrong -- cft doesn't get the multiplicity info from partis, it takes it directly from the datascripts-copied file from vlad.

eharkins commented 4 years ago

Closing this for now since validating as I was describing doesn't seem to make sense given the above (i.e. because partis doesn't normally do aggregation of multiplicity in a way that would make sense to compare with how CFT currently does it).

psathyrella commented 4 years ago

Just for the sake of posterity, it's not that partis aggregates the multiplicity information in one way or another (although the change that started this issue makes it so it now does almost the same thing as cft), it's that the way we have always run on laura's data doesn't involve partis interacting with the multiplicity information at all. We could tell partis to read the multiplicity info, and then the info would be in the partis output file, but that seems like a lot of work just to check that cft is correctly adding a few numbers together.