matsengrp / cft

Clonal family tree
5 stars 3 forks source link

Fix unchecked index in merge_selection_metrics.py #264

Closed eharkins closed 5 years ago

eharkins commented 5 years ago

See https://github.com/matsengrp/cft/pull/261#discussion_r255634321

When we merge selection metrics from a partis annotate command, we take events[0] instead of finding the appropriate cluster annotation as is done in process_partis.py

Duncan said:

It pains my soul to use an index on a list whose length I'm not verifying. That aside, I would strongly recommend using the existing functions in partis for reading partis output files, like you're doing in process_partis.py, you'll be much less likely to run into problems.

That doesn't tell you which event to take, though -- that's between what command was used to create the file and what you want here. It would be best to choose the event with the right uids that were passed in, to enforce correspondence between the previous calling line and this line reading the output. But it would also be fairly safe to do something like

_, annotation_list, _ = utils.read_output(filename)
if len(annotation_list) != 1:
    raise Exception('expected one event from %s but got %d' % (filename, len(annotation_list)))
event = annotation_list[0]

@metasoarous were you taking the index 0 event before because partis annotate is called in this case on a cluster-by-cluster basis so there is only ever one event and therefore it's ok to take the first so long as we have a check like Duncan suggests? @psathyrella, since events comes from the results of annotate with--get-tree-metrics on a particular tree, does it make sense to assume its ok to take events[0]? Or should we do something like process_partis.py?

psathyrella commented 5 years ago

It depends what the annotate command is. As I understand it, it's running annotate with --all-seqs-simultaneous, so the idea is there will be one event in the output, but as we've discussed in the past it's impossible to simply force all together in all circumstances, so it seems like there could be more than one event, and assuming any particular sorting of these events would be a bad idea.

But even that aside, it seems needlessly risky to me to set it up so that if we ever change the annotate command but forget to update this that it will break horribly. I think it's much better to simply enforce that what you're expecting to get back from the annotate command is what you get.

eharkins commented 5 years ago

Makes sense, thanks!

metasoarous commented 5 years ago

@psathyrella This seems to further highlight to me the value of pulling out the lbi/lbr computation code from the annotate command. It feels a little silly to be worrying about multiple events for a computation which really just depends on an input tree.

psathyrella commented 5 years ago

@csmall well yes, in principle the bare lb metrics only depend on the tree, but the stuff around them depends on other things that are in the annotation, e.g. affinity and multiplicity. I think I dug up the details the first time we discussed this, but I agree it is worth eventually separating the two, but it requires a bunch of code duplication, since we'll have to have the infrastructure to pull the other info out of the annotation, pass it in separately, verify it's sensible, and then put it back into the annotation after. And as context, cft is not the only place the lb functions are getting called from. It's on my todo list, but I won't get to it in the next couple days.

metasoarous commented 5 years ago

LOL! Once again, I am not csmall here. Sorry Craig...

I'm still not sure what you mean by "the stuff around them", though I do remember us talking about it at lunch. If there's anything you can link us to here for posterity that would be great.

I'm all for reducing code duplication when it doesn't overly complicate things. It would be great if there was a way that the additional information could be optionally passed in as an API call or some such, as needed. But I realize it's sometimes easier said than done and leave it up to you what the best solution is.

Glad this is on your todo list; no major rush as far as I'm concerned.