hammerlab / guacamole

Spark-based variant calling, with experimental support for multi-sample somatic calling (including RNA) and local assembly
Apache License 2.0
83 stars 21 forks source link

Fix ordering of tumor and normal read data #575

Closed arahuja closed 7 years ago

arahuja commented 7 years ago

In https://github.com/hammerlab/guacamole/commit/995af3dbc9c7dcb595c4389a87b4a6fce940cfebthe order of the tumor read and normal reads was swapped, mislabeling tumor as normal and vice-versa.

At some point it seems the pileups returned from pileupFlatMapTwoSamples were also flipped.


This change is Reviewable

coveralls commented 7 years ago

Coverage Status

Coverage remained the same at 75.157% when pulling 89e49b9f5015dedb12f722bdc52aa1cb9aa8193d on tn-flip into ec529d58eece5d93445be772cb7b36d2425cf887 on master.

ryan-williams commented 7 years ago

Thanks for catching this, I think all that happened is that I declared the {normal,tumor} order here and didn't change the various other places! pileupFlatMapTwoSamples is agnostic to the sample type, afaict.

I was working on some penance tests and things around this to add here but the scope was creeping a bit so I'll send those separately, so we can merge this in now.

arahuja commented 7 years ago

Yea, pileupFlatMapTwoSamples is agnostic to the sample type, but returns the pileups in order of SampleId which is set based on the ordering of reading in the files. ... Somewhat convoluted and probably makes sense to make that mapping more explicit than relying on ordering. Took me a while to figure out which would be first or second.

ryan-williams commented 7 years ago

based on the ordering of reading in the files … makes sense to make that mapping more explicit than relying on ordering

This was definitely a spooky bug; I spent some time thinking about it and auditing the code paths and have a few takeaways from your having found and fixed it (thanks again).

Mainly: I think the API here is already pretty sane, minimal, and explicit; I implemented one side of it backwards, and the effects hid in our 25% of still-untested code.

I'm working on some PRs to address this and will have them out shortly. Following is a longer discussion.

We have some readset-loading plumbing in ReadSets that we can use for 1, 2, or many samples.

We provide an explicit API for the 2-sample, normal+tumor case; the main and only thing it needs to do is not mix up the order between the 2 filenames going in and the 2 read-sets coming out (or, if it does e.g. reverse the order, callers need to know that and receive the outputs accordingly).

On the way in, the user interacts with this API using the CLI args --normal-reads and --tumor-reads, which are pretty explicit; TumorNormalReadsArgs is exactly an adapter from these flags to lower-level, sample-semantics-agnostic ReadSets plumbing.

On the way out, the user receives the loaded read-sets from a function that is pretty explicit about the order in which they are coming out, ReadSets.loadTumorNormalReads, the docs for which say:

Given arguments for two sets of reads (tumor and normal), return a pair of (tumor, normal) read sets.

All occurrences of an arbitrarily-chosen {normal,tumor} or {tumor,normal} ordering in the codebase are confined to those two already-pretty-explicit APIs (which are exactly the entrance to and exit from the lower ReadSets abstraction. Importantly, that ordering does not leak out into PileupFlatMapUtils or the sample-{number,semantics}-specific read-partitioning logic. This is a basically correct state of affairs, imho, as opposed to a world where we needed that order to be consistent in more than two places.

Of course, even with just those minimal two places to worry about, I managed to break things by changing the order on one side and not the other! I'm open to discussing how we might bug-proof things further, but to me this is more a case of "I implemented a sane and reasonably-compartmentalized abstraction incorrectly, and added no tests to verify its correctness" than, say, "a brittle ordering of inputs has proliferated around the code-base", so with your having fixed this one way for these to be out of sync, and my forthcoming sanity-tests of these code-paths, we should be in a pretty stable place.

All that said, I am all for more explicitness in these kinds of matters in general, so I'm happy to discuss ways of doing that. The main opportunity I see for that here would involve embedding normal-ness and tumor-ness into the type system, e.g. with separate classes for "normal read-set" and "tumor read-set". Even then, at some point we would have to map one input (--normal-reads) to the "normal read-set" type and one (--tumor-reads) to the "tumor read-set" type, and the kind of bug you found here could analogously mix things up at that stage.

Otherwise, we have to rely on ourselves and our tests to save ourselves from doing things like:

val (normal, tumor) = functionThatReturnsTumorThenNormalInThatOrderDontMessUpTheOrder_!(…)

which is basically what I did here, and didn't test; sorry again!

Anyway, I'm interested in your thoughts on any/all of this!

arahuja commented 7 years ago

Importantly, that ordering does not leak out into PileupFlatMapUtils or the sample-{number,semantics}-specific read-partitioning logic. This is a basically correct state of affairs, imho, as opposed to a world where we needed that order to be consistent in more than two places.

This is the part that worries me more, as I think right now we do have to be consistent in multiple places. I'd say I'm less worried about the input args load* functions then pileupFlatMapMultipleRDDs as it seems the ordering does leak to that? Perhaps I've read this wrong, but pileupFlatMapMultipleRDDs expects a functions (Pileup, Pileup) => and it is not obvious which one would correspond to tumor/normal. The one that comes first in the tuple depends on the SampleId, (right? perhaps I have this wrong) and the SampleId depends on the order in which the files were opened?

We could have the user-code (the function passed into pileupFlatMap not rely on the ordering and check sample names I guess?

ryan-williams commented 7 years ago

Yea, good point, the loadTumorNormalReads is actually pointless right now, the real place where the unpacking order matters is when emerging from pileupFlatMapTwoSamples; I'll kill the former and better-document the latter's role when I add these tests.

I think rewording what I said before from "SomaticStandard needs to receive the two loaded read-sets in the right order" to "SomaticStandard needs to receive the two pileups in the right order" makes everything I said still apply: there's one entrance (--{normal,tumor}-reads flags) and one exit ((pileupNormal, pileupTumor) => … param-names) to the underlying machinery that have to match.

In a world where we have a test that sanity-checks this, I'm not that concerned about this continuing to be a problem.

In particular, it's not clear that pulling them apart by sampleName buys us much, since that field is just set in the same code-paths that we're already relying on to keep the order straight.