Added filtering by self ligandome, minor refactoring

iskandr commented 9 years ago

Added LazyLigandomeDict which loads peptide sets from a directory with allele-specific peptide files
Extended BindingPrediction into EpitopePrediction with extra fields {contains_mutant_residues, occurs_in_self_ligandome, mutant}
Split out filtering by expression into filters
Added unit tests for filtering functions
Split mutant epitope prediction tests into separate module for class II predictions

tavinathanson commented 9 years ago

Reviewed 18 of 18 files at r1. Review status: 17 of 18 files reviewed at latest revision, 21 unresolved discussions, some commit checks failed.

^{.travis.yml, line 32 [r1] (raw file):} Yeah, I finally concede that this is the better approach. Was going to change it myself!

^{test/data.py, line 41 [r1] (raw file):} This name feels too general. (Confused me when reading the tests, as I couldn't figure out what variants was.) Maybe test_variants or cancer_test_variants? Or something? Same with the two names below?

^{test/data/tiny_test_ligandome_dir/A0201, line 1 [r1] (raw file):} Maybe good to have at least two subdirectories to test that both are used?

Also, good to try a different allele format for the other subdirectory?

^{test/test_effect_expression_filters.py, line 52 [r1] (raw file):} How about testing equal to threshold? <3 edge cases.

^{topiary/epitope_collection_helpers.py, line 90 [r1] (raw file):} Again, I think the x is unhelpful for readability since it's not obvious what sort of class x is.

^{topiary/epitope_prediction.py, line 32 [r1] (raw file):} Confusing to have one value that is actually a function of two other values in the tuple. What if this is just a normal class and mutant() is a method? Alternatively, are the former two necessary to store here?

^{topiary/epitope_prediction.py, line 35 [r1] (raw file):} Given that epitope predictions do still bind, I think EpitopeBindingPrediction would be more clear.

^{topiary/filters.py, line 71 [r1] (raw file):} Why isn't this called gene_id?

^{topiary/filters.py, line 90 [r1] (raw file):} Seems like more code re-use is possible between this and the function above it.

^{topiary/filters.py, line 132 [r1] (raw file):} Are these tested?

^{topiary/filters.py, line 155 [r1] (raw file):} Was hard for me to parse the double negatives in this sentence. Maybe: "If False, drop epitope predictions that aren't neoepitopes. In other words: drop predictions for epitopes that are not mutated or are in the self-ligandome." Accurate?

^{topiary/filters.py, line 158 [r1] (raw file):} Just a copy paste from before, right?

^{topiary/mutant_epitope_predictor.py, line 30 [r1] (raw file):} Unchanged from the old extract_mutant_peptides? Also, a dict of triples without any classes feels likes it's starting to become fairly complex. Why not just have result[effect] = seq[seq_start_offset:seq_end_offset], which appears to be what the caller is doing? (I imagine there is a reason, so I'm mostly just curious.)

Edit: oh, I guess you do use seq and start_offset in other places. I don't like how pervasive the triple is, though. Seems prone to error/not good for readability. Thoughts?

^{topiary/mutant_epitope_predictor.py, line 63 [r1] (raw file):} Sneaky; why?

Edit: oh, because you want to keep the self-ligandome matches? This feels unintuitive to me. Thoughts on a separate keep_self_ligandome_epitopes?

^{topiary/mutant_epitope_predictor.py, line 176 [r1] (raw file):} Did you change any functionality here? Looks like just code re-arrangement?

^{topiary/mutant_epitope_predictor.py, line 182 [r1] (raw file):} Given the defaults of this method, it seems pretty easy to accidentally get top_priority_effects when you actually want to use expression data by just forgetting to pass it in. Maybe you can remove the default? Or, other ideas? (High level comment: for software like this, I wonder how useful vs. dangerous some of these defaults really are.)

^{topiary/mutant_epitope_predictor.py, line 212 [r1] (raw file):} x is not a helpful name! Took me a while to figure out what it was representing. binding_prediction?

^{topiary/mutant_epitope_predictor.py, line 226 [r1] (raw file):} Ahh! This seems super confusing. I think we should always avoid overwriting fields for subclasses/extended classes, since I think some people are bound to reasonably assume that the fields with the same name are equivalent between related classes.

Can you just call the updated source_sequence and offset something new? e.g. epitope_source_sequence and epitope_offset.

^{topiary/mutant_epitope_predictor.py, line 232 [r1] (raw file):} Is this tested? I'm always concerned about off-by-1s.

^{topiary/mutant_epitope_predictor.py, line 248 [r1] (raw file):} epitope binding predictions?

^{topiary/mutant_epitope_predictor.py, line 288 [r1] (raw file):} Just to make sure I understand: we've got epitopes_from_variants and epitopes_from_mutation_effects. The former uses the latter, though it pre-filters so we can do a lot less effect prediction. Each use slightly different logic for FPKM filtering, necessitating some duplication of code/tests, because the latter needs to get the variant of each effect while the former just uses the variant.

What are the use cases for people using epitopes_from_mutation_effects and not being able to just start with epitopes_from_variants?

Comments from the review on Reviewable.io

iskandr commented 9 years ago

Review status: 17 of 18 files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.

^{test/data.py, line 41 [r1] (raw file):} :+1:

^{test/data/tiny_test_ligandome_dir/A0201, line 1 [r1] (raw file):} Subdirectories or files?

^{topiary/epitope_prediction.py, line 32 [r1] (raw file):} I think it's important to know why the epitope was considered wildtype: because it didn't contain a mutant residue or because it did but that sequence occurs elsewhere in the ligandome. I was thinking of breaking away from using namedtuple here, but that leads to a bigger refactoring that's probably best saved for a later PR.

^{topiary/epitope_prediction.py, line 35 [r1] (raw file):} I think that an epitope, by definition, has to bind MHC so that seems redundant.

^{topiary/filters.py, line 71 [r1] (raw file):} Copy paste in action :-)

^{topiary/filters.py, line 90 [r1] (raw file):} I feel the same way but then you end up with much more cryptic code like:

             expression_dict.get(feature_id, 0.0) > expression_threshold
                for feature_id in getattr(x, feature_name)),

which seemed possibly worse than just duplicating but keeping things clearer.

^{topiary/filters.py, line 132 [r1] (raw file):} No, adding tests

^{topiary/filters.py, line 155 [r1] (raw file):} I'll change this description a little. How about: "If True, then all predicted epitopes are returned, even if they don't have a mutation or occur in the self ligandome. Otherwise, return only novel epitopes."?

^{topiary/filters.py, line 158 [r1] (raw file):} Yeah, just pulling this out into its own testable chunks (and thus need to add tests)

^{topiary/mutant_epitope_predictor.py, line 30 [r1] (raw file):} I agree that it's ugly, not sure how to keep this data together otherwise though. I guess I could have something like ProteinSubsequence = namedtuple("ProteinSubsequence", "original_protein_sequence start end"). Would that be better?

^{topiary/mutant_epitope_predictor.py, line 63 [r1] (raw file):} I think that maybe

^{topiary/mutant_epitope_predictor.py, line 176 [r1] (raw file):} Just code rearrangement. Conditionally defining a function and then using it seemed needlessly confusing.

^{topiary/mutant_epitope_predictor.py, line 182 [r1] (raw file):} If you want expression-based transcript selection, you have to pass in something relating to transcript expression. I guess I can make that argument more prominent and remove its default value.

^{topiary/mutant_epitope_predictor.py, line 212 [r1] (raw file):} Renamed

^{topiary/mutant_epitope_predictor.py, line 226 [r1] (raw file):} The update is to remap the source_sequence and offsets onto the full protein. But maybe I can skip that? I don't know if the full protein sequence is actually necessary, I can possibly just keep the positions in the protein sequence in case someone needs to reconstruct it later.

^{topiary/mutant_epitope_predictor.py, line 232 [r1] (raw file):} Good idea, I'm going to add them later to keep this PR manageable: https://github.com/hammerlab/topiary/issues/34

^{topiary/mutant_epitope_predictor.py, line 288 [r1] (raw file):} They used to be one function which seemed cleaner when split apart, I can join them back together if you think that's better.

Comments from the review on Reviewable.io

tavinathanson commented 9 years ago

Review status: 17 of 18 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.

^{test/data/tiny_test_ligandome_dir/A0201, line 0 [r1] (raw file):} I meant subdirectories, but files too? Maybe 1 subdir with 1 file and another subdir with 2 files?

^{topiary/epitope_prediction.py, line 0 [r1] (raw file):} I don't think it's obvious that EpitopePrediction is related to BindingPrediction. But I won't push.

^{topiary/epitope_prediction.py, line 0 [r1] (raw file):} I'd agree with breaking away from namedtuple.

^{topiary/mutant_epitope_predictor.py, line 0 [r1] (raw file):} Unclear. Just seemed like a lot of code/testing duplication. You can your judgement on this one.

^{topiary/mutant_epitope_predictor.py, line 0 [r1] (raw file):} I think that makes sense :)

^{topiary/mutant_epitope_predictor.py, line 0 [r1] (raw file):} :+1:

^{topiary/mutant_epitope_predictor.py, line 0 [r1] (raw file):} Offline discussion: we talked about what the default should be, since I was confused about why we were no longer dropping things. Alex clarified that's the point of the EpitopePrediction class; once they get everything, they can filter it down as necessary.

^{topiary/mutant_epitope_predictor.py, line 0 [r1] (raw file):} Not entirely sure what you mean; I'm happy as long as we don't change any of the variables already defined/set.

Comments from the review on Reviewable.io

tavinathanson commented 9 years ago

Reviewed 1 of 1 files at r2. Review status: all files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.

Comments from the review on Reviewable.io

iskandr commented 9 years ago

Review status: 5 of 27 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.

^{test/data/tiny_test_ligandome_dir/A0201, line 1 [r1] (raw file):} I just made one with two files, not sure what testing multiple subdirectories adds.

Comments from the review on Reviewable.io

tavinathanson commented 9 years ago

Awesome! I'm digging all the tests.

Reviewed 21 of 25 files at r3, 3 of 3 files at r4. Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.

^{setup.py, line 58 [r4] (raw file):} Should require the latest mhctools since your test breaks without it.

^{test/test_epitope_filters.py, line 4 [r3] (raw file):} I guess this is easy enough that using mock wouldn't add anything?

^{test/test_epitopes_from_commandline_args.py, line 10 [r3] (raw file):} Given our issues of late, probably always good to throw in more than 1 allele?

^{topiary/epitope_prediction.py, line 26 [r4] (raw file):} Finish your sentence?

^{topiary/epitope_prediction.py, line 28 [r4] (raw file):} As discussed, these names should be worked on. (And the structure of the fields, too.)

Comments from the review on Reviewable.io

iskandr commented 9 years ago

Yeah, also feels nice to break up monolithic piles of logic into smaller testable pieces. Good to merge?

Review status: 23 of 28 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.

^{setup.py, line 58 [r4] (raw file):} Yeah, changed it to 0.1.8

^{test/test_epitope_filters.py, line 4 [r3] (raw file):} I can't imagine anything that would add, but maybe I'm being unimaginative :-)

^{test/test_epitopes_from_commandline_args.py, line 10 [r3] (raw file):} Add an HLA-C allele

Comments from the review on Reviewable.io

tavinathanson commented 9 years ago

Merge away!

Reviewed 5 of 5 files at r5. Review status: :shipit: all files reviewed at latest revision, all discussions resolved, all commit checks successful.

Comments from the review on Reviewable.io

JPFinnigan commented 9 years ago

Hey Guys,

Question re: this branch. If I include the --only-novel-epitopes flag do I need to maintain a local self-ligandome file, or are the self-ligandomes for each allele packaged w/ the topiary distribution? If it is the former case, when including --only-novel-epitopes how should I point topiary to the correct self-ligandome file. If it is the latter, does topiary reference the appropriate self-ligandome automatically?

As always, thanks for all of your help!

iskandr commented 9 years ago

Hey @JPFinnigan, I should have asked this before changing the commandline flags, but what do you prefer, this new behavior (keep all epitopes unless give --only-novel-epitopes) or what we had before (drop non-mutant or self epitopes unless given--keep-wildtype-epitopes`)?

Also, Topiary doesn't come with a self-ligandome since ideally it would be matched to whatever binding prediction tool you're also using for mutant peptide-MHC binding. You can optionally supply one (like the NetMHCpan dataset Kipp generated) via --wildtype-ligandome-directory.

JPFinnigan commented 9 years ago

@iskandr Good question.

I can think of reasons to support both default behaviors. I imagine that the typical topiary user will want use the software to identify neoantigens, and in that setting I would expect to drop all non-mutation/self epitopes by default. However, given that the complete filtering the user may want from --only-novel-epitopes (e.g removing epitopes not containing the variant residue, and removing epitopes found elsewhere in the self-ligandome) requires additional input in the form of a --wildtype-ligandome-directory (w/h some users may not have access to), I think the new behavior also makes sense and may in-fact actually be the preferable default behavior.

openvax / topiary

Added filtering by self ligandome, minor refactoring #33