openvax / topiary

Predict mutated T-cell epitopes from sequencing data
Apache License 2.0
27 stars 9 forks source link

Added filtering by self ligandome, minor refactoring #33

Closed iskandr closed 9 years ago

iskandr commented 9 years ago

Review on Reviewable

tavinathanson commented 9 years ago

Reviewed 18 of 18 files at r1. Review status: 17 of 18 files reviewed at latest revision, 21 unresolved discussions, some commit checks failed.


.travis.yml, line 32 [r1] (raw file): Yeah, I finally concede that this is the better approach. Was going to change it myself!


test/data.py, line 41 [r1] (raw file): This name feels too general. (Confused me when reading the tests, as I couldn't figure out what variants was.) Maybe test_variants or cancer_test_variants? Or something? Same with the two names below?


test/data/tiny_test_ligandome_dir/A0201, line 1 [r1] (raw file): Maybe good to have at least two subdirectories to test that both are used?

Also, good to try a different allele format for the other subdirectory?


test/test_effect_expression_filters.py, line 52 [r1] (raw file): How about testing equal to threshold? <3 edge cases.


topiary/epitope_collection_helpers.py, line 90 [r1] (raw file): Again, I think the x is unhelpful for readability since it's not obvious what sort of class x is.


topiary/epitope_prediction.py, line 32 [r1] (raw file): Confusing to have one value that is actually a function of two other values in the tuple. What if this is just a normal class and mutant() is a method? Alternatively, are the former two necessary to store here?


topiary/epitope_prediction.py, line 35 [r1] (raw file): Given that epitope predictions do still bind, I think EpitopeBindingPrediction would be more clear.


topiary/filters.py, line 71 [r1] (raw file): Why isn't this called gene_id?


topiary/filters.py, line 90 [r1] (raw file): Seems like more code re-use is possible between this and the function above it.


topiary/filters.py, line 132 [r1] (raw file): Are these tested?


topiary/filters.py, line 155 [r1] (raw file): Was hard for me to parse the double negatives in this sentence. Maybe: "If False, drop epitope predictions that aren't neoepitopes. In other words: drop predictions for epitopes that are not mutated or are in the self-ligandome." Accurate?


topiary/filters.py, line 158 [r1] (raw file): Just a copy paste from before, right?


topiary/mutant_epitope_predictor.py, line 30 [r1] (raw file): Unchanged from the old extract_mutant_peptides? Also, a dict of triples without any classes feels likes it's starting to become fairly complex. Why not just have result[effect] = seq[seq_start_offset:seq_end_offset], which appears to be what the caller is doing? (I imagine there is a reason, so I'm mostly just curious.)

Edit: oh, I guess you do use seq and start_offset in other places. I don't like how pervasive the triple is, though. Seems prone to error/not good for readability. Thoughts?


topiary/mutant_epitope_predictor.py, line 63 [r1] (raw file): Sneaky; why?

Edit: oh, because you want to keep the self-ligandome matches? This feels unintuitive to me. Thoughts on a separate keep_self_ligandome_epitopes?


topiary/mutant_epitope_predictor.py, line 176 [r1] (raw file): Did you change any functionality here? Looks like just code re-arrangement?


topiary/mutant_epitope_predictor.py, line 182 [r1] (raw file): Given the defaults of this method, it seems pretty easy to accidentally get top_priority_effects when you actually want to use expression data by just forgetting to pass it in. Maybe you can remove the default? Or, other ideas? (High level comment: for software like this, I wonder how useful vs. dangerous some of these defaults really are.)


topiary/mutant_epitope_predictor.py, line 212 [r1] (raw file): x is not a helpful name! Took me a while to figure out what it was representing. binding_prediction?


topiary/mutant_epitope_predictor.py, line 226 [r1] (raw file): Ahh! This seems super confusing. I think we should always avoid overwriting fields for subclasses/extended classes, since I think some people are bound to reasonably assume that the fields with the same name are equivalent between related classes.

Can you just call the updated source_sequence and offset something new? e.g. epitope_source_sequence and epitope_offset.


topiary/mutant_epitope_predictor.py, line 232 [r1] (raw file): Is this tested? I'm always concerned about off-by-1s.


topiary/mutant_epitope_predictor.py, line 248 [r1] (raw file): epitope binding predictions?


topiary/mutant_epitope_predictor.py, line 288 [r1] (raw file): Just to make sure I understand: we've got epitopes_from_variants and epitopes_from_mutation_effects. The former uses the latter, though it pre-filters so we can do a lot less effect prediction. Each use slightly different logic for FPKM filtering, necessitating some duplication of code/tests, because the latter needs to get the variant of each effect while the former just uses the variant.

What are the use cases for people using epitopes_from_mutation_effects and not being able to just start with epitopes_from_variants?


Comments from the review on Reviewable.io

iskandr commented 9 years ago

Review status: 17 of 18 files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


test/data.py, line 41 [r1] (raw file): :+1:


test/data/tiny_test_ligandome_dir/A0201, line 1 [r1] (raw file): Subdirectories or files?


topiary/epitope_prediction.py, line 32 [r1] (raw file): I think it's important to know why the epitope was considered wildtype: because it didn't contain a mutant residue or because it did but that sequence occurs elsewhere in the ligandome. I was thinking of breaking away from using namedtuple here, but that leads to a bigger refactoring that's probably best saved for a later PR.


topiary/epitope_prediction.py, line 35 [r1] (raw file): I think that an epitope, by definition, has to bind MHC so that seems redundant.


topiary/filters.py, line 71 [r1] (raw file): Copy paste in action :-)


topiary/filters.py, line 90 [r1] (raw file): I feel the same way but then you end up with much more cryptic code like:

             expression_dict.get(feature_id, 0.0) > expression_threshold
                for feature_id in getattr(x, feature_name)),

which seemed possibly worse than just duplicating but keeping things clearer.


topiary/filters.py, line 132 [r1] (raw file): No, adding tests


topiary/filters.py, line 155 [r1] (raw file): I'll change this description a little. How about: "If True, then all predicted epitopes are returned, even if they don't have a mutation or occur in the self ligandome. Otherwise, return only novel epitopes."?


topiary/filters.py, line 158 [r1] (raw file): Yeah, just pulling this out into its own testable chunks (and thus need to add tests)


topiary/mutant_epitope_predictor.py, line 30 [r1] (raw file): I agree that it's ugly, not sure how to keep this data together otherwise though. I guess I could have something like ProteinSubsequence = namedtuple("ProteinSubsequence", "original_protein_sequence start end"). Would that be better?


topiary/mutant_epitope_predictor.py, line 63 [r1] (raw file): I think that maybe


topiary/mutant_epitope_predictor.py, line 176 [r1] (raw file): Just code rearrangement. Conditionally defining a function and then using it seemed needlessly confusing.


topiary/mutant_epitope_predictor.py, line 182 [r1] (raw file): If you want expression-based transcript selection, you have to pass in something relating to transcript expression. I guess I can make that argument more prominent and remove its default value.


topiary/mutant_epitope_predictor.py, line 212 [r1] (raw file): Renamed


topiary/mutant_epitope_predictor.py, line 226 [r1] (raw file): The update is to remap the source_sequence and offsets onto the full protein. But maybe I can skip that? I don't know if the full protein sequence is actually necessary, I can possibly just keep the positions in the protein sequence in case someone needs to reconstruct it later.


topiary/mutant_epitope_predictor.py, line 232 [r1] (raw file): Good idea, I'm going to add them later to keep this PR manageable: https://github.com/hammerlab/topiary/issues/34


topiary/mutant_epitope_predictor.py, line 288 [r1] (raw file): They used to be one function which seemed cleaner when split apart, I can join them back together if you think that's better.


Comments from the review on Reviewable.io

tavinathanson commented 9 years ago

Review status: 17 of 18 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


test/data/tiny_test_ligandome_dir/A0201, line 0 [r1] (raw file): I meant subdirectories, but files too? Maybe 1 subdir with 1 file and another subdir with 2 files?


topiary/epitope_prediction.py, line 0 [r1] (raw file): I don't think it's obvious that EpitopePrediction is related to BindingPrediction. But I won't push.


topiary/epitope_prediction.py, line 0 [r1] (raw file): I'd agree with breaking away from namedtuple.


topiary/mutant_epitope_predictor.py, line 0 [r1] (raw file): Unclear. Just seemed like a lot of code/testing duplication. You can your judgement on this one.


topiary/mutant_epitope_predictor.py, line 0 [r1] (raw file): I think that makes sense :)


topiary/mutant_epitope_predictor.py, line 0 [r1] (raw file): :+1:


topiary/mutant_epitope_predictor.py, line 0 [r1] (raw file): Offline discussion: we talked about what the default should be, since I was confused about why we were no longer dropping things. Alex clarified that's the point of the EpitopePrediction class; once they get everything, they can filter it down as necessary.


topiary/mutant_epitope_predictor.py, line 0 [r1] (raw file): Not entirely sure what you mean; I'm happy as long as we don't change any of the variables already defined/set.


Comments from the review on Reviewable.io

tavinathanson commented 9 years ago

Reviewed 1 of 1 files at r2. Review status: all files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


Comments from the review on Reviewable.io

iskandr commented 9 years ago

Review status: 5 of 27 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


test/data/tiny_test_ligandome_dir/A0201, line 1 [r1] (raw file): I just made one with two files, not sure what testing multiple subdirectories adds.


Comments from the review on Reviewable.io

tavinathanson commented 9 years ago

Awesome! I'm digging all the tests.


Reviewed 21 of 25 files at r3, 3 of 3 files at r4. Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


setup.py, line 58 [r4] (raw file): Should require the latest mhctools since your test breaks without it.


test/test_epitope_filters.py, line 4 [r3] (raw file): I guess this is easy enough that using mock wouldn't add anything?


test/test_epitopes_from_commandline_args.py, line 10 [r3] (raw file): Given our issues of late, probably always good to throw in more than 1 allele?


topiary/epitope_prediction.py, line 26 [r4] (raw file): Finish your sentence?


topiary/epitope_prediction.py, line 28 [r4] (raw file): As discussed, these names should be worked on. (And the structure of the fields, too.)


Comments from the review on Reviewable.io

iskandr commented 9 years ago

Yeah, also feels nice to break up monolithic piles of logic into smaller testable pieces. Good to merge?


Review status: 23 of 28 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


setup.py, line 58 [r4] (raw file): Yeah, changed it to 0.1.8


test/test_epitope_filters.py, line 4 [r3] (raw file): I can't imagine anything that would add, but maybe I'm being unimaginative :-)


test/test_epitopes_from_commandline_args.py, line 10 [r3] (raw file): Add an HLA-C allele


Comments from the review on Reviewable.io

tavinathanson commented 9 years ago

Merge away!


Reviewed 5 of 5 files at r5. Review status: :shipit: all files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from the review on Reviewable.io

JPFinnigan commented 9 years ago

Hey Guys,

Question re: this branch. If I include the --only-novel-epitopes flag do I need to maintain a local self-ligandome file, or are the self-ligandomes for each allele packaged w/ the topiary distribution? If it is the former case, when including --only-novel-epitopes how should I point topiary to the correct self-ligandome file. If it is the latter, does topiary reference the appropriate self-ligandome automatically?

As always, thanks for all of your help!

iskandr commented 9 years ago

Hey @JPFinnigan, I should have asked this before changing the commandline flags, but what do you prefer, this new behavior (keep all epitopes unless give --only-novel-epitopes) or what we had before (drop non-mutant or self epitopes unless given--keep-wildtype-epitopes`)?

Also, Topiary doesn't come with a self-ligandome since ideally it would be matched to whatever binding prediction tool you're also using for mutant peptide-MHC binding. You can optionally supply one (like the NetMHCpan dataset Kipp generated) via --wildtype-ligandome-directory.

JPFinnigan commented 9 years ago

@iskandr Good question.

I can think of reasons to support both default behaviors. I imagine that the typical topiary user will want use the software to identify neoantigens, and in that setting I would expect to drop all non-mutation/self epitopes by default. However, given that the complete filtering the user may want from --only-novel-epitopes (e.g removing epitopes not containing the variant residue, and removing epitopes found elsewhere in the self-ligandome) requires additional input in the form of a --wildtype-ligandome-directory (w/h some users may not have access to), I think the new behavior also makes sense and may in-fact actually be the preferable default behavior.