Open jburos opened 7 years ago
Thanks @tavinathanson! This is really helpful.
Re: the isovar change, I wasn't seeing this with the neoantigen case but I did see it when in an example I was working with - which calculated a weighted sum of variants where the weight was based on the level of expression. If you're interested in seeing the specific function, it's been pushed to a development branch here.
I haven't gone back to look at how this might impact the filtering by expression -- I'm not sure it does, since there are "be careful with isovar the cache doesn't depend on parameters" messages throughout the code! -- but it does concern me that the cache may have been initialized with a partially-filtered set of variants, leading to underestimating expressed variants on subsequent calls with a more relaxed filter. If this cache is error-prone, we would probably want to restructure this function to only ever cache using an unfiltered set of variants & filter by variants later.
Definitely open to feedback since I want to make sure we get this right.
@jburos what do you mean by "that the cache may have been initialized with a partially-filtered set of variants"? Where does that happen?
@tavinathanson I don't think it would happen in our normal workflow, but we don't know what variants are passed into this function the first time it's called - it could be called directly by the user, with only indels
. In that case the indels-only would be cached & used on subsequent calls (e.g. for expressed_snv_count
).
Otherwise the only two places this is used within the code include load_neoantigens & the variant_expressed_filter
, neither of which passes a filtered set of variants to load_single_patient_isovar
.
@jburos gotcha, makes sense! Maybe we should put https://github.com/hammerlab/cohorts/blob/master/cohorts/cohort.py#L989 inside this function?
Yep, that's what I was thinking as well. Then there's always an option to filter by the variants the user provides after reading from the cache (analogous to the way other cached-items work).
But, would this mean you don't want to cache at all based on the variants or epitope-lengths given? Not really convinced we want to give up on that either (though, epitope-lengths could be done easily since they're numeric).
@jburos I think we can have both of those things. Caching on the isovar
results is useful in terms of time savings, for sure; and this does it in a better way. And loading in unfiltered variants addresses other potential bugs.
So, I tried to add a filter_by_variants
step to the load_single_patient_isovar
process (based on a portion of code from isovar.allele_reads
here), but on further thought it seems much cleaner to leave this logic within isovar
. Instead I now have a phantom _load_single_patient_isovar_unfiltered
function which isn't used anywhere. I will likely remove this.
Also ended up tweaking our logging code (per #246) so that I could confirm that the cache-files were indeed being re-used without problems. LMK if you have a strong opinion about this. I can revert it, just needed it temporarily so that I could actually see my debug messages.
@tavinathanson no rush, just FYI I'm done making updates on this PR for now.
Well, I did that at first -- then undid it. It doesn't seem right to me to repeat the logic that's in isovar -- better to keep it in isovar (ie if the logic changes in isovar, wouldn't we then just repeat it here?).
As it stands, we never have a use case for querying on all variants -- in every case we pull the variants in before calling this function -- so if it's confusing I'd prefer to just remove the _unfiltered
function & leave the current one as-is. What do you think -- does that make sense?
Added a variety of debug messages, also modified isovar cache so that the cached file varies according to the input variants provided. In the absence of this, I was seeing the same number of expressed variants irrespective of which filter_fn was applied.
Finally, captured the case documented in #247 when trying to predict neoantigens using
cohorts.functions.expressed_neoantigen_count
when no variants were expressed.