hammerlab / cohorts

Utilities for analyzing mutations and neoepitopes in patient cohorts
Apache License 2.0
20 stars 4 forks source link

Small fixes to improve estimation of expressed neoantigens #249

Open jburos opened 7 years ago

jburos commented 7 years ago

Added a variety of debug messages, also modified isovar cache so that the cached file varies according to the input variants provided. In the absence of this, I was seeing the same number of expressed variants irrespective of which filter_fn was applied.

Finally, captured the case documented in #247 when trying to predict neoantigens using cohorts.functions.expressed_neoantigen_count when no variants were expressed.

jburos commented 7 years ago

Thanks @tavinathanson! This is really helpful.

Re: the isovar change, I wasn't seeing this with the neoantigen case but I did see it when in an example I was working with - which calculated a weighted sum of variants where the weight was based on the level of expression. If you're interested in seeing the specific function, it's been pushed to a development branch here.

I haven't gone back to look at how this might impact the filtering by expression -- I'm not sure it does, since there are "be careful with isovar the cache doesn't depend on parameters" messages throughout the code! -- but it does concern me that the cache may have been initialized with a partially-filtered set of variants, leading to underestimating expressed variants on subsequent calls with a more relaxed filter. If this cache is error-prone, we would probably want to restructure this function to only ever cache using an unfiltered set of variants & filter by variants later.

Definitely open to feedback since I want to make sure we get this right.

tavinathanson commented 7 years ago

@jburos what do you mean by "that the cache may have been initialized with a partially-filtered set of variants"? Where does that happen?

jburos commented 7 years ago

@tavinathanson I don't think it would happen in our normal workflow, but we don't know what variants are passed into this function the first time it's called - it could be called directly by the user, with only indels. In that case the indels-only would be cached & used on subsequent calls (e.g. for expressed_snv_count).

Otherwise the only two places this is used within the code include load_neoantigens & the variant_expressed_filter, neither of which passes a filtered set of variants to load_single_patient_isovar.

tavinathanson commented 7 years ago

@jburos gotcha, makes sense! Maybe we should put https://github.com/hammerlab/cohorts/blob/master/cohorts/cohort.py#L989 inside this function?

jburos commented 7 years ago

Yep, that's what I was thinking as well. Then there's always an option to filter by the variants the user provides after reading from the cache (analogous to the way other cached-items work).

But, would this mean you don't want to cache at all based on the variants or epitope-lengths given? Not really convinced we want to give up on that either (though, epitope-lengths could be done easily since they're numeric).

tavinathanson commented 7 years ago

@jburos I think we can have both of those things. Caching on the isovar results is useful in terms of time savings, for sure; and this does it in a better way. And loading in unfiltered variants addresses other potential bugs.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-0.4%) to 52.164% when pulling d0f8927cb9e6aec5d62dc15d84d72c77ee94cc43 on fix-issue-247 into ea31881381e592cd26f6125a4723ccd74796ec0e on master.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-0.4%) to 52.133% when pulling d0f8927cb9e6aec5d62dc15d84d72c77ee94cc43 on fix-issue-247 into ea31881381e592cd26f6125a4723ccd74796ec0e on master.

jburos commented 7 years ago

So, I tried to add a filter_by_variants step to the load_single_patient_isovar process (based on a portion of code from isovar.allele_reads here), but on further thought it seems much cleaner to leave this logic within isovar. Instead I now have a phantom _load_single_patient_isovar_unfiltered function which isn't used anywhere. I will likely remove this.

Also ended up tweaking our logging code (per #246) so that I could confirm that the cache-files were indeed being re-used without problems. LMK if you have a strong opinion about this. I can revert it, just needed it temporarily so that I could actually see my debug messages.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-1.04%) to 51.492% when pulling b261796070a102cb08fd4bef069f17d7113ba685 on fix-issue-247 into ea31881381e592cd26f6125a4723ccd74796ec0e on master.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-1.04%) to 51.492% when pulling b261796070a102cb08fd4bef069f17d7113ba685 on fix-issue-247 into ea31881381e592cd26f6125a4723ccd74796ec0e on master.

jburos commented 7 years ago

@tavinathanson no rush, just FYI I'm done making updates on this PR for now.

jburos commented 7 years ago

Well, I did that at first -- then undid it. It doesn't seem right to me to repeat the logic that's in isovar -- better to keep it in isovar (ie if the logic changes in isovar, wouldn't we then just repeat it here?).

As it stands, we never have a use case for querying on all variants -- in every case we pull the variants in before calling this function -- so if it's confusing I'd prefer to just remove the _unfiltered function & leave the current one as-is. What do you think -- does that make sense?

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-0.6%) to 51.889% when pulling d31469a45adb7c82ba7482786e15f9d6c4c6e9a6 on fix-issue-247 into ea31881381e592cd26f6125a4723ccd74796ec0e on master.

coveralls commented 7 years ago

Coverage Status

Coverage decreased (-0.6%) to 52.069% when pulling 18808ca9b4ecc77ccb33445e003cd8e3f8957815 on fix-issue-247 into f9f4af2384760a656acadbfd74c5ccd8fade522a on master.