ohsu-comp-bio / g2p-aggregator

Associations of genomic features, drugs and diseases

jax trials feature harmonization #78

Closed: bwalsh closed this issue 6 years ago

bwalsh commented 6 years ago

Following up on our conversation re. jax trials.

Observations:

I've checked in a test to illustrate the problem: harvester/tests/integration/test_jax_trials_features.py

I've validated that the MM variant endpoint does not return a false hit.

There is a chain of potential challenges and fixes:

To run: $ pytest -s tests/integration/test_jax_trials_features.py::test_profiles

profile >MET alterations< gene_index[i] >MET< mut_index[i] >< matches 418
profile >MET positive< gene_index[i] >MET< mut_index[i] >< matches 418
profile >MET amp< gene_index[i] >MET< mut_index[i] >< matches 418

jgoecks commented 6 years ago

The simplest approach is to bypass any normalization attempts if _parse_profile does not return any mutations. The key issue is that MM's API appears to return all mutations in a gene when given something like "MET amp", which is confusing and problematic in our case.
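
A minimal sketch of that bypass, for illustration only. The real `_parse_profile` signature and return shape are not shown in this thread, so `parse_profile`, `normalize_profile`, the `lookup` callable and the protein-change pattern below are all hypothetical stand-ins, not the harvester's code.

```python
import re

# Hypothetical pattern for a simple protein change such as "V600E".
# The real harvester may recognise many more forms; this is an assumption.
PROTEIN_CHANGE = re.compile(r'[A-Z]\d+[A-Z*]')


def parse_profile(profile):
    """Hypothetical stand-in for _parse_profile.

    Splits "GENE rest" into (gene, mutation). The mutation is '' when the
    remainder does not look like a protein change.
    """
    parts = profile.split(None, 1)
    gene = parts[0] if parts else ''
    rest = parts[1].strip() if len(parts) > 1 else ''
    mutation = rest if PROTEIN_CHANGE.fullmatch(rest) else ''
    return gene, mutation


def normalize_profile(profile, lookup):
    """Call lookup(gene, mutation) only when a mutation was parsed."""
    gene, mutation = parse_profile(profile)
    if not mutation:
        # Nothing variant-level to look up: bypass normalization entirely.
        return []
    return lookup(gene, mutation)
```

With this shape, profiles such as "MET amp" or "MET positive" never reach the lookup, while "BRAF V600E" is passed through as before.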

bwalsh commented 6 years ago

@jgoecks

Quick clarification: MM does properly return no mutations. As noted above, I've validated that the MM variant endpoint does not return a false hit.

I've made the following change:

+++ b/harvester/cosmic_lookup_table.py
@@ -39,6 +39,9 @@ class CosmicLookup(object):
             # return null
             logging.warning('get_entries gene: %s, hgvs_p: %s', gene, hgvs_p)
             return []
+        # ensure caller passed a hgvs_p
+        if not hgvs_p or len(hgvs_p) == 0:
+            return []
         # Get lookup table.
         if gene in self.gene_df_cache:
             # Found gene-filtered lookup table in cache.
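
To illustrate why the guard in the diff above matters, here is a toy sketch. The thread does not show how the lookup table is actually filtered, so the substring match below is an assumption chosen only to reproduce the symptom (an empty `hgvs_p` matching every row for the gene). `ROWS`, `get_entries_unguarded` and `get_entries_guarded` are hypothetical names, and the rows are illustrative placeholders, not COSMIC data.

```python
# Toy rows standing in for a lookup table (illustrative placeholders only).
ROWS = [
    {'gene': 'MET', 'hgvs_p': 'p.T1010I'},
    {'gene': 'MET', 'hgvs_p': 'p.Y1253D'},
    {'gene': 'BRAF', 'hgvs_p': 'p.V600E'},
]


def get_entries_unguarded(gene, hgvs_p):
    # Assumed substring match: '' is a substring of every string, so an
    # empty hgvs_p selects every row for the gene.
    return [r for r in ROWS if r['gene'] == gene and hgvs_p in r['hgvs_p']]


def get_entries_guarded(gene, hgvs_p):
    # Same idea as the guard in the diff: no hgvs_p means no lookup.
    if not hgvs_p:
        return []
    return get_entries_unguarded(gene, hgvs_p)
```

Note that `not hgvs_p` is already true for both `None` and an empty string, so the extra `len(hgvs_p) == 0` check in the diff is harmless but redundant.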

After this change, feature_normalization for jax and jax_trials went from 85% and 93% to 43% and 11%, respectively.

jgoecks commented 6 years ago

Ah, so the issue was that the COSMIC lookup table was returning all mutations, not MM. Your change looks reasonable. Thanks!

grmayfie commented 6 years ago

@bwalsh When I run this test I see:

profile >MET alterations< gene_index[i] >MET< mut_index[i] >< matches 0
profile >MET positive< gene_index[i] >MET< mut_index[i] >< matches 0
profile >MET amp< gene_index[i] >MET< mut_index[i] >< matches 0

And then the test fails with:

E       AssertionError: assert 3 == 0
E        +  where 3 = len([{'biomarker_type': 'mutant', 'geneSymbol': 'MET', 'name': 'MET  '}, {'biomarker_type': 'polymorphism', 'geneSymbol': 'MET', 'name': 'MET  positive'}, {'biomarker_type': 'polymorphism', 'geneSymbol': 'MET', 'name': 'MET  amp'}])

Based on the way this test is written, I think it really should be 3. Why did we have 0 before?

bwalsh commented 6 years ago

Thanks, will check tomorrow.