opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Mapped disease is null when OnToma finds an accurate mapping #2556

Closed ireneisdoomed closed 2 years ago

ireneisdoomed commented 2 years ago

Describe the bug There is a problem with how the mapped disease is added to the full evidence dataframe. As a result, evidence with mappable diseases ends up with a null value in the 'diseaseFromSourceMappedId' field and is ultimately lost.

Observed behaviour We have 15 G2P evidence strings involving EPILEPTIC ENCEPHALOPATHY that do not make it to the Platform because the disease mapping field is null.

In this case, the disease mapping comes from running OnToma. Running OnToma on that string returns a successful result:

otmap.find_term('EPILEPTIC ENCEPHALOPATHY')
INFO     - ontoma.interface - Processed: EPILEPTIC ENCEPHALOPATHY → [OnTomaResult(query='EPILEPTIC ENCEPHALOPATHY', id_normalised='HP:0200134', id_ot_schema='HP_0200134', id_full_uri='http://purl.obolibrary.org/obo/HP_0200134', label='epileptic encephalopathy')]

However, this mapping does not reach the evidence containing that disease, so something is wrong in the implementation that adds the disease mapping to an evidence dataframe.

Investigation OnToma has two approaches to mapping a disease: 1) querying the disease term, 2) querying an ontology term.

Disease mapping is added to the evidence df by the common function add_efo_mapping.

The function extracts diseaseFromSource and diseaseFromSourceId, runs OnToma on them (using Pandas's apply), builds a df from the OnToma results, and joins this df with the full evidence set on diseaseFromSource and diseaseFromSourceId.
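The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code in add_efo_mapping: rows are dictionaries rather than Spark/Pandas dataframes, and `lookup_disease`, `add_mapping` and `KNOWN_MAPPINGS` are hypothetical names, with the lookup hard-coded to the one mapping reported earlier in this issue.

```python
# Hypothetical stand-in for the OnToma lookup; the single entry is the
# mapping reported earlier in this issue.
KNOWN_MAPPINGS = {"EPILEPTIC ENCEPHALOPATHY": "HP_0200134"}


def lookup_disease(name, source_id):
    """Return a mapped ID for a disease name, or None if nothing is found."""
    return KNOWN_MAPPINGS.get(name)


def add_mapping(evidence_rows):
    """Attach diseaseFromSourceMappedId to every evidence row."""
    # 1) Extract the distinct (diseaseFromSource, diseaseFromSourceId) pairs.
    pairs = {
        (row["diseaseFromSource"], row["diseaseFromSourceId"])
        for row in evidence_rows
    }
    # 2) Run the mapper once per distinct pair.
    disease_info = {pair: lookup_disease(*pair) for pair in pairs}
    # 3) Join the mapping back onto the evidence on both columns.
    return [
        {
            **row,
            "diseaseFromSourceMappedId": disease_info.get(
                (row["diseaseFromSource"], row["diseaseFromSourceId"])
            ),
        }
        for row in evidence_rows
    ]


evidence = [
    {"diseaseFromSource": "EPILEPTIC ENCEPHALOPATHY", "diseaseFromSourceId": None},
]
mapped = add_mapping(evidence)
```

In plain Python the key ('EPILEPTIC ENCEPHALOPATHY', None) matches itself, so the row picks up HP_0200134. Whether the same happens in the real pipeline depends on how each dataframe represents the missing ID and on how the join compares nulls.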

The source of the problem is the step where the two dfs are joined. OnToma is run on a Pandas df, and the conversion between Spark and Pandas dfs alters null values, so that when the data is joined back, nulls in evidence_strings are not represented the same way as nulls in disease_info_df.

disease_info_df.show()
+--------------------+-------------------+-------------------------+
|   diseaseFromSource|diseaseFromSourceId|diseaseFromSourceMappedId|
+--------------------+-------------------+-------------------------+
|EPILEPTIC ENCEPHA...|               None|               HP_0200134|
+--------------------+-------------------+-------------------------+

evidence_strings.select('diseaseFromSource', 'diseaseFromSourceId').distinct().show()
+--------------------+-------------------+
|   diseaseFromSource|diseaseFromSourceId|
+--------------------+-------------------+
|EPILEPTIC ENCEPHA...|               null|
+--------------------+-------------------+

Expected behaviour I expect the 15 G2P evidence strings mentioned above to be mapped to HP_0200134, so that their reported association is visible on our GRIN2B/Epileptic encephalopathy page.

ireneisdoomed commented 2 years ago

This has now been fixed in https://github.com/opentargets/evidence_datasource_parsers/pull/122. Metrics of invalid evidence due to unresolved disease have improved for the sources where OnToma is used.

There was an additional problem on top of the null handling explained above, and it was the main cause of the missing mappings. Spark's join operator is not null safe by default, and most of the time diseaseFromSourceId is null. To get a successful mapping regardless of a null disease ID, the two dataframes are now joined using eqNullSafe, a null-safe equality operator, as the comparator between columns.
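To illustrate the difference between the two comparisons, here is a minimal plain-Python emulation of their semantics. It does not use Spark: `sql_equals`, `null_safe_equals` and `left_join` are hypothetical helpers written for this sketch, and None stands in for a SQL null. In PySpark, the null-safe comparison is exposed as Column.eqNullSafe.

```python
def sql_equals(a, b):
    """Emulate SQL '=': the result is unknown (None) if either side is null."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_equals(a, b):
    """Emulate null-safe equality: two nulls are equal, the result is never null."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def left_join(evidence_rows, mapping_rows, keys, equals):
    """Left join keeping the first mapping row whose keys all compare as True."""
    joined = []
    for ev in evidence_rows:
        match = next(
            (
                m
                for m in mapping_rows
                if all(equals(ev[k], m[k]) is True for k in keys)
            ),
            None,
        )
        mapped_id = match["diseaseFromSourceMappedId"] if match else None
        joined.append({**ev, "diseaseFromSourceMappedId": mapped_id})
    return joined


keys = ["diseaseFromSource", "diseaseFromSourceId"]
evidence = [
    {"diseaseFromSource": "EPILEPTIC ENCEPHALOPATHY", "diseaseFromSourceId": None},
]
mapping = [{**evidence[0], "diseaseFromSourceMappedId": "HP_0200134"}]

plain = left_join(evidence, mapping, keys, sql_equals)
null_safe = left_join(evidence, mapping, keys, null_safe_equals)
```

With `sql_equals`, the comparison on the null diseaseFromSourceId is never true, so the row keeps a null mapped ID, which is the behaviour reported in this issue. With `null_safe_equals`, the same row is mapped to HP_0200134.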