Mapped disease is null when OnToma finds an accurate mapping

Describe the bug There is a problem with how the mapped disease is added to the full evidence dataframe. As a result, evidence whose disease is mappable still ends up with a null value in the 'diseaseFromSourceMappedId' field, and is ultimately lost.

Observed behaviour We have 15 evidence strings from G2P involving EPILEPTIC ENCEPHALOPATHY that do not make it to the Platform because the disease mapping field is null.

In this case, the disease mapping comes from running OnToma. Running OnToma on that string returns a successful result:

otmap.find_term('EPILEPTIC ENCEPHALOPATHY')
INFO     - ontoma.interface - Processed: EPILEPTIC ENCEPHALOPATHY → [OnTomaResult(query='EPILEPTIC ENCEPHALOPATHY', id_normalised='HP:0200134', id_ot_schema='HP_0200134', id_full_uri='http://purl.obolibrary.org/obo/HP_0200134', label='epileptic encephalopathy')]

However, this mapping does not make it to the evidence containing that disease, so something is wrong in the implementation that adds the disease mapping to the evidence dataframe.

Investigation OnToma has 2 approaches to map a disease: 1) by querying the disease term, 2) by querying an ontology term.

The disease mapping is added to the evidence df by the common function add_efo_mapping.

The function extracts diseaseFromSource and diseaseFromSourceId, runs OnToma on them (using Pandas's apply), builds a df from the OnToma results, and joins this df with the full evidence set on diseaseFromSource and diseaseFromSourceId.
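The steps above can be sketched in plain Python to make the flow easier to follow. This is not the actual add_efo_mapping implementation: rows are modelled as dictionaries instead of Spark/Pandas dfs, and `add_mapping` and `toy_mapper` are hypothetical names, with `toy_mapper` standing in for OnToma.

```python
from typing import Callable, Optional

Row = dict[str, Optional[str]]
KEYS = ("diseaseFromSource", "diseaseFromSourceId")
Mapper = Callable[[Optional[str], Optional[str]], Optional[str]]


def add_mapping(rows: list[Row], mapper: Mapper) -> list[Row]:
    """Hypothetical stand-in for the extract / map / join-back flow."""
    # 1) Extract the distinct (name, id) pairs so each disease is mapped once.
    pairs = {tuple(row[k] for k in KEYS) for row in rows}
    # 2) Run the mapper on every pair (the role OnToma plays in the pipeline).
    mapped = {pair: mapper(*pair) for pair in pairs}
    # 3) Join the mapping back onto the full evidence set on the same two keys.
    return [
        {**row, "diseaseFromSourceMappedId": mapped[tuple(row[k] for k in KEYS)]}
        for row in rows
    ]


def toy_mapper(name: Optional[str], source_id: Optional[str]) -> Optional[str]:
    # Assumption: a one-entry lookup table instead of a real OnToma query.
    return {"EPILEPTIC ENCEPHALOPATHY": "HP_0200134"}.get(name)


evidence = [
    {"diseaseFromSource": "EPILEPTIC ENCEPHALOPATHY", "diseaseFromSourceId": None},
    {"diseaseFromSource": "EPILEPTIC ENCEPHALOPATHY", "diseaseFromSourceId": None},
]
result = add_mapping(evidence, toy_mapper)
```

In this sketch the join keys never leave Python, so the lookup in step 3 always succeeds. The real function moves the keys through a Spark/Pandas conversion between steps 1 and 3, which is where the keys can stop matching.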

The source of the problem is the step where both dfs are joined. OnToma is run on a Pandas df, and the conversion between Spark and Pandas dfs alters null values in such a way that, when we join the data back, nulls in evidence_strings are not represented the same way as nulls in disease_info_df.

disease_info_df.show()
+--------------------+-------------------+-------------------------+
|   diseaseFromSource|diseaseFromSourceId|diseaseFromSourceMappedId|
+--------------------+-------------------+-------------------------+
|EPILEPTIC ENCEPHA...|               None|               HP_0200134|
+--------------------+-------------------+-------------------------+

evidence_strings.select('diseaseFromSource', 'diseaseFromSourceId').distinct().show()
+--------------------+-------------------+
|   diseaseFromSource|diseaseFromSourceId|
+--------------------+-------------------+
|EPILEPTIC ENCEPHA...|               null|
+--------------------+-------------------+
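The mismatch between the two outputs can be illustrated with a small plain-Python sketch. It assumes that the round trip turned the missing ID into the literal string "None", which is what the `show()` output suggests but which has not been confirmed here. `normalise` and `NULL_LIKE` are hypothetical names, not part of the codebase.

```python
from typing import Optional

# Assumption: string values that should be treated as a missing ID.
NULL_LIKE = {"None", "nan", "NaN", ""}


def normalise(value: Optional[str]) -> Optional[str]:
    # Turn stringified nulls back into real nulls; leave other values as-is.
    return None if value is None or value in NULL_LIKE else value


# Mapping table after the round trip: the missing ID is the string "None".
disease_info = {("EPILEPTIC ENCEPHALOPATHY", "None"): "HP_0200134"}
# Key taken from the evidence: the missing ID is a real null.
evidence_key = ("EPILEPTIC ENCEPHALOPATHY", None)

# The naive lookup finds nothing because "None" != None.
naive = disease_info.get(evidence_key)

# Normalising the keys of the mapping table makes both sides comparable.
clean = {tuple(normalise(v) for v in key): mapped for key, mapped in disease_info.items()}
fixed = clean.get(evidence_key)
```

Unlike this dictionary lookup, a Spark equality join does not match null with null, so normalising the values would not be enough on its own. The real join would also need a null-safe comparison, for example with `Column.eqNullSafe`.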

Expected behaviour I expect the 15 G2P evidence strings mentioned above to be mapped to HP_0200134, and therefore to see their reported association on our GRIN2B/Epileptic encephalopathy page.

Mapped disease is null when OnToma finds an accurate mapping #2556