Closed ireneisdoomed closed 2 years ago
This has now been fixed in https://github.com/opentargets/evidence_datasource_parsers/pull/122. Metrics of invalid evidence due to unresolved diseases have improved for those sources where OnToma is used.
In addition to the null handling explained above, there was another problem, which was the main cause of the missing mappings. Spark's join operator is not null safe by default, and most of the time `diseaseFromSourceId` will be null. To get a successful mapping regardless of a null disease ID, `eqNullSafe` is used as the comparator between columns.
`eqNullSafe` is a special null-safe equality operator, used here to join the two dataframes.
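To make the difference concrete, here is a minimal pure-Python model of the two comparison semantics. It is only a sketch and does not use Spark: `sql_eq`, `null_safe_eq`, `inner_join` and the `evidenceId` field are hypothetical names made up for illustration, and the rows are toy data built from the disease string and mapping mentioned in this issue. In PySpark, the null-safe comparison is available as `Column.eqNullSafe`.

```python
from typing import Optional


def sql_eq(a, b) -> Optional[bool]:
    """Model of SQL `=`: any comparison involving NULL yields NULL (None)."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a, b) -> bool:
    """Model of null-safe equality: NULL matches NULL, NULL never matches a value."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def inner_join(left, right, keys, eq):
    """Nested-loop inner join keeping row pairs where `eq` is True on every key."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(eq(l[k], r[k]) is True for k in keys)
    ]


# Toy rows: the disease ID is null on both sides, as is common in practice.
evidence = [
    {
        "evidenceId": "e1",  # hypothetical field, for illustration only
        "diseaseFromSource": "EPILEPTIC ENCEPHALOPATHY",
        "diseaseFromSourceId": None,
    }
]
mappings = [
    {
        "diseaseFromSource": "EPILEPTIC ENCEPHALOPATHY",
        "diseaseFromSourceId": None,
        "diseaseFromSourceMappedId": "HP_0200134",
    }
]
keys = ("diseaseFromSource", "diseaseFromSourceId")

plain = inner_join(evidence, mappings, keys, sql_eq)       # null ID never matches
safe = inner_join(evidence, mappings, keys, null_safe_eq)  # null ID matches null ID

print(len(plain), len(safe))
```

With the plain comparison the row is dropped because the null ID never compares as equal, whereas the null-safe comparison keeps it and carries the mapped ID along.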
Describe the bug
There is a problem with how the mapped disease is added to the full evidence dataframe. As a result, evidence containing mappable diseases ends up with a null value in the 'diseaseFromSourceMappedId' field and is ultimately lost.
Observed behaviour
We have 15 pieces of evidence of association from G2P that involve `EPILEPTIC ENCEPHALOPATHY` and do not make it to the Platform because the disease mapping field is null. In this case, the disease mapping comes from running OnToma. When you run OnToma on that string, you get a successful result.
However, this mapping does not make it to the evidence containing that disease, so something is wrong in the implementation that adds the disease mapping to an evidence dataframe.
Investigation
OnToma has two approaches to map a disease: 1) by querying the disease term, 2) by querying an ontology term.
Disease mapping is built into the evidence df by the common function add_efo_mapping.
The function works by extracting `diseaseFromSource` and `diseaseFromSourceId`, running OnToma on them (using Pandas's `apply`), making a df with the OnToma result, and joining this df with the full evidence set on `diseaseFromSource` and `diseaseFromSourceId`.
The source of the problem is the step where both dfs are joined. OnToma is run on a Pandas df, and in the conversion between Spark and Pandas dfs null values get altered, so that when we join the data back, nulls in `evidence_strings` are not represented in the same way as nulls in `disease_info_df`.
Expected behaviour
For the 15 above-mentioned G2P evidence strings, I expect them to be mapped to `HP_0200134` and, therefore, to be able to see their reported association on our GRIN2B/Epileptic encephalopathy page.
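The null mismatch described in the investigation can be sketched in plain Python. This is only an illustration under an assumption the issue does not confirm: that the missing ID is `None` on one side and a float `NaN` on the other after the round trip through Pandas. `normalize_null` and `join_key` are hypothetical helpers, not part of the parsers.

```python
import math


def normalize_null(value):
    """Collapse None and float NaN into a single None sentinel."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def join_key(row):
    """Build the two-column join key with a consistent null representation."""
    return (
        normalize_null(row["diseaseFromSource"]),
        normalize_null(row["diseaseFromSourceId"]),
    )


# Assumed representations: None in the evidence, NaN in the OnToma result.
evidence_strings = [
    {"diseaseFromSource": "EPILEPTIC ENCEPHALOPATHY", "diseaseFromSourceId": None}
]
disease_info = [
    {
        "diseaseFromSource": "EPILEPTIC ENCEPHALOPATHY",
        "diseaseFromSourceId": float("nan"),
        "diseaseFromSourceMappedId": "HP_0200134",
    }
]

# Lookup keyed on the raw values: the None key does not find the NaN key.
raw_lookup = {
    (r["diseaseFromSource"], r["diseaseFromSourceId"]): r["diseaseFromSourceMappedId"]
    for r in disease_info
}
raw_hits = [
    raw_lookup.get((e["diseaseFromSource"], e["diseaseFromSourceId"]))
    for e in evidence_strings
]

# Lookup keyed on normalized values: both nulls collapse to None and match.
clean_lookup = {join_key(r): r["diseaseFromSourceMappedId"] for r in disease_info}
clean_hits = [clean_lookup.get(join_key(e)) for e in evidence_strings]

print(raw_hits, clean_hits)
```

Normalizing both sides to one null representation before the join is an alternative to a null-safe comparison. Which of the two fits better depends on where the mismatch is actually introduced.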