opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Parse AOPWiki source #1605

Closed ireneisdoomed closed 10 months ago

ireneisdoomed commented 3 years ago

After the work described in #1442, we need to parse the AOPWiki resource so that we can validate the proposed data schema and then integrate it into the new safetyLiabilities dataset.

AOPWiki is available as an XML file, downloadable from https://aopwiki.org/downloads

We want to build two different datasets: one for AOPs and another one for key events following the schema described in the spreadsheet: https://docs.google.com/spreadsheets/d/1OZidpR3kQIp1-FOU49PQQXgvSwL2kbM4GPTmvGqhPUU/edit#gid=61481225

The Source column lists the field that needs to be picked up to build each dataset. In many cases this will require using AOPWiki's own entity IDs as primary keys to retrieve the information.
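To illustrate the ID-based lookup, here is a minimal sketch using Python's standard library. The XML layout, the element names (`aop`, `key-event`, `key-events`) and the `key-event-id` attribute are hypothetical simplifications, not the real AOPWiki structure; the actual names have to be taken from the published schema. `parse_datasets` is an illustrative helper, not existing code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the AOPWiki dump. The real element
# and attribute names must be checked against the AOPWiki .xsd.
SAMPLE_XML = """
<data>
  <aop id="aop-1">
    <title>Example AOP</title>
    <key-events>
      <key-event key-event-id="ke-1"/>
      <key-event key-event-id="ke-2"/>
    </key-events>
  </aop>
  <key-event id="ke-1"><title>Example key event A</title></key-event>
  <key-event id="ke-2"><title>Example key event B</title></key-event>
</data>
"""


def parse_datasets(xml_text):
    """Build the two datasets (AOPs and key events) from one XML document."""
    root = ET.fromstring(xml_text)

    # Index the top-level key events by their own entity id so that
    # references found inside each AOP can be resolved with a lookup.
    key_events = {
        ke.get("id"): {"id": ke.get("id"), "title": ke.findtext("title")}
        for ke in root.findall("key-event")
    }

    aops = []
    for aop in root.findall("aop"):
        # Each AOP only carries references (ids) to its key events.
        refs = [ref.get("key-event-id") for ref in aop.findall("key-events/key-event")]
        aops.append(
            {
                "id": aop.get("id"),
                "title": aop.findtext("title"),
                # Resolve each reference, skipping ids missing from the index.
                "keyEvents": [key_events[r]["title"] for r in refs if r in key_events],
            }
        )
    return aops, list(key_events.values())


aops, key_events = parse_datasets(SAMPLE_XML)
```

The same pattern should carry over to other referenced entities: index them once by ID, then resolve the references while building each record.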

ireneisdoomed commented 3 years ago

AOPWiki schema in .xsd format: https://aopwiki.org/AOP-XML%20Schema%20definition.xsd

ireneisdoomed commented 3 years ago

The first iteration of both datasets looks great!

I've made some minor comments in the schema tracker to follow up on in future iterations.

One thing I've noted regarding the key events is that we have many cases of duplication. For instance:

kev1.filter(col('id') == 209).drop('references').show(truncate=False)
+---------------------------------------------------+---------------------------+---+-----------------+------------+-----------------+
|biologicalEvents                                   |biologicalOrganisationLevel|id |keyEventStressors|organTerm   |title            |
+---------------------------------------------------+---------------------------+---+-----------------+------------+-----------------+
|[{increased, {null, null, null}, oxidative stress}]|Molecular                  |209|null             |{null, null}|Peptide Oxidation|
|[{null, {null, null, null}, null}]                 |null                       |209|null             |{null, null}|null             |
|[{null, {null, null, null}, null}]                 |null                       |209|null             |{null, null}|null             |
|[{null, {null, null, null}, null}]                 |null                       |209|null             |{null, null}|null             |
+---------------------------------------------------+---------------------------+---+-----------------+------------+-----------------+

Only the first record matches the source page for this key event (https://aopwiki.org/events/209); the other three are empty apart from the id. Can you investigate this, @JarrodBaker?
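As a stopgap while the root cause is investigated, the empty duplicates could be dropped after parsing. The sketch below is plain Python rather than the Spark job, and `drop_empty_duplicates` is a hypothetical helper; it assumes that a populated `title` is enough to tell the real record from the empty ones, which holds for id 209 but has not been checked across the whole dataset.

```python
def drop_empty_duplicates(records, key="id", required=("title",)):
    """Keep one record per key, preferring the first one whose required
    fields are all populated."""
    best = {}
    for rec in records:
        k = rec[key]
        populated = all(rec.get(field) is not None for field in required)
        # Store the first record seen for a key; replace it only when the
        # stored one is empty and the new one is populated.
        if k not in best or (populated and not best[k][0]):
            best[k] = (populated, rec)
    return [rec for _, rec in best.values()]


# Rows mirroring the output shown above for key event 209.
rows = [
    {"id": 209, "title": "Peptide Oxidation", "biologicalOrganisationLevel": "Molecular"},
    {"id": 209, "title": None, "biologicalOrganisationLevel": None},
    {"id": 209, "title": None, "biologicalOrganisationLevel": None},
    {"id": 209, "title": None, "biologicalOrganisationLevel": None},
]
deduplicated = drop_empty_duplicates(rows)
```

This only hides the symptom. If the extra rows come from a join or an explode step in the parser, fixing that step would be the cleaner solution.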

JarrodBaker commented 3 years ago

@ireneisdoomed Is this ticket still relevant?

d0choa commented 3 years ago

Not for this release @JarrodBaker

ireneisdoomed commented 2 years ago

The maintenance and iteration of this data will be taken over by the data team.

This is the script @JarrodBaker wrote in the past: https://github.com/opentargets/platform-etl-backend/blob/ca524fd9b525cd5e5ae4ae7ce74d492ce6893f67/src/main/scala/io/opentargets/etl/backend/AdverseOutcomePathway.scala

We want to rewrite this in PySpark and continue the development by:

ireneisdoomed commented 2 years ago

A discrepancy has been identified between the XML dump and the correct data: https://aopwiki.org/forums/showthread.php?tid=189

I cannot proceed until this is fixed.

d0choa commented 1 year ago

Happy to close, @ireneisdoomed?

ireneisdoomed commented 10 months ago

Done