ireneisdoomed closed 10 months ago
AOPWiki schema in .xsd format: https://aopwiki.org/AOP-XML%20Schema%20definition.xsd
The first iteration of both datasets looks great!
gs://ot-team/jarrod/aopV1/
gs://ot-team/jarrod/keV1/
I've made some minor comments in the schema tracker to follow up on in future iterations.
One thing I've noticed in the key events dataset is that many records are duplicated. For instance:
kev1.filter(col('id') == 209).drop('references').show(truncate=False)
+---------------------------------------------------+---------------------------+---+-----------------+------------+-----------------+
|biologicalEvents                                   |biologicalOrganisationLevel|id |keyEventStressors|organTerm   |title            |
+---------------------------------------------------+---------------------------+---+-----------------+------------+-----------------+
|[{increased, {null, null, null}, oxidative stress}]|Molecular                  |209|null             |{null, null}|Peptide Oxidation|
|[{null, {null, null, null}, null}]                 |null                       |209|null             |{null, null}|null             |
|[{null, {null, null, null}, null}]                 |null                       |209|null             |{null, null}|null             |
|[{null, {null, null, null}, null}]                 |null                       |209|null             |{null, null}|null             |
+---------------------------------------------------+---------------------------+---+-----------------+------------+-----------------+
Only the first record matches the source page for this key event (https://aopwiki.org/events/209); the other three carry the same id but are otherwise empty. Can you investigate this, @JarrodBaker?
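To get a feel for how widespread this is, a quick check along these lines might help. It is only a sketch in plain Python, not the ETL code: the rows are a hand-copied subset of the output above, and `is_empty` and `summarise_duplicates` are hypothetical helper names. It groups rows by `id` and counts how many of the repeated rows are all-null placeholders.

```python
from collections import defaultdict

# Hand-copied subset of the rows shown above (only two columns kept).
rows = [
    {"id": 209, "title": "Peptide Oxidation", "biologicalOrganisationLevel": "Molecular"},
    {"id": 209, "title": None, "biologicalOrganisationLevel": None},
    {"id": 209, "title": None, "biologicalOrganisationLevel": None},
    {"id": 209, "title": None, "biologicalOrganisationLevel": None},
]


def is_empty(row):
    """True when every field other than the id is null."""
    return all(value is None for key, value in row.items() if key != "id")


def summarise_duplicates(rows):
    """Group rows by id and report only the ids that occur more than once."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["id"]].append(row)
    return {
        key: {"total": len(group), "empty": sum(is_empty(r) for r in group)}
        for key, group in grouped.items()
        if len(group) > 1
    }


report = summarise_duplicates(rows)
print(report)
```

If the empty rows turn out to account for all the duplicates, dropping rows where every non-id field is null would be one possible fix, though that would hide the parsing problem rather than explain it.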
@ireneisdoomed Is this ticket still relevant?
Not for this release @JarrodBaker
The maintenance and iteration of this data will be taken over by the data team.
This is the script @JarrodBaker wrote in the past: https://github.com/opentargets/platform-etl-backend/blob/ca524fd9b525cd5e5ae4ae7ce74d492ce6893f67/src/main/scala/io/opentargets/etl/backend/AdverseOutcomePathway.scala
We want to rewrite this in PySpark and continue the development by:
A discrepancy has been identified between the XML dump and the correct data: https://aopwiki.org/forums/showthread.php?tid=189
I cannot proceed until this is fixed.
Happy to close, @ireneisdoomed?
Done
After the work described in #1442, we need to parse the AOPWiki resource so that we can validate the proposed data schema and then integrate it into the new safetyLiabilities dataset.
AOPWiki is available as an XML file, downloadable from https://aopwiki.org/downloads
We want to build two different datasets: one for AOPs and another one for key events following the schema described in the spreadsheet: https://docs.google.com/spreadsheets/d/1OZidpR3kQIp1-FOU49PQQXgvSwL2kbM4GPTmvGqhPUU/edit#gid=61481225
The Source column lists the field that needs to be picked up to build each dataset. In many cases this will require using AOPWiki's own entity ids as primary keys to retrieve the information.
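As a rough illustration of the id-based lookup, the sketch below parses a toy XML document with Python's standard `xml.etree.ElementTree`. The element and attribute names (`key-event`, `key-event-ref`, `aop`, the `ke-1` style ids) are assumptions made up for the example and are not taken from the AOPWiki schema, so they would need to be checked against the .xsd before reuse. It first indexes key events by id, then resolves the references inside each AOP.

```python
import xml.etree.ElementTree as ET

# Toy document: element names, attribute names and ids are illustrative
# assumptions, NOT the real AOPWiki schema.
TOY_XML = """
<data>
  <key-event id="ke-1"><title>Peptide Oxidation</title></key-event>
  <key-event id="ke-2"><title>Example downstream event</title></key-event>
  <aop id="aop-1">
    <title>Example pathway</title>
    <key-events>
      <key-event-ref id="ke-1"/>
      <key-event-ref id="ke-2"/>
    </key-events>
  </aop>
</data>
"""

root = ET.fromstring(TOY_XML)

# Key events dataset: one record per entity, indexed by its own id.
key_events = {
    ke.get("id"): {"id": ke.get("id"), "title": ke.findtext("title")}
    for ke in root.findall("key-event")
}

# AOP dataset: resolve each referenced id against the key event index.
aops = []
for aop in root.findall("aop"):
    ref_ids = [ref.get("id") for ref in aop.findall("key-events/key-event-ref")]
    aops.append(
        {
            "id": aop.get("id"),
            "title": aop.findtext("title"),
            "keyEvents": [key_events[i]["title"] for i in ref_ids if i in key_events],
        }
    )

print(aops)
```

Keeping the key event index separate from the AOP records mirrors the two-dataset split described above; references that do not resolve are skipped here, but in practice they are probably worth logging.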