opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

PIS transformation extraction: `so` step #3499

Closed javfg closed 1 month ago

javfg commented 2 months ago

Description

The `so` step downloads the Sequence Ontology (SO).

Transformations PIS was doing

PIS downloaded the OWL version and used Apache Jena to convert it to JSONL. A JSON version is also published, so we can use that one instead, which avoids Jena and lets us transform the file into JSONL directly.

The jq query PIS used to convert Jena's output into JSONL is:

'.["@graph"][] | select (.["@type"] == "owl:Class" and .id != null)|.subClassOf=([.subClassOf]|flatten|del(.[]|nulls))|.hasExactSynonym=([.hasExactSynonym]|flatten|del(.[]|nulls))|.hasDbXref=([.hasDbXref]|flatten|del(.[]|nulls))|.inSubset=([.inSubset]|flatten|del(.[]|nulls))|.hasAlternativeId=([.hasAlternativeId]|flatten|del(.[]|nulls))|@json'

However, this query must be adapted for so.json, because its structure differs from Jena's output.
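For reference, here is a minimal Python sketch of what the jq filter above does, in case it helps when porting it. It is not the PIS code: the function names `as_flat_list` and `jena_graph_to_jsonl` and the `LIST_FIELDS` constant are made up for illustration, and it assumes the Jena output has already been parsed into a dict.

```python
import json

# Fields the jq filter normalises into flat, null-free lists.
LIST_FIELDS = (
    "subClassOf",
    "hasExactSynonym",
    "hasDbXref",
    "inSubset",
    "hasAlternativeId",
)


def as_flat_list(value):
    """Mirror jq's `[x] | flatten | del(.[] | nulls)` for a single value."""
    if value is None:
        return []
    if isinstance(value, list):
        # Flatten nested lists recursively and drop nulls on the way.
        return [item for element in value for item in as_flat_list(element)]
    return [value]


def jena_graph_to_jsonl(doc):
    """Yield one JSON string per owl:Class node that has a non-null id."""
    for node in doc.get("@graph", []):
        if node.get("@type") != "owl:Class" or node.get("id") is None:
            continue
        row = dict(node)
        for field in LIST_FIELDS:
            # Missing fields become [], scalars become one-element lists.
            row[field] = as_flat_list(node.get(field))
        yield json.dumps(row)
```

Each yielded string is one line of the JSONL output. A scalar `subClassOf` becomes a one-element list and a missing field becomes an empty list, which is the normalisation the jq filter performs.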

Tasks

jdhayhurst commented 1 month ago

While doing this, could we please unify the output with the rest of the ETL, i.e. Parquet output in the ETL output path? Currently POS loads so.json directly from the input folder, which couples POS to PIS; ideally POS should handle ETL outputs only.

remo87 commented 1 month ago

The JSON files are exposed in a GitHub repo.

remo87 commented 1 month ago

The JSON file needs to be mapped, because its structure is different from the OWL file. I'll use the extractor to fetch 3 different inputs and create a new step in the ETL that processes those inputs and produces the SO in the correct format.

remo87 commented 1 month ago

The extractor changes are done, and the mappings are in progress.

remo87 commented 1 month ago

After discussing with @d0choa, the mapping has been simplified: the output now contains only id and label, which are the only fields exposed by the API.
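A rough Python sketch of that simplified id/label projection, for illustration only. It is not the ETL step itself. It assumes so.json follows the OBO Graphs layout (`graphs[].nodes[]` with `id`, `lbl` and `type` keys), which this thread does not confirm, and the function name `project_id_label` is hypothetical. Ids are passed through unchanged.

```python
import json


def project_id_label(doc):
    """Keep only id and label for class nodes.

    Assumes an OBO Graphs style document; the key names below are assumptions.
    """
    rows = []
    for graph in doc.get("graphs", []):
        for node in graph.get("nodes", []):
            # Skip properties, individuals and unlabelled nodes.
            if node.get("type") != "CLASS" or "lbl" not in node:
                continue
            rows.append({"id": node["id"], "label": node["lbl"]})
    return rows


def to_jsonl(rows):
    """Serialise rows as JSON Lines, one object per line."""
    return "\n".join(json.dumps(row) for row in rows)
```

Writing the rows out as Parquet in the ETL output path, as requested earlier in the thread, would be a separate step that this sketch leaves out.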