opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Clean up biofeature mappings #2072

Closed Jeremy37 closed 1 year ago

Jeremy37 commented 2 years ago

In the example below, you can see that "Brain (DLPFC)" appears 4 times, with 2 different spellings, in the colocalisation section of the study-locus page. It should appear in only one column. The way biofeature mappings are handled is messy, which makes it hard to understand how the labels are set and therefore hard to get them right. This relates to the "biofeature hack" that Miguel said was necessary, though it isn't clear to me why. With some discussion between the right people in genetics, backend, and front-end, it should be possible to clean this up.

Currently, we have a table of mappings with lines like this:
{"original_source":"eqtl_catalogue_v2","study":"GTEx-eQTL","biofeature_string":"brain_frontal_cortex","biofeature_code":"BRAIN_FRONTAL_CORTEX","biofeature_ontology_term":"UBERON_0009834","label":"brain (DLPFC)","tissue_label":"brain (DLPFC)","condition_label":"naive"}

But for the biofeature hack, we need to create "composite" lines like the one below, with a "biofeature_code" that combines the study ID and the tissue ID. I think this is needed somewhere in the backend (BE) or front-end (FE), but I don't know exactly where.
{"original_source":"composite_source_feature_hack","biofeature_string":"GTEx-eQTL-brain_cortex","label":"Brain cortex (GTEx v8)","biofeature_code":"GTEx-eQTL-BRAIN_CORTEX","biofeature_ontology_term":"UBERON_0001870","study":"GTEx-eQTL","tissue_label":"brain (cortex)","condition_label":"naive"}
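To make the relationship between the two kinds of line concrete, here is a minimal Python sketch of how a composite line could be derived from a plain mapping line. This is only an illustration based on the two examples above, not the code the pipeline actually runs: `make_composite` is a hypothetical helper, and the assumption that the composite fields are just the study ID prefixed to the original values with a hyphen is inferred from the examples. The "label" field is copied unchanged, because labels such as "Brain cortex (GTEx v8)" cannot be derived from the other fields.

```python
import json


def make_composite(row: dict) -> dict:
    """Hypothetical helper: build a "composite" mapping line from a plain one.

    Assumption (inferred from the examples in this issue): the composite
    biofeature_string and biofeature_code are the study ID, a hyphen, and
    the original value.
    """
    composite = dict(row)  # copy so the input row is left untouched
    composite["original_source"] = "composite_source_feature_hack"
    composite["biofeature_string"] = f'{row["study"]}-{row["biofeature_string"]}'
    composite["biofeature_code"] = f'{row["study"]}-{row["biofeature_code"]}'
    return composite


# The plain mapping line quoted earlier in this issue.
row = json.loads(
    '{"original_source":"eqtl_catalogue_v2","study":"GTEx-eQTL",'
    '"biofeature_string":"brain_frontal_cortex",'
    '"biofeature_code":"BRAIN_FRONTAL_CORTEX",'
    '"biofeature_ontology_term":"UBERON_0009834",'
    '"label":"brain (DLPFC)","tissue_label":"brain (DLPFC)",'
    '"condition_label":"naive"}'
)

composite = make_composite(row)
print(json.dumps(composite))
```

If the composite lines really are a pure function of the plain ones, they could be generated at load time instead of being maintained by hand in the mapping table.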

image

DSuveges commented 2 years ago

As I understand it, a number of datasets (pQTL, eQTL, sQTL, intervals) feed into various parts of the genetics portal release pipelines, and each can carry cell-line/tissue annotation. The problem here is when and how to consolidate this annotation. The example "brain (DLPFC)" comes from two sources, GTEx-eQTL and CommonMind, but they are not aggregated. These annotations should end up in a single column.

Most likely, a curation process should take place on the collected tissue/cell annotations. The curated mapping file would then be picked up by the ETL when joining all this data together. Intuitively, there should be a single source of mappings that links all these annotations to common ground.
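As a rough illustration of what that consolidation step could look like, here is a small Python sketch that groups annotations from different studies under one canonical column key. Everything in it is an assumption made for illustration: `canonical_label` and `group_columns` are hypothetical names, the two rows are made-up stand-ins for the GTEx-eQTL and CommonMind annotations, and normalising case and whitespace is only one possible rule. A curated mapping file would replace `canonical_label` with an explicit lookup.

```python
from collections import defaultdict


def canonical_label(label: str) -> str:
    """Hypothetical normalisation: collapse whitespace and ignore case."""
    return " ".join(label.split()).casefold()


def group_columns(rows: list[dict]) -> dict[str, list[str]]:
    """Group study IDs by canonical tissue label, one key per display column."""
    columns = defaultdict(list)
    for row in rows:
        columns[canonical_label(row["tissue_label"])].append(row["study"])
    return dict(columns)


# Illustrative rows only: the same tissue annotated by two studies,
# with two different spellings of the label.
rows = [
    {"study": "GTEx-eQTL", "tissue_label": "brain (DLPFC)"},
    {"study": "CommonMind", "tissue_label": "Brain (DLPFC)"},
]

columns = group_columns(rows)
for key, studies in columns.items():
    print(key, studies)
```

With a shared key like this, both studies land in the same column instead of producing one column per spelling or per study.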

buniello commented 1 year ago

closing as added to feature doc