Closed d0choa closed 1 month ago
Is it going to be part of gentropy or orchestration? I guess this is going to be a DAG?
Ticket updated
Data generation is blocked by https://github.com/opentargets/issues/issues/3555
Hi @ireneisdoomed, #3555 has been completed, so this task should be unblocked.
Thank you @prashantuniyal02. I think I still need a new Gentropy data release to generate L2G results that align with the rest of the datasets.
Regarding the specification:
- locus2geneId - Not sure if it is needed, because we can resolve the L2G dataset based on target and locus identifiers.
- targetFromSourceId - Yes, we need this, because the ETL picks up this dataset like any other evidence; it doesn't matter if the IDs are already validated.
- diseaseFromSourceId - We should use diseaseFromSourceMappedId instead.
Specification at the top updated based on discussion with @ireneisdoomed and @DSuveges
Data
The new GWAS credible set evidence will require the following fields:
- studyLocusId - New field in the json_schema
- targetFromSourceId
- diseaseFromSourceMappedId
- resourceScore - L2G scoring will be used to score the evidence (as before)
- datasourceId - to be named gwas_credible_sets

The generation of this data needs to be implemented by taking the l2g_predictions and exploding the EFOs using the credible sets and study_index. For now, we will reproduce the previous L2G threshold of 0.05.
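The generation step described in this section (filter l2g_predictions by score, then emit one evidence row per mapped EFO) could be sketched roughly as below. This is plain Python over lists of dicts, not the actual gentropy implementation. The function name `build_evidence` and the input column names `geneId`, `score`, `studyId` and `traitFromSourceMappedIds` are assumptions made for illustration, as is treating the 0.05 threshold as inclusive.

```python
from typing import Dict, Iterable, Iterator

L2G_THRESHOLD = 0.05  # threshold quoted in this ticket


def build_evidence(
    l2g_predictions: Iterable[Dict],
    credible_sets: Iterable[Dict],
    study_index: Iterable[Dict],
    threshold: float = L2G_THRESHOLD,
) -> Iterator[Dict]:
    """Yield one evidence record per (prediction, EFO) pair.

    Input column names are assumptions, not confirmed by the ticket.
    """
    # studyLocusId -> studyId, taken from the credible sets
    locus_to_study = {cs["studyLocusId"]: cs["studyId"] for cs in credible_sets}
    # studyId -> list of mapped EFO ids, taken from the study index
    study_to_efos = {
        s["studyId"]: s.get("traitFromSourceMappedIds") or [] for s in study_index
    }
    for pred in l2g_predictions:
        # Assumption: predictions exactly at the threshold are kept
        if pred["score"] < threshold:
            continue
        study_id = locus_to_study.get(pred["studyLocusId"])
        # "Explode" the EFOs: one evidence row per mapped disease
        for efo in study_to_efos.get(study_id, []):
            yield {
                "datasourceId": "gwas_credible_sets",
                "studyLocusId": pred["studyLocusId"],
                "targetFromSourceId": pred["geneId"],
                "diseaseFromSourceMappedId": efo,
                "resourceScore": pred["score"],
            }
```

Predictions whose credible set or study has no mapped EFO produce no evidence in this sketch; whether that matches the intended behaviour would need confirming.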
As discussed with @DSuveges, it would be good to eventually complete the evidence with more metadata to make it more readable using the pre-existing fields in the schema (e.g. studyId, pValue, etc.). This is only scoping the minimum set of data that MUST be there.
Platform ETL
To consume this data we will need to run the platform ETL (from evidence downwards). For now, I would run this evidence side-by-side with the current evidence, even though associations will be corrupted. Eventually we will drop the old evidence.
Note: If I remember correctly, targetId and diseaseId are unique-fields by default, so this is just what else defines uniqueness.
Evidence API
New columns that need to be added to the evidence endpoint:
- studyLocusId - String -> Resolvable through credible sets API

A follow-up action will be to populate the top L2G column in the credible sets using this data, but we can scope that once this data has been loaded.
@ireneisdoomed could you hand-craft this dataset in JSONL for @jdhayhurst? I think we have been using studyLocusId from 22.06, so perhaps it's the least problematic one to use for now.
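For the hand-crafted JSONL, a minimal sketch of what serialising and sanity-checking the records might look like is given below. The helper name `to_jsonl` is hypothetical, and every identifier value in the usage note is a made-up placeholder rather than real data; only the five field names and the gwas_credible_sets datasource name come from the specification in this ticket.

```python
import json
from typing import Dict, Iterable

# Minimum field set listed in the specification of this ticket
REQUIRED_FIELDS = (
    "studyLocusId",
    "targetFromSourceId",
    "diseaseFromSourceMappedId",
    "resourceScore",
    "datasourceId",
)


def to_jsonl(records: Iterable[Dict]) -> str:
    """Serialise evidence records as JSONL (one JSON object per line).

    Raises ValueError if a record lacks any of the required fields.
    """
    lines = []
    for i, record in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            raise ValueError(f"record {i} is missing fields: {missing}")
        lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines) + "\n"
```

A call such as `to_jsonl([{"studyLocusId": "placeholder_locus", "targetFromSourceId": "placeholder_gene", "diseaseFromSourceMappedId": "placeholder_efo", "resourceScore": 0.5, "datasourceId": "gwas_credible_sets"}])` would return a single line that can be written straight to a `.jsonl` file.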