Background

For context, GWAS studies have a unique `study_id` for each analysis. In cases like the UKBB or GWAS Catalog studies with multiple GWASes, the ID is amended with an integer capturing a unique identifier for the trait under study (e.g. `NEALE_1`). This keeps each GWAS unique, and we can trace all the study details with a single `study_id`.

Molecular trait studies (eQTL, pQTL, sQTL) need a few extra considerations to capture all the granularity of the study. A single study (e.g. `study_id: GTEx-eQTL`) might contain multiple phenotypes (e.g. transcripts, proteins, splice sites) and multiple biofeatures (e.g. tissues, cell types). Unlike GWAS studies, if they come from the same publication/data release, they are all captured in a single `study_id` (e.g. `GTEx-eQTL`). This lack of granularity means that the `study_id` is not enough to characterise the trait, and we end up carrying a lot of metadata through pipelines, API and FE.

For example, in the next `credible_set` entry:

The next set of fields is information that we carry over to describe the study, and we are planning to expand this to include `gene_id` when appropriate. More info in #2688

In `coloc`, because we compare 2 studies, this is even more dramatic and it further complicates the handling of unique studies (example)

Proposal
The proposal here is to create an appropriate study index capturing molecular trait metadata. This index will populate information based purely on a new `study_id` that captures all the molecular trait granularity. This will allow us to build a GraphQL index that resolves the study entity consistently, making the data lighter and reducing the logic required in multiple places of the codebase. This will also help standardise the way study information is queried across the codebase.

Considerations
Biofeature mappings

There seem to be some unresolved issues around biofeature mappings that could be resolved as part of this work. The presence of the infamous hack seems to be just a patch to cover the absence of appropriate data modelling. https://docs.google.com/document/d/1uf3NH0u87DYbk3Uf_rjxqMa5R7KdbOO4TvkBwkmc7Ss/edit
`phenotype_id` -> `gene_id`

We currently do the `phenotype_id` -> `gene_id` mapping in several places, each with its own lookup table (LUT), and some of those tables are incomplete. This is an opportunity to resolve this issue. Some background is available in #2670

Search

Changes in the study index can affect how search behaves. We will need to review this when the time comes.
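
As a rough illustration of the proposal, the sketch below builds a small in-memory study index keyed by a granular `study_id` and resolves `gene_id` from a single lookup table. Everything named here is an assumption for discussion, not existing code: the `MolecularTraitStudy` record, the `make_study_id` and `build_study_index` helpers, the `<source study>_<biofeature>_<phenotype>` ID format, and the placeholder Ensembl-style IDs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class MolecularTraitStudy:
    """Hypothetical study index row for one molecular trait."""
    study_id: str            # granular ID, one per (source study, biofeature, phenotype)
    source_study_id: str     # today's coarse ID, e.g. "GTEx-eQTL"
    phenotype_id: str
    biofeature: str
    gene_id: Optional[str]   # None when the lookup table has no entry


def make_study_id(source_study_id: str, biofeature: str, phenotype_id: str) -> str:
    # Assumed ID format; the real convention is still to be agreed.
    return "_".join([source_study_id, biofeature, phenotype_id])


def build_study_index(
    records: Iterable[Tuple[str, str, str]],
    phenotype_to_gene: Dict[str, str],
) -> Dict[str, MolecularTraitStudy]:
    """Build the index from (source_study_id, phenotype_id, biofeature) tuples."""
    index: Dict[str, MolecularTraitStudy] = {}
    for source_study_id, phenotype_id, biofeature in records:
        study_id = make_study_id(source_study_id, biofeature, phenotype_id)
        if study_id in index:
            # Also catches collisions caused by the naive "_" separator.
            raise ValueError(f"duplicate study_id: {study_id}")
        index[study_id] = MolecularTraitStudy(
            study_id=study_id,
            source_study_id=source_study_id,
            phenotype_id=phenotype_id,
            biofeature=biofeature,
            gene_id=phenotype_to_gene.get(phenotype_id),  # single LUT, single place
        )
    return index


# Placeholder data, not real identifiers.
phenotype_to_gene = {"ENST00000000001": "ENSG00000000001"}
records = [
    ("GTEx-eQTL", "ENST00000000001", "whole_blood"),
    ("GTEx-eQTL", "ENST00000000002", "liver"),
]

index = build_study_index(records, phenotype_to_gene)

# Incomplete mappings become visible in one place instead of failing silently.
unmapped = [s.study_id for s in index.values() if s.gene_id is None]
print(unmapped)
```

With something along these lines, downstream entries such as `credible_set` or `coloc` rows would only need to carry the granular `study_id`, and the remaining metadata could be resolved from the index.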
@DSuveges can you help me review this ticket, scope it and assign it appropriately?