ireneisdoomed closed this issue 1 month ago
I've implemented a solution where I've been able to map ~70% (1366/2015) of the newer traits. I'm pretty satisfied with it, because most of the traits without a mapping are new traits that weren't part of the 450k set. There are 1473 unique traits in the 450k set, and I can now map 1366 of them, which largely solves the problem of parity between releases.
The implementation is based on computing similarity between traits using cosine distance, a common approach. I've generated vector representations of each trait in the old and new sets and created a vector DB that I can query to extract the most similar ID for each vector. It may be a little overkill for this, but I'd already done something similar, and I think disease mapping is a good use case for this type of solution. The code is in this Colab https://colab.research.google.com/drive/1l15sHiGpItZrXnvXWwJiyyCW9gQ3KCcT?usp=sharing
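The nearest-neighbour idea can be sketched without an embedding model or vector DB: represent each trait as a vector, score candidates by cosine similarity, and keep the best hit. The toy bag-of-words vectors below are only a stand-in for whatever dense representations the notebook actually generates:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a stand-in for the real vector
    # representations generated in the notebook.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(new_trait, old_traits):
    # Query step: return the old trait whose vector is closest to the new one.
    scored = [(old, cosine(embed(new_trait), embed(old))) for old in old_traits]
    return max(scored, key=lambda x: x[1])

old_traits = ["Chronic ischaemic heart disease", "Type 2 diabetes"]
match, score = best_match("chronic ischaemic heart disease", old_traits)
print(match, score)  # → Chronic ischaemic heart disease 1.0
```

A vector DB does the same top-k lookup, just with an index that scales to thousands of traits.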
I've created a spreadsheet to review the results and curate the remaining 600 traits. Overall, the average similarity score is 0.87. To help with curating the remaining traits, I'll use the other burden evidence for a gene as an indicator, if there is any. https://docs.google.com/spreadsheets/d/1mCGJVpZr0DjyqWvPeapbZOtG7iRmhgQiGzGZQsSsR7Y/edit?usp=sharing
Once we have done this, I can update the mappings table and generate evidence.
Continuation
Suggesting mappings based on other associations exploded the data too much, so I preferred a similarity-based approach here as well. In this case, I computed similarity against the traits present in all burden results. I was able to map the remaining 649; however, the average similarity score dropped to 0.73, which indicates that we have to take a closer look at these.
Overall, half of the traits have a similarity score >0.85, which greatly helps curation. I think all results with a similarity score lower than 0.8 need to be checked. I'll validate the results next.
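The triage rule above could look like this. The trait names, EFO IDs, and scores are made up for illustration; the thresholds are the ones mentioned in the comment:

```python
# Hypothetical mapping results: (trait description, mapped EFO ID, similarity score).
mappings = [
    ("Union#Asthma", "EFO_0000270", 0.93),
    ("41202#I25#Chronic ischaemic heart disease", "EFO_0001645", 0.88),
    ("Some new 470k trait", "EFO_0005271", 0.74),
]

ACCEPT, REVIEW = 0.85, 0.80  # thresholds from the comment above
accepted = [m for m in mappings if m[2] >= ACCEPT]   # high confidence
to_check = [m for m in mappings if m[2] < REVIEW]    # needs manual curation
print(len(accepted), len(to_check))  # → 2 1
```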
Updated spreadsheet: https://docs.google.com/spreadsheets/d/1mhavbJejerCeSARev1Xym2ilSyEpoCksPMnWivrzZhI/edit#gid=0 Updated code: https://colab.research.google.com/drive/1l15sHiGpItZrXnvXWwJiyyCW9gQ3KCcT?authuser=1
It looks like the UK Biobank 460k WGS public data release doesn't include burden tests for biomarkers/blood chemistry or blood cell counts. Please confirm with AZ. Thanks.
Hi @AR-Shicheng, the About page of the PheWAS portal says that their analyses of the 460k WGS data are included. See the results section of their paper for more details https://www.medrxiv.org/content/10.1101/2023.12.06.23299426v1
I'd like to confirm that the 'AZ Phewas 470k burden results' are based on WGS data, correct? I'm confused because the term 'ExWAS' is used, which I believe refers to exome data. Could you clarify this? Specifically, does 'ExWAS results from a cohort of 450k to 470k people' refer to exome data?
Hi @AR-Shicheng, I understand the confusion. The documentation isn't very clear, but let me clarify.
The 470k set, referred to as v5, is based on exome data, while the 460k genomes represent a different dataset. When using their portal, you can choose which set of results you want to browse at the top of the page:
Currently, I don't think there is a combined view of these datasets. For our purposes, we are integrating the v5 dataset, which is the data they have released. This includes only the associations based on WES data.
I hope this clears things up!
Thank you for the clarification. Will the 460K WGS data from the AZ website be included in the next release of Open Targets? Thanks.
The associations from burden results in the AZ Phewas Portal were updated at the end of last year. The update aggregates ExWAS results from a cohort of 450k to 470k people. We want to ingest this.
Background
The raw data are several compressed CSVs that we can download from the CGR Portal. I've converted them to Parquet files:
gs://otar000-evidence_input/GeneBurden/data_files/azphewas-com-470k-phewas-binary
gs://otar000-evidence_input/GeneBurden/data_files/azphewas-com-470k-phewas-quantitative
gs://otar000-evidence_input/GeneBurden/data_files/azphewas-com-470k-phewas-proteomics
Comparison between 470k and 450k results
Our evidence overall consists of filtering the burden results to keep: 1) associations below the 1E-7 significance threshold, and 2) associations that involve non-synonymous variants. I've generated the evidence for the new data and compared it to the previous release (currently in production). The new evidence is here:
gs://ot-team/irene/az_470k_evd
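A minimal sketch of that two-condition filter. The field names and collapsing-model labels are assumptions for illustration, not the real AZ PheWAS schema:

```python
# Assumed labels for collapsing models that involve non-synonymous variants.
NON_SYNONYMOUS_MODELS = {"ptv", "ptvraredmg", "raredmg"}

def keep(assoc):
    # Condition 1: below the 1E-7 significance threshold.
    # Condition 2: the association involves non-synonymous variants.
    return assoc["pvalue"] < 1e-7 and assoc["collapsing_model"] in NON_SYNONYMOUS_MODELS

# Hypothetical burden results.
associations = [
    {"gene": "PCSK9", "pvalue": 3e-12, "collapsing_model": "ptv"},
    {"gene": "APOB",  "pvalue": 5e-6,  "collapsing_model": "ptv"},
    {"gene": "LDLR",  "pvalue": 1e-9,  "collapsing_model": "syn"},
]
evidence = [a for a in associations if keep(a)]
print([a["gene"] for a in evidence])  # → ['PCSK9']
```

In practice the same predicate runs over the Parquet files above with Spark rather than plain Python.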
I think this shows that we are now able to find statistically significant associations thanks to the greater sample size. This means that we could potentially have a very substantial gain of 60% more associations.
Sorting out phenotype mapping
The gain in new associations means that we have broader coverage of traits, which we have to map.
On top of that, I've also realised that many of the phenotype descriptions have changed between cohorts:
This just means that although the disease is the same between cohorts, its description has changed. This is annoying because we map traits by name using this table. An example:
41202#I25#I25 Chronic ischaemic heart disease in the 450k set
41202#I25#Chronic ischaemic heart disease in the 470k set (note the second I25 is missing)
Union phenotypes also have different descriptions: the 450k set says Union, the 470k set says union. This is simple to address.
Tasks
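Those two discrepancies could be handled with a small normaliser before joining on trait name. This is a hypothetical sketch, not the production code; it only covers the duplicated ICD-10 code and the Union/union casing described above:

```python
import re

def normalise(trait):
    # Drop a duplicated ICD-10 code after the second '#'
    # (e.g. '41202#I25#I25 ...' -> '41202#I25#...').
    trait = re.sub(r"^(\d+#([A-Z]\d+)#)\2 ", r"\1", trait)
    # Lowercase the 'Union' prefix so 450k and 470k names line up.
    return re.sub(r"^Union", "union", trait)

old = normalise("41202#I25#I25 Chronic ischaemic heart disease")  # 450k form
new = normalise("41202#I25#Chronic ischaemic heart disease")      # 470k form
print(old == new)  # → True
```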
The biggest problem with ingesting the new data is mapping the new traits. The data model has not changed, so once the mapping is sorted, ingestion will be easy.
In the following comments I'll describe what we can do about it, plus some notes on the proteomics set, which I don't think is in scope for this release.