Ingest AZ Phewas 470k burden results

ireneisdoomed commented 1 month ago

The associations from burden results in the AZ Phewas Portal were updated at the end of last year. The update extends the ExWAS results from a cohort of 450k to 470k people. We want to ingest this update.

Background

The raw data are several compressed CSVs that we can download from the CGR Portal. I've converted them to Parquet files:

Results from binary phenotypes: gs://otar000-evidence_input/GeneBurden/data_files/azphewas-com-470k-phewas-binary
Results from quantitative phenotypes: gs://otar000-evidence_input/GeneBurden/data_files/azphewas-com-470k-phewas-quantitative
Results from proteomics measurements (burden on PPP): gs://otar000-evidence_input/GeneBurden/data_files/azphewas-com-470k-phewas-proteomics

Comparison between 470k and 450k results

Overall, our evidence generation consists of filtering the burden results to keep: 1) associations below the 1E-7 p-value threshold, 2) associations that involve non-synonymous variants. I've generated the evidence for the new data and compared it to the previous one (currently in production). The new evidence is here: gs://ot-team/irene/az_470k_evd

	450k	470k
# evidence	18129	28444
# unique target/phenotype	5592	9004
mean p Value	7.87E-9	7.58E-9
# unique phenotypes	1473	2015
# binary phenotypes	970	1156
# quantitative phenotypes	502	856
# union phenotypes	2939	3745

I think this shows that we are now able to find more statistically significant associations thanks to the greater sample size. This means that we could potentially have a very substantial gain of around 60% more associations (18129 to 28444 evidence, 5592 to 9004 unique target/phenotype pairs).
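The two filters described above (p-value below 1E-7, non-synonymous variants only) could be sketched as follows. This is a minimal illustration, not the production evidence generator: the field names `pValue` and `collapsingModel`, the `"syn"` label for a synonymous-only model, and the example rows are all assumptions made up for the sketch.

```python
# Threshold taken from the issue text: keep associations below 1E-7.
P_VALUE_THRESHOLD = 1e-7

# Assumption: synonymous-only burden models carry this label.
SYNONYMOUS_MODELS = {"syn"}


def keep_association(row: dict) -> bool:
    """Return True if a burden result passes both filters."""
    is_significant = row["pValue"] < P_VALUE_THRESHOLD
    is_non_synonymous = row["collapsingModel"] not in SYNONYMOUS_MODELS
    return is_significant and is_non_synonymous


# Hypothetical rows, only to exercise the filter.
rows = [
    {"gene": "GENE_A", "pValue": 3e-12, "collapsingModel": "ptv"},
    {"gene": "GENE_B", "pValue": 2e-5, "collapsingModel": "ptv"},
    {"gene": "GENE_C", "pValue": 1e-9, "collapsingModel": "syn"},
]

evidence = [row for row in rows if keep_association(row)]
print([row["gene"] for row in evidence])
```

Only the first row survives: the second fails the significance threshold and the third is a synonymous-only model.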

Sorting out phenotype mapping

The gain in new associations means that we have broader trait coverage, and these new traits have to be mapped.

On top of that, I've also realised that many of the phenotype descriptions have changed between cohorts:

97% of the 470k associations are not found in the 450k set (potentially new)
96% of the 450k associations are not found in the 470k set (potentially lost)

This just means that although the disease is the same between cohorts, its description has changed. This is annoying because we map traits by name using this table. An example:

LDLR is associated with 41202#I25#I25 Chronic ischaemic heart disease in the 450k set
LDLR is associated with 41202#I25#Chronic ischaemic heart disease in the 470k set (note the second I25 is missing)

Union phenotypes also have different descriptions: 450k says Union, 470k says union. This is simple to address.
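To illustrate why almost nothing matches by name, here is a small sketch that normalises descriptions before comparing the two sets. The normalisation rule (drop a repeated code at the start of the last `#`-separated field, then lowercase) is an assumption generalised from the single LDLR example and the Union/union case above, so it would need checking against the real data. The helper names are hypothetical.

```python
def normalise_description(description: str) -> str:
    """Normalise a phenotype description so both cohort formats compare equal."""
    parts = description.split("#")
    if len(parts) >= 3:
        code, label = parts[-2], parts[-1]
        # "I25 Chronic ..." -> "Chronic ..." when the code is repeated.
        if label.startswith(code + " "):
            parts[-1] = label[len(code) + 1:]
    return "#".join(parts).strip().casefold()


def pct_not_found(source: set, reference: set) -> float:
    """Percentage of items in `source` that are absent from `reference`."""
    if not source:
        return 0.0
    return 100.0 * len(source - reference) / len(source)


# The (gene, description) pair quoted in the issue.
old_set = {("LDLR", "41202#I25#I25 Chronic ischaemic heart disease")}
new_set = {("LDLR", "41202#I25#Chronic ischaemic heart disease")}

raw_missing = pct_not_found(new_set, old_set)

old_norm = {(gene, normalise_description(desc)) for gene, desc in old_set}
new_norm = {(gene, normalise_description(desc)) for gene, desc in new_set}
normalised_missing = pct_not_found(new_norm, old_norm)

print(raw_missing, normalised_missing)
```

With exact string matching the association looks new; after normalisation it is recognised as the same one.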

Tasks

The biggest problem with ingesting the new data is mapping the new traits. The data model has not changed, so once the mapping is sorted, ingesting the data will be easy.

In the following comments I'll describe what we can do about it, and add some comments about the proteomics set, which I don't think is in scope for this release.

ireneisdoomed commented 1 month ago

Phenotype mapping

I've implemented a solution that maps ~68% (1366/2015) of the newer traits. I'm pretty satisfied with it because most of the traits without a mapping are new traits that weren't part of the 450k set. There are 1473 unique traits in the 450k set and I am now able to map 1366 traits, which largely solves the problem of parity between releases.

The implementation computes the similarity between traits as the cosine similarity of their vector representations, a common approach. I've generated vector representations of each trait in the old and the new set, and created a vector DB that I can query to extract the most similar ID for a given vector. Maybe it is a bit overkill for this, but I'd already done something similar and I think disease mapping is a good use case for this type of solution. The code is in this Colab: https://colab.research.google.com/drive/1l15sHiGpItZrXnvXWwJiyyCW9gQ3KCcT?usp=sharing
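For readers who don't open the Colab, the general idea of nearest-neighbour matching by cosine similarity can be sketched with the standard library alone. This is not the Colab implementation: instead of an embedding model and a vector DB it uses character trigram counts and a linear scan, purely as a stand-in. The function names and the second candidate description are made up for the example.

```python
import math
from collections import Counter


def trigram_vector(text: str) -> Counter:
    """Represent a description as a bag of character trigrams."""
    padded = f"  {text.casefold()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query: str, candidates: list) -> tuple:
    """Return the candidate most similar to `query` and its score."""
    query_vec = trigram_vector(query)
    scored = [(c, cosine_similarity(query_vec, trigram_vector(c))) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


# First entry is the 450k description from the issue; the second is illustrative.
old_traits = [
    "41202#I25#I25 Chronic ischaemic heart disease",
    "41202#X00#X00 Some unrelated illustrative trait",
]

match, score = best_match("41202#I25#Chronic ischaemic heart disease", old_traits)
print(match, round(score, 2))
```

A real setup would swap `trigram_vector` for whatever embedding is used and the linear scan for a vector index, but the matching logic (pick the candidate with the highest cosine similarity and keep the score for curation) stays the same.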

I've created a spreadsheet to look at the results and curate the remaining 649 traits (2015 minus 1366). Overall, the average similarity score is 0.87. To help curation of the remaining traits, I'll use the other burden evidence for a gene as an indicator, if there is any. Spreadsheet: https://docs.google.com/spreadsheets/d/1mCGJVpZr0DjyqWvPeapbZOtG7iRmhgQiGzGZQsSsR7Y/edit?usp=sharing

Once we have done this, I can update the mappings table and generate evidence.

ireneisdoomed commented 1 month ago

👆Continuation

Suggesting mappings based on other associations was exploding the data too much, so I preferred to take a similarity-based approach here as well. In this case, I computed similarity against the traits present in all burden results. I was able to map the remaining 649 traits; however, the average similarity score went down to 0.73, which indicates that we have to take a closer look at these.

Overall, half of the traits have a similarity score >0.85, which greatly helps curation. I think all results with a similarity score lower than 0.8 need to be checked. I'll validate the results next.
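The triage rule above (check everything scoring below 0.8) could be applied to the mapping table along these lines. The record layout and the example rows are hypothetical; only the 0.8 cut-off comes from this comment.

```python
# Cut-off proposed above: anything below this score gets a manual check.
REVIEW_THRESHOLD = 0.8


def split_for_review(mappings: list) -> tuple:
    """Split mappings into (accepted, to_review) by similarity score."""
    accepted = [m for m in mappings if m["score"] >= REVIEW_THRESHOLD]
    to_review = [m for m in mappings if m["score"] < REVIEW_THRESHOLD]
    # Lowest scores first, so the least certain mappings are curated first.
    to_review.sort(key=lambda m: m["score"])
    return accepted, to_review


# Hypothetical mapping records, only to exercise the split.
mappings = [
    {"trait": "trait_a", "candidate": "candidate_a", "score": 0.93},
    {"trait": "trait_b", "candidate": "candidate_b", "score": 0.79},
    {"trait": "trait_c", "candidate": "candidate_c", "score": 0.61},
]

accepted, to_review = split_for_review(mappings)
print(len(accepted), [m["trait"] for m in to_review])
```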

Updated spreadsheet: https://docs.google.com/spreadsheets/d/1mhavbJejerCeSARev1Xym2ilSyEpoCksPMnWivrzZhI/edit#gid=0
Updated code: https://colab.research.google.com/drive/1l15sHiGpItZrXnvXWwJiyyCW9gQ3KCcT?authuser=1

AR-Shicheng commented 1 month ago

It looks like the UK Biobank 460k WGS public data release doesn't include burden tests for biomarkers/blood chemistry and blood cell counts. Please confirm with AZ. Thanks.

ireneisdoomed commented 1 month ago

Hi @AR-Shicheng, the About page of the PheWAS portal says that their analyses of the 460k WGS data are included. See the results section of their paper for more details https://www.medrxiv.org/content/10.1101/2023.12.06.23299426v1

AR-Shicheng commented 2 weeks ago

I'd like to confirm that the 'AZ Phewas 470k burden results' are based on WGS data, correct? I'm confused because the term 'ExWAS' is used, which I believe represents Exome. Could you clarify this? Specifically, does 'ExWAS results from a cohort of 450k to 470k people' refer to exome data?

ireneisdoomed commented 2 weeks ago

Hi @AR-Shicheng, I understand the confusion. The documentation isn't very clear, but let me clarify. The 470k set, referred to as v5, is based on exome data, while the 460k genomes represent a different dataset. When using their portal, you can choose which set of results you want to browse at the top of the page.

Currently, I don't think there is a combined view of these datasets. For our purposes, we are integrating the v5 dataset, which is the data they have released. This includes only the associations based on WES data.

I hope this clears things up!

AR-Shicheng commented 2 weeks ago

Thank you for the clarification. Will the 460K WGS data from the AZ website be included in the next release of Open Targets? Thanks.
