opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Harmonise UKB PPP data #3234

Closed tskir closed 1 week ago

tskir commented 3 months ago

Notes on data

Storage

Format

Processing approach

Scope

Harmonisation

ALLELE0 and ALLELE1 do not necessarily correspond to REF and ALT. Hence, we adopt the following algorithm:

Filtering

We will use hard filtering as described below, meaning that variants that don't pass the filters are fully discarded and will not be present in the output

After filtering, we expect to observe anywhere from 13 to 20 million SNPs

Special cases

tskir commented 3 months ago

CC @addramir @d0choa @DSuveges

DSuveges commented 3 months ago

Just double checking: when you refer to the variant index, it is the full GnomAD4 variant annotation table, right? These decisions sound solid; just confirming that we are dropping variants if they are interpreted on the reverse strand? (I think this is a fine choice)

Hard filtering is fine; however, as an exploratory step, it would be great to see some stats on:

These tests add some extra work; however, getting these numbers for at least one study would be very informative and would give us knowledge of what is actually happening.

After filtering, we expect to observe anywhere from 13 to 20 million SNPs

Just curious, what is the starting number?

tskir commented 3 months ago

Just double checking: when you refer to the variant index, it is the full GnomAD4 variant annotation table, right?

Thank you for bringing this up, this is actually something I also wanted to confirm with @addramir. I didn't work with the variant index before, so I'd appreciate the link to which exact dataset I should use.

These decisions sound solid; just confirming that we are dropping variants if they are interpreted on the reverse strand?

I think (@addramir might be better suited to confirm/deny this) that we expect all variants to be reported on the forward strand; the main issue is that sometimes the alleles will be flipped, so that ALLELE1 is the reference and ALLELE0 is the alternate. So we don't expect a lot of cases where neither orientation is found in the variant index and we have to drop the variant.
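
For illustration only (not the final pipeline code; column names such as beta and allele_frequency are assumptions), the orientation check could look roughly like this:

from pyspark.sql import functions as f

# Assuming `sumstats` and `variant_index` DataFrames that can be joined on chromosome and position.
joined = sumstats.join(variant_index, on=["chromosome", "position"])

# Variants where ALLELE0/ALLELE1 already match REF/ALT: keep as-is.
direct = joined.where((f.col("ALLELE0") == f.col("REF")) & (f.col("ALLELE1") == f.col("ALT")))

# Variants where the alleles are swapped: flip the effect direction and the allele frequency.
flipped = (
    joined.where((f.col("ALLELE0") == f.col("ALT")) & (f.col("ALLELE1") == f.col("REF")))
    .withColumn("beta", -f.col("beta"))
    .withColumn("allele_frequency", 1 - f.col("allele_frequency"))
)

# Variants matching neither orientation are dropped by construction.
harmonised = direct.unionByName(flipped)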

as an exploratory step, it would be great to see some stats on

I will definitely include the detailed statistics on how many variants were dropped for which reason, not just for one study but generally in the harmonisation logs. It's going to be useful for debugging and for downstream analysis.
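
As a tiny sketch of what those per-reason counts could look like (assuming a hypothetical intermediate DataFrame with a drop_reason column; not the actual logging code):

# Hypothetical: every input variant carries a label explaining why it was dropped (or kept).
drop_stats = annotated_sumstats.groupBy("drop_reason").count().orderBy("count", ascending=False)
drop_stats.show(truncate=False)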

Just curious, what is the starting number?

I've chosen a few random studies and the results were: 15 933 877, 15 933 061, 15 952 258.

d0choa commented 3 months ago

@tskir as for the gzipped files: this approach worked quite smoothly for the eQTL Catalogue. You are working with 10 times more data, so we will see... https://github.com/opentargets/gentropy/blob/dev/src/airflow/dags/eqtl_preprocess.py

tskir commented 3 months ago

@d0choa Unfortunately that template won't work here: it expects a folder with a bunch of .gz files, while what we have is a folder with a bunch of TARs, each of which has a bunch of .gz files inside of it.

But those templates are based on Dataflow, and since I worked with it recently, I should be able to write a simple pipeline to do the decompression, so thank you for the pointer either way!
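
For reference, the core extraction logic (independent of whether it ends up in Dataflow) is roughly this; a local sketch with placeholder paths:

import gzip
import shutil
import tarfile
from pathlib import Path

def extract_tar_of_gz(tar_path: str, out_dir: str) -> None:
    """Unpack a TAR archive and decompress every .gz member inside it."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not member.name.endswith(".gz"):
                continue
            gz_stream = tar.extractfile(member)
            target = out / Path(member.name).with_suffix("").name
            with gzip.open(gz_stream) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)

# Placeholder usage:
# extract_tar_of_gz("some_study.tar", "decompressed/")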

tskir commented 3 months ago

Three questions about populating the study index

  1. For pQTL studies like UKB PPP, what do we put into the traitFromSource / traitFromSourceMappedIds fields? Do we just not populate them? Do we put some term for “protein expression / levels”? I distinctly remember there used to be a discussion about this, but I don't think we decided on anything.
  2. For the studyId field, do I just use the original ID with the UKB_PPP prefix? The result would look like this: UKB_PPP_AARSD1_Q9BTE6_OID21311_v1.
  3. For each study, the original metadata also contains panel information, for example “Oncology” or “Inflammation”. Do we need this information at all? If so, where do I put it? I don't think there is a field for it, so it might need to be created if we want this.
addramir commented 3 months ago
  1. I thought it should be like the name of the file or the name of the GWAS from their metadata file, without adding anything else (like "levels" or similar). What is "traitFromSourceMappedIds", btw?
  2. Hm, I would use something like "UKB_PPP_EUR_{target_id}", where EUR is the population we are ingesting.
  3. I would drop it.
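
In other words, something like this (purely illustrative, assuming target_id corresponds to the UKBPPP_ProteinID with the colons replaced):

# Illustrative only: build a study ID from the source protein ID and the population.
protein_id = "AARSD1:Q9BTE6:OID21311:v1"  # UKBPPP_ProteinID from the metadata
study_id = "UKB_PPP_EUR_" + protein_id.replace(":", "_")
# -> "UKB_PPP_EUR_AARSD1_Q9BTE6_OID21311_v1"
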
tskir commented 3 months ago

@addramir

2, 3 — thank you, got it

  1. I thought it should be like the name of the file or the name of the GWAS from their metadata file, without adding anything else (like "levels" or similar). What is "traitFromSourceMappedIds", btw?

The problem is that there is no trait explicitly specified. Here are all of the metadata fields for one example study:

UKBPPP_ProteinID AARSD1:Q9BTE6:OID21311:v1
olink_target_fullname Alanyl-tRNA editing protein Aarsd1
OlinkID OID21311
UniProt Q9BTE6
Assay AARSD1
Panel Oncology
Panel_Lot_Nr B04412
UniProt2 Q9BTE6
HGNC.symbol AARSD1
ensembl_id ENSG00000266967
chr 17
gene_start 42950526
gene_end 42964498
Strand -1
Dilution_factor 1:1
block B
expansion no

In theory, traitFromSource should contain the original description of the trait being measured (which is lacking in this case), and traitFromSourceMappedIds would contain one (or many) mapped ontology IDs representing the same trait. So for regular GWASes, traitFromSource might be Alzheimer's disease and traitFromSourceMappedIds could be [EFO:1001870].

Here, the implied trait is “expression of a (given) protein”, which I'm not sure how to represent correctly, especially given that we already have the separate "gene" column in the study index. Since the context should be clear from the study type (pQTL) plus the gene, we might just drop those fields and it would be fine.

@d0choa Do you have any thoughts on this?

addramir commented 3 months ago

I see. Well, my guess is that traitFromSource should be "olink_target_fullname" and traitFromSourceMappedIds should be the same for all: http://www.ebi.ac.uk/efo/EFO_0007937. Don't forget that we need a column that represents the target_id and links the protein and the gene (ensembl_id?); I don't know what the name for this column is in the study index.
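
Roughly, the mapping could then look like this (illustrative only; the actual study index schema and helper code may differ):

from pyspark.sql import functions as f

# Illustrative only: metadata_df holds the UKB PPP metadata fields shown above.
study_index_df = metadata_df.select(
    f.concat(f.lit("UKB_PPP_EUR_"), f.regexp_replace("UKBPPP_ProteinID", ":", "_")).alias("studyId"),
    f.lit("UKB_PPP").alias("projectId"),
    f.lit("pqtl").alias("studyType"),
    f.col("olink_target_fullname").alias("traitFromSource"),
    f.col("ensembl_id").alias("geneId"),
)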

d0choa commented 3 months ago

Without going into detail, we want to be consistent with what we have in the eQTL Catalogue pQTL (Sun et al.) code.

In [14]: df.filter(f.col("studyType") == "pqtl").show(3, vertical=True, truncate = False)
-RECORD 0------------------------------------------------------------------------------------------
 studyId              | Sun_2018_plasma_HLA.DQA2.7757.5.3..1
 projectId            | Sun_2018
 studyType            | pqtl
 traitFromSource      | HLA.DQA2.7757.5.3..1
 geneId               | ENSG00000237541
 tissueFromSourceId   | UBERON_0001969
 nSamples             | 3301
 summarystatsLocation | https://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/QTS000035/QTD000584
 hasSumstats          | true
-RECORD 1------------------------------------------------------------------------------------------
 studyId              | Sun_2018_plasma_TAC1.9337.43.3..1
 projectId            | Sun_2018
 studyType            | pqtl
 traitFromSource      | TAC1.9337.43.3..1
 geneId               | ENSG00000006128
 tissueFromSourceId   | UBERON_0001969
 nSamples             | 3301
 summarystatsLocation | https://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/QTS000035/QTD000584
 hasSumstats          | true
-RECORD 2------------------------------------------------------------------------------------------
 studyId              | Sun_2018_plasma_IL5RA.13686.2.3..1
 projectId            | Sun_2018
 studyType            | pqtl
 traitFromSource      | IL5RA.13686.2.3..1
 geneId               | ENSG00000091181
 tissueFromSourceId   | UBERON_0001969
 nSamples             | 3301
 summarystatsLocation | https://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/QTS000035/QTD000584
 hasSumstats          | true
only showing top 3 rows

The critical fields are:

Knowing the exact protein isoform identifiers (e.g. the Olink ID or UniProt accession) is nice to have but not critical. You can freely play with traitFromSource, or even concatenate several IDs, as long as the result stays readable.

As you can see, traitFromSourceMappedIds is not populated. That's fine because we don't use QTLs to derive Target-Disease/Phenotype evidence, only variant/locus to gene.

tskir commented 3 months ago

@d0choa @addramir Test runs revealed two further problems with the data. One is very easy to handle and the other is quite complex, so I'll post the latter as a separate comment.

1. One entry from metadata table does not correspond to a summary stats file

For all ancestries (I'm currently ingesting just Europeans but checked all of them), all but one entity in the original metadata table corresponds to a summary stats file. Likewise, all summary stats files have a corresponding metadata entity (some more than one — more on that below!)

However, the GLIPR1:P48060:OID31517:v1 entry in the metadata table does not correspond to any summary stats files. There's not much to do other than manually mark and skip it, which I will do.
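
For the record, the skip itself is a one-liner (a sketch, assuming the metadata is loaded as a Spark DataFrame named metadata_df):

from pyspark.sql import functions as f

# Drop the single metadata entry that has no corresponding summary stats file.
metadata_df = metadata_df.filter(f.col("UKBPPP_ProteinID") != "GLIPR1:P48060:OID31517:v1")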

tskir commented 3 months ago

2. Some summary stats correspond to a measurement of a mixture of proteins

For some summary stats files (to be exact, 15 of them) the metadata table contains more than one entry.

Example 1: α-Amylase

UKBPPP_ProteinID olink_target_fullname OlinkID UniProt Assay Panel Panel_Lot_Nr UniProt2 HGNC.symbol ensembl_id chr gene_start gene_end Strand Dilution_factor block expansion
AMY1A_AMY1B_AMY1C:P0DUB6_P0DTE7_P0DTE8:OID30707:v1 Alpha-amylase 1A_Alpha-amylase 1B_Alpha-amylase 1C OID30707 P0DUB6_P0DTE7_P0DTE8 AMY1A_AMY1B_AMY1C Inflammation_II B20704 P0DUB6 AMY1A ENSG00000237763 1 103655760 103664554 1 17:40 C yes
AMY1A_AMY1B_AMY1C:P0DUB6_P0DTE7_P0DTE8:OID30707:v1 Alpha-amylase 1A_Alpha-amylase 1B_Alpha-amylase 1C OID30707 P0DUB6_P0DTE7_P0DTE8 AMY1A_AMY1B_AMY1C Inflammation_II B20704 P0DTE7 AMY1B ENSG00000174876 1 103687415 103696454 -1 17:40 C yes
AMY1A_AMY1B_AMY1C:P0DUB6_P0DTE7_P0DTE8:OID30707:v1 Alpha-amylase 1A_Alpha-amylase 1B_Alpha-amylase 1C OID30707 P0DUB6_P0DTE7_P0DTE8 AMY1A_AMY1B_AMY1C Inflammation_II B20704 P0DTE8 AMY1C ENSG00000187733 1 103745323 103758726 1 17:40 C yes

My understanding is that, in the assay, they are measuring total alpha amylase, which in reality consists of three subunits coded by three genes, and so they report it this way.

Example 2: Interleukin-12

Note that, in the first example, all subunits are on the same chromosome and even roughly in the same region. This is true for all but one example. But for Interleukin-12, the two subunits are on different chromosomes:

UKBPPP_ProteinID olink_target_fullname OlinkID UniProt Assay Panel Panel_Lot_Nr UniProt2 HGNC.symbol ensembl_id chr gene_start gene_end Strand Dilution_factor block expansion
IL12A_IL12B:P29459_P29460:OID21327:v1 Interleukin-12 OID21327 P29459_P29460 IL12A_IL12B Oncology B04412 P29459 IL12A ENSG00000168811 3 159988835 159996019 1 1:1 B no
IL12A_IL12B:P29459_P29460:OID21327:v1 Interleukin-12 OID21327 P29459_P29460 IL12A_IL12B Oncology B04412 P29460 IL12B ENSG00000113302 5 159314780 159330487 -1 1:1 B no

All affected targets

Alpha-amylase 1A_Alpha-amylase 1B_Alpha-amylase 1C
Beta-defensin 103
Beta-defensin 104
Beta-defensin 4A
BolA-like protein 2
Cancer/testis antigen 1
Choriogonadotropin subunit beta 3
Creatine kinase U-type, mitochondrial
Galactoside 3_5
Galectin-7
Interleukin-12
Interleukin-27
MHC class I polypeptide-related sequence B_A
Neutrophil defensin 1
Sperm acrosome-associated protein 5

Possible solutions

  1. Simply ignore those 15 targets for now and do not ingest
  2. Create duplicate study index entries, so that N entries in the original metadata table = N entries in the study index table, all linking to the same summary stats file. This will be problematic because my understanding is that we have always assumed that 1 study index record = 1 summary stats file, which are linked by study ID, and this will almost certainly break something.
  3. Cram all information into a single study index entry. This will be even more problematic because, for one, we currently assume that 1 study index entry for pQTL = 1 gene, which will no longer be the case.
  4. For each set of “duplicated” metadata table entries, keep only one. This leads to loss of information and potential inconsistencies. We'll also need to decide how to choose which gene to keep.
  5. Massively rework study index / summary stats structure and their relationship (just for 15 genes in one study?..)

TBH none of those solutions strike me as very good, so @d0choa @addramir I'd love to hear your opinions on this

d0choa commented 3 months ago

I had a brief discussion with @DSuveges and we concluded that a solution along the lines of option 2 is probably the safest bet here.

The rationale is that we don't want to lose gene assignments for the main use of this data (colocalisation/MR). Having replicated signals for multiple genes producing the same gene product is probably safer than losing QTL assignments. The error we would be committing is that we don't know which of these genes is the one actually being transcribed; but even if a given one isn't, the signal would still point to another gene producing the same gene product.

To ensure this doesn't break anything, the best option would be to use a non-redundant studyIndexId. In that way, you would ensure that they are treated as independent studies (even if they come from the same summary stats). That would be safe downstream, but let us know if this causes any issues with the harmonisation.
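
For illustration, option 2 could look roughly like this at the study index level (the exact suffixing scheme and column names are assumptions):

from pyspark.sql import functions as f

# Illustrative only: one study index row per metadata entry (i.e. per gene).
# Appending the gene ID keeps study IDs unique even when several entries
# share the same summary stats file.
study_index_df = metadata_df.withColumn(
    "studyId",
    f.concat(
        f.lit("UKB_PPP_EUR_"),
        f.regexp_replace("UKBPPP_ProteinID", ":", "_"),
        f.lit("_"),
        f.col("ensembl_id"),
    ),
)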

tskir commented 3 months ago

Transformation into a Spark-compatible format complete

The data for all 2 957 studies can be found here: gs://gentropy-tmp/batch/output/ukb_ppp_eur. Each now has a unique study ID. The data is pre-partitioned by study and chromosome.

Next steps

I am now working on the final step, which is the actual data harmonisation pipeline. I expect to have test runs this week and some results next week.

tskir commented 2 months ago

Status update

A lot of the development and discussion on this issue has been happening internally, so just to provide a status update:

The variant annotation dataset was also updated by @DSuveges on 2024-04-18. I'm currently investigating the data against the updated dataset, as well as looking into including chromosome X data in addition to the autosomes.

tskir commented 2 months ago

Metrics and plots with the previous version of the variant annotation dataset

This is for comparison/baseline purposes only; I'll post the updated metrics with the new dataset version shortly.

Comments on metrics and plots

Metrics

Plots

Decision

For the v1 and v2 versions of the harmonised UKB PPP data, as discussed with @addramir and @Daniel-Considine, the decision was to harmonise all variant types except for [flip] [snp_c], which clearly display a lack of AF concordance.

addramir commented 2 months ago

@tskir for some reason the image plots don't work.

tskir commented 2 months ago

Metrics and plots with the new version of the variant annotation dataset

Notes

Metrics

As we can see, the metrics are essentially unchanged, which is good. This means the new harmonisation pipeline handles indels from the new variant annotation dataset well.

Plots

Conclusion

For autosomes, it looks like we are good to go with the updated variant annotation dataset with the same decision re: variant types as before.

tskir commented 2 months ago

@addramir Thank you! Fixed the plots in both comments now

addramir commented 2 months ago

Thank you! All looks good to me.

tskir commented 2 months ago

Checking up on chromosome X

UKB PPP data likes to spice things up a bit occasionally, so while in the file name the chromosome is called chrX (for example, discovery_chrX_VWF:P04275:OID20256:v1:Cardiometabolic.gz), inside the file the value for the chromosome name is actually recorded as 23. I didn't quite see this coming, so the original quickly prepared harmonised data (v1, v2) did not include chromosome X.

But this actually presented a good opportunity to plot it separately now and to verify that the allele frequencies behave in the same way as for the autosomes, which luckily they do. So for the upcoming v3 data release, I will proceed to add chrX and process it in the same way as the autosomes.
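
The remapping itself is trivial; a minimal sketch (the real pipeline may handle it differently):

from pyspark.sql import functions as f

# UKB PPP encodes chromosome X as "23" inside the files; remap it so it matches
# the "X" naming used by the variant index and the rest of the pipeline.
sumstats_df = sumstats_df.withColumn(
    "chromosome",
    f.when(f.col("chromosome") == "23", "X").otherwise(f.col("chromosome")),
)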

tskir commented 1 month ago

:zap: UKB PPP (EUR) harmonised data v3 is ready

List of changes compared to v2

General

Study index

Summary stats

Metrics for the sample study

As for all previous metrics, I'm using UKB_PPP_EUR_A1BG_P04217_OID30771_v1.

The first three metrics are unchanged because chrX variants were already present and counted at those stages previously; they were just labelled as chromosome 23, so they did not make it into the subsequent stages.

tskir commented 1 month ago

:zap: Clumping & locus collection is done

Results

Code and parameters used

Import

from gentropy.common.session import Session
from gentropy.method.window_based_clumping import WindowBasedClumping
from gentropy.dataset.summary_statistics import SummaryStatistics
from pyspark.sql.functions import col
from gentropy.dataset.study_locus import StudyLocus

Set up

session = Session()

Clump

%%time
# Load the harmonised UKB PPP (EUR) summary statistics and run window-based clumping.
path_sumstats = "gs://ukb_ppp_eur_data/summary_stats"
L = SummaryStatistics.from_parquet(session=session, path=path_sumstats)
# Significance threshold of 1.7e-11 (roughly 5e-8 / 2,957 studies); 500 kb clumping window.
slt = WindowBasedClumping.clump(L, gwas_significance=1.7e-11, distance=5e5)
slt.df.write.parquet("gs://ukb_ppp_eur_data/clumped")

Collect

%%time
# Reload the clumped loci and exclude those in the extended MHC region (chr6:25-34 Mb).
sl = StudyLocus.from_parquet(session=session, path="gs://ukb_ppp_eur_data/clumped")
df = sl.df
df = df.filter(~((col("chromosome") == "6") & (col("position").between(25000000, 34000000))))
df.count()
sl.df = df
# Collect the full locus (variants within 1 Mb of each lead variant) from the summary statistics.
wbc = sl.annotate_locus_statistics(summary_statistics=L, collect_locus_distance=1000000)
wbc.df.write.parquet("gs://ukb_ppp_eur_data/collected")
addramir commented 1 month ago

Thank you! We just briefly discussed this with @d0choa: could you please make a parametrisable step based on your clumping + collection code?
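
Something along these lines, perhaps (just a sketch of what we mean, not actual gentropy step code; the function name and defaults are made up):

from gentropy.common.session import Session
from gentropy.dataset.summary_statistics import SummaryStatistics
from gentropy.method.window_based_clumping import WindowBasedClumping
from pyspark.sql.functions import col

def clump_and_collect(
    session: Session,
    sumstats_path: str,
    output_path: str,
    gwas_significance: float = 1.7e-11,
    clump_distance: int = 500_000,
    collect_distance: int = 1_000_000,
    exclude_mhc: bool = True,
) -> None:
    """Hypothetical parametrised wrapper around the clumping + locus collection code above."""
    sumstats = SummaryStatistics.from_parquet(session=session, path=sumstats_path)
    sl = WindowBasedClumping.clump(sumstats, gwas_significance=gwas_significance, distance=clump_distance)
    if exclude_mhc:
        # Drop loci in the extended MHC region (chr6:25-34 Mb), as in the code above.
        sl.df = sl.df.filter(~((col("chromosome") == "6") & col("position").between(25_000_000, 34_000_000)))
    collected = sl.annotate_locus_statistics(summary_statistics=sumstats, collect_locus_distance=collect_distance)
    collected.df.write.parquet(output_path)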

tskir commented 1 week ago

Overview: full UKB PPP (EUR) ingestion & harmonisation code

I've now finished uploading all code that I have related to the UKB PPP (EUR) datasource. Currently, three stages are required for the entire processing pipeline.

1. Mirror

2. Preprocess

3. Ingest & harmonise