opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Harmonise UKB PPP data #3234

Closed tskir closed 1 week ago

tskir commented 3 months ago

Notes on data

Storage

Format

Processing approach

Scope

Harmonisation

ALLELE0 and ALLELE1 do not necessarily correspond to REF and ALT. Hence, we adopt the following algorithm:

Filtering

We will use hard filtering as described below, meaning that variants that don't pass the filters are fully discarded and will not be present in the output

After filtering, we expect to observe anywhere from 13 to 20 million SNPs

Special cases

tskir commented 3 months ago

CC @addramir @d0choa @DSuveges

DSuveges commented 3 months ago

Just double checking: when you refer to the variant index, it is the full GnomAD4 variant annotation table, right? These decisions sound solid; just confirming that we are dropping variants if they are interpreted on the reverse strand? (I think this is a fine choice)

Hard filtering is fine; however, as an exploratory step, it would be great to see some stats on:

These tests add some extra work; however, getting these numbers for at least one study would be very informative and would give us knowledge of what is actually happening.

After filtering, we expect to observe anywhere from 13 to 20 million SNPs

Just curious, what is the starting number?

tskir commented 3 months ago

Just double checking: when you refer to the variant index, it is the full GnomAD4 variant annotation table, right?

Thank you for bringing this up, this is actually something I also wanted to confirm with @addramir. I didn't work with the variant index before, so I'd appreciate the link to which exact dataset I should use.

These decisions sound solid; just confirming that we are dropping variants if they are interpreted on the reverse strand?

I think (@addramir might be better suited to confirm/deny this) that we expect all variants to be reported on the forward strand; the main issue is that sometimes the alleles will be flipped, so that ALLELE1 is the reference and ALLELE0 is the alternate. So we don't expect a lot of cases where neither orientation is found in the variant index and we have to drop the variant.
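
For illustration only (not the final pipeline code; column names such as beta and allele_frequency are assumptions), the orientation check could look roughly like this:

from pyspark.sql import functions as f

# Assuming `sumstats` and `variant_index` DataFrames that can be joined on chromosome and position.
joined = sumstats.join(variant_index, on=["chromosome", "position"])

# Variants where ALLELE0/ALLELE1 already match REF/ALT: keep as-is.
direct = joined.where((f.col("ALLELE0") == f.col("REF")) & (f.col("ALLELE1") == f.col("ALT")))

# Variants where the alleles are swapped: flip the effect direction and the allele frequency.
flipped = (
    joined.where((f.col("ALLELE0") == f.col("ALT")) & (f.col("ALLELE1") == f.col("REF")))
    .withColumn("beta", -f.col("beta"))
    .withColumn("allele_frequency", 1 - f.col("allele_frequency"))
)

# Variants matching neither orientation are dropped by construction.
harmonised = direct.unionByName(flipped)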

as an exploratory step, it would be great to see some stats on

I will definitely include the detailed statistics on how many variants were dropped for which reason, not just for one study but generally in the harmonisation logs. It's going to be useful for debugging and for downstream analysis.
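
As a tiny sketch of what those per-reason counts could look like (assuming a hypothetical intermediate DataFrame with a drop_reason column; not the actual logging code):

# Hypothetical: every input variant carries a label explaining why it was dropped (or kept).
drop_stats = annotated_sumstats.groupBy("drop_reason").count().orderBy("count", ascending=False)
drop_stats.show(truncate=False)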

Just curious, what is the starting number?

I've chosen a few random studies and the results were: 15 933 877, 15 933 061, 15 952 258.

d0choa commented 3 months ago

@tskir as for the gzipped files: this approach worked quite smoothly for the eQTL Catalogue. You are working with 10 times more data, so we will see... https://github.com/opentargets/gentropy/blob/dev/src/airflow/dags/eqtl_preprocess.py

tskir commented 3 months ago

@d0choa Unfortunately that template won't work here: it expects a folder with a bunch of .gz files, while what we have is a folder with a bunch of TARs, each of which has a bunch of .gz files inside of it.

But those templates are based on Dataflow, and since I worked with it recently, I should be able to write a simple pipeline to do the decompression, so thank you for the pointer either way!
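
For reference, the core extraction logic (independent of whether it ends up in Dataflow) is roughly this; a local sketch with placeholder paths:

import gzip
import shutil
import tarfile
from pathlib import Path

def extract_tar_of_gz(tar_path: str, out_dir: str) -> None:
    """Unpack a TAR archive and decompress every .gz member inside it."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not member.name.endswith(".gz"):
                continue
            gz_stream = tar.extractfile(member)
            target = out / Path(member.name).with_suffix("").name
            with gzip.open(gz_stream) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)

# Placeholder usage:
# extract_tar_of_gz("some_study.tar", "decompressed/")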

tskir commented 3 months ago

Three questions about populating the study index

  1. For pQTL studies like UKB PPP, what do we put into the traitFromSource / traitFromSourceMappedIds fields? Do we just not populate them? Do we put some term for “protein expression / levels”? I distinctly remember there used to be a discussion about this, but I don't think we decided on anything.
  2. For the studyId field, do I just use the original ID with the UKB_PPP prefix? The result would look like this: UKB_PPP_AARSD1_Q9BTE6_OID21311_v1.
  3. For each study, the original metadata also contains panel information, for example “Oncology” or “Inflammation”. Do we need this information at all? If so, where do I put it? I don't think there is a field for it, so it might need to be created if we want this.
addramir commented 3 months ago
  1. I thought it should be like the name of the file or the name of the GWAS from their metadata file, without adding anything else (like "levels" or similar). What is "traitFromSourceMappedIds", btw?
  2. Hm, I would use something like "UKB_PPP_EUR_{target_id}", where EUR is the population we are ingesting.
  3. I would drop it.
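
In other words, something like this (purely illustrative, assuming target_id corresponds to the UKBPPP_ProteinID with the colons replaced):

# Illustrative only: build a study ID from the source protein ID and the population.
protein_id = "AARSD1:Q9BTE6:OID21311:v1"  # UKBPPP_ProteinID from the metadata
study_id = "UKB_PPP_EUR_" + protein_id.replace(":", "_")
# -> "UKB_PPP_EUR_AARSD1_Q9BTE6_OID21311_v1"
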
tskir commented 3 months ago

@addramir

2, 3 — thank you, got it

  1. I thought it should be like the name of the file or the name of the GWAS from their metadata file, without adding anything else (like "levels" or similar). What is "traitFromSourceMappedIds", btw?

The problem is that there is no trait explicitly specified. Here are all of the metadata fields for one example study:

UKBPPP_ProteinID AARSD1:Q9BTE6:OID21311:v1
olink_target_fullname Alanyl-tRNA editing protein Aarsd1
OlinkID OID21311
UniProt Q9BTE6
Assay AARSD1
Panel Oncology
Panel_Lot_Nr B04412
UniProt2 Q9BTE6
HGNC.symbol AARSD1
ensembl_id ENSG00000266967
chr 17
gene_start 42950526
gene_end 42964498
Strand -1
Dilution_factor 1:1
block B
expansion no

In theory, traitFromSource should contain the original description of the trait being measured (which is lacking in this case), and traitFromSourceMappedIds would contain one (or many) mapped ontology IDs representing the same trait. So for regular GWASes, traitFromSource might be Alzheimer's disease and traitFromSourceMappedIds could be [EFO:1001870].

Here, the implied trait is “expression of a (given) protein”, which I'm not sure how to represent correctly, especially given that we already have the separate "gene" column in the study index. Since the context should be clear from the study type (pQTL) plus the gene, we might just drop those fields and it would be fine.

@d0choa Do you have any thoughts on this?

addramir commented 3 months ago

I see. Well, my guess is that traitFromSource should be "olink_target_fullname" and traitFromSourceMappedIds should be the same for all: http://www.ebi.ac.uk/efo/EFO_0007937. Don't forget that we need a column that represents the target_id and links the protein and the gene (ensembl_id?); I don't know what the name for this column is in the study index.
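
Roughly, the mapping could then look like this (illustrative only; the actual study index schema and helper code may differ):

from pyspark.sql import functions as f

# Illustrative only: metadata_df holds the UKB PPP metadata fields shown above.
study_index_df = metadata_df.select(
    f.concat(f.lit("UKB_PPP_EUR_"), f.regexp_replace("UKBPPP_ProteinID", ":", "_")).alias("studyId"),
    f.lit("UKB_PPP").alias("projectId"),
    f.lit("pqtl").alias("studyType"),
    f.col("olink_target_fullname").alias("traitFromSource"),
    f.col("ensembl_id").alias("geneId"),
)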

d0choa commented 3 months ago

Without going into detail, we want to be consistent with what we have in the eQTL Catalogue pQTL (Sun et al.) code.

In [14]: df.filter(f.col("studyType") == "pqtl").show(3, vertical=True, truncate = False)
-RECORD 0------------------------------------------------------------------------------------------
 studyId              | Sun_2018_plasma_HLA.DQA2.7757.5.3..1
 projectId            | Sun_2018
 studyType            | pqtl
 traitFromSource      | HLA.DQA2.7757.5.3..1
 geneId               | ENSG00000237541
 tissueFromSourceId   | UBERON_0001969
 nSamples             | 3301
 summarystatsLocation | https://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/QTS000035/QTD000584
 hasSumstats          | true
-RECORD 1------------------------------------------------------------------------------------------
 studyId              | Sun_2018_plasma_TAC1.9337.43.3..1
 projectId            | Sun_2018
 studyType            | pqtl
 traitFromSource      | TAC1.9337.43.3..1
 geneId               | ENSG00000006128
 tissueFromSourceId   | UBERON_0001969
 nSamples             | 3301
 summarystatsLocation | https://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/QTS000035/QTD000584
 hasSumstats          | true
-RECORD 2------------------------------------------------------------------------------------------
 studyId              | Sun_2018_plasma_IL5RA.13686.2.3..1
 projectId            | Sun_2018
 studyType            | pqtl
 traitFromSource      | IL5RA.13686.2.3..1
 geneId               | ENSG00000091181
 tissueFromSourceId   | UBERON_0001969
 nSamples             | 3301
 summarystatsLocation | https://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/QTS000035/QTD000584
 hasSumstats          | true
only showing top 3 rows

The critical fields are:

Knowing the exact protein isoform identifiers (e.g. the Olink ID or UniProt accession) is nice to have but not critical. You can freely play with traitFromSource, or even concatenate several IDs, as long as the result stays readable.

As you can see, traitFromSourceMappedIds is not populated. That's fine because we don't use QTLs to derive Target-Disease/Phenotype evidence, only variant/locus to gene.

tskir commented 3 months ago

@d0choa @addramir Test runs revealed two further problems with the data. One is very easy to handle and the other is quite complex, so I'll post the latter as a separate comment.

1. One entry from metadata table does not correspond to a summary stats file

For all ancestries (I'm currently ingesting just Europeans but checked all of them), all but one entity in the original metadata table corresponds to a summary stats file. Likewise, all summary stats files have a corresponding metadata entity (some more than one — more on that below!)

However, the GLIPR1:P48060:OID31517:v1 entry in the metadata table does not correspond to any summary stats files. There's not much to do other than manually mark and skip it, which I will do.
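
For the record, the skip itself is a one-liner (a sketch, assuming the metadata is loaded as a Spark DataFrame named metadata_df):

from pyspark.sql import functions as f

# Drop the single metadata entry that has no corresponding summary stats file.
metadata_df = metadata_df.filter(f.col("UKBPPP_ProteinID") != "GLIPR1:P48060:OID31517:v1")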

tskir commented 3 months ago

2. Some summary stats correspond to a measurement of a mixture of proteins

For some summary stats files (to be exact, 15 of them) the metadata table contains more than one entry.

Example 1: α-Amylase

UKBPPP_ProteinID olink_target_fullname OlinkID UniProt Assay Panel Panel_Lot_Nr UniProt2 HGNC.symbol ensembl_id chr gene_start gene_end Strand Dilution_factor block expansion
AMY1A_AMY1B_AMY1C:P0DUB6_P0DTE7_P0DTE8:OID30707:v1 Alpha-amylase 1A_Alpha-amylase 1B_Alpha-amylase 1C OID30707 P0DUB6_P0DTE7_P0DTE8 AMY1A_AMY1B_AMY1C Inflammation_II B20704 P0DUB6 AMY1A ENSG00000237763 1 103655760 103664554 1 17:40 C yes
AMY1A_AMY1B_AMY1C:P0DUB6_P0DTE7_P0DTE8:OID30707:v1 Alpha-amylase 1A_Alpha-amylase 1B_Alpha-amylase 1C OID30707 P0DUB6_P0DTE7_P0DTE8 AMY1A_AMY1B_AMY1C Inflammation_II B20704 P0DTE7 AMY1B ENSG00000174876 1 103687415 103696454 -1 17:40 C yes
AMY1A_AMY1B_AMY1C:P0DUB6_P0DTE7_P0DTE8:OID30707:v1 Alpha-amylase 1A_Alpha-amylase 1B_Alpha-amylase 1C OID30707 P0DUB6_P0DTE7_P0DTE8 AMY1A_AMY1B_AMY1C Inflammation_II B20704 P0DTE8 AMY1C ENSG00000187733 1 103745323 103758726 1 17:40 C yes

My understanding is that, in the assay, they are measuring total alpha amylase, which in reality consists of three subunits coded by three genes, and so they report it this way.

Example 2: Interleukin-12

Note that, in the first example, all subunits are on the same chromosome and even roughly in the same region. This is true for all but one example. But for Interleukin-12, the two subunits are on different chromosomes:

UKBPPP_ProteinID olink_target_fullname OlinkID UniProt Assay Panel Panel_Lot_Nr UniProt2 HGNC.symbol ensembl_id chr gene_start gene_end Strand Dilution_factor block expansion
IL12A_IL12B:P29459_P29460:OID21327:v1 Interleukin-12 OID21327 P29459_P29460 IL12A_IL12B Oncology B04412 P29459 IL12A ENSG00000168811 3 159988835 159996019 1 1:1 B no
IL12A_IL12B:P29459_P29460:OID21327:v1 Interleukin-12 OID21327 P29459_P29460 IL12A_IL12B Oncology B04412 P29460 IL12B ENSG00000113302 5 159314780 159330487 -1 1:1 B no

All affected targets

Alpha-amylase 1A_Alpha-amylase 1B_Alpha-amylase 1C
Beta-defensin 103
Beta-defensin 104
Beta-defensin 4A
BolA-like protein 2
Cancer/testis antigen 1
Choriogonadotropin subunit beta 3
Creatine kinase U-type, mitochondrial
Galactoside 3_5
Galectin-7
Interleukin-12
Interleukin-27
MHC class I polypeptide-related sequence B_A
Neutrophil defensin 1
Sperm acrosome-associated protein 5

Possible solutions

  1. Simply ignore those 15 targets for now and do not ingest
  2. Create duplicate study index entries, so that N entries in the original metadata table = N entries in the study index table, all linking to the same summary stats file. This will be problematic because my understanding is that we have always assumed that 1 study index record = 1 summary stats file, which are linked by study ID, and this will almost certainly break something.
  3. Cram all information into a single study index entry. This will be even more problematic because, for one, we currently assume that 1 study index entry for pQTL = 1 gene, which will no longer be the case.
  4. For each set of “duplicated” metadata table entries, keep only one. This leads to loss of information and potential inconsistencies. We'll also need to decide how to choose which gene to keep.
  5. Massively rework study index / summary stats structure and their relationship (just for 15 genes in one study?..)

TBH none of those solutions strike me as very good, so @d0choa @addramir I'd love to hear your opinions on this

d0choa commented 3 months ago

I had a brief discussion with @DSuveges and we concluded that a solution along the lines of option 2 is probably the safest bet here.

The rationale is that we don't want to lose gene assignments for the main use of this data (colocalisation/MR). Having replicated signals for multiple genes producing the same gene product is probably safer than losing QTL assignments. The error we would be committing is that we don't know which of these genes is the one actually being transcribed; but even if a given one isn't, the signal would still point to another gene producing the same gene product.

To ensure this doesn't break anything, the best option would be to use a non-redundant studyIndexId. In that way, you would ensure that they are treated as independent studies (even if they come from the same summary stats). That would be safe downstream, but let us know if this causes any issues with the harmonisation.
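
For illustration, option 2 could look roughly like this at the study index level (the exact suffixing scheme and column names are assumptions):

from pyspark.sql import functions as f

# Illustrative only: one study index row per metadata entry (i.e. per gene).
# Appending the gene ID keeps study IDs unique even when several entries
# share the same summary stats file.
study_index_df = metadata_df.withColumn(
    "studyId",
    f.concat(
        f.lit("UKB_PPP_EUR_"),
        f.regexp_replace("UKBPPP_ProteinID", ":", "_"),
        f.lit("_"),
        f.col("ensembl_id"),
    ),
)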

tskir commented 3 months ago

Transformation into a Spark-compatible format complete

The data for all 2 957 studies can be found here: gs://gentropy-tmp/batch/output/ukb_ppp_eur. Each now has a unique study ID. The data is pre-partitioned by study and chromosome.

Next steps

I am now working on the final step, which is the actual data harmonisation pipeline. I expect to have test runs this week and some results next week.

tskir commented 2 months ago

Status update

A lot of the development and discussion on this issue has been happening internally, so just to provide a status update:

The variant annotation dataset was also updated by @DSuveges on 2024-04-18. I'm currently investigating the data against the updated dataset, as well as looking into including chromosome X data in addition to the autosomes.

tskir commented 2 months ago

Metrics and plots with the previous version of the variant annotation dataset

This is for comparison/baseline purposes only; I'll post the updated metrics with the new dataset version shortly.

Comments on metrics and plots

Metrics

Plots

Decision

For the v1 and v2 versions of the harmonised UKB PPP data, as discussed with @addramir and @Daniel-Considine, the decision was to harmonise all variant types except for [flip] [snp_c], which clearly display a lack of AF concordance.

addramir commented 2 months ago

@tskir for some reason the image plots don't work.

tskir commented 2 months ago

Metrics and plots with the new version of the variant annotation dataset

Notes

Metrics

As we can see, the metrics are essentially unchanged, which is good. This means the new harmonisation pipeline handles indels from the new variant annotation dataset well.

Plots

Conclusion

For autosomes, it looks like we are good to go with the updated variant annotation dataset with the same decision re: variant types as before.

tskir commented 2 months ago

@addramir Thank you! Fixed the plots in both comments now

addramir commented 2 months ago

Thank you! All looks good to me.

tskir commented 2 months ago

Checking up on chromosome X

UKB PPP data likes to spice things up a bit occasionally, so while in the file name the chromosome is called chrX (for example, discovery_chrX_VWF:P04275:OID20256:v1:Cardiometabolic.gz), inside the file the value for the chromosome name is actually recorded as 23. I didn't quite see this coming, so the original quickly prepared harmonised data (v1, v2) did not include chromosome X.

But this actually presented a good opportunity to plot it separately now and to verify that the allele frequencies behave in the same way as for the autosomes, which luckily they do. So for the upcoming v3 data release, I will proceed to add chrX and process it in the same way as the autosomes.
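
The remapping itself is trivial; a minimal sketch (the real pipeline may handle it differently):

from pyspark.sql import functions as f

# UKB PPP encodes chromosome X as "23" inside the files; remap it so it matches
# the "X" naming used by the variant index and the rest of the pipeline.
sumstats_df = sumstats_df.withColumn(
    "chromosome",
    f.when(f.col("chromosome") == "23", "X").otherwise(f.col("chromosome")),
)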

tskir commented 1 month ago

:zap: UKB PPP (EUR) harmonised data v3 is ready

List of changes compared to v2

General

Study index

Summary stats

Metrics for the sample study

As for all previous metrics, I'm using UKB_PPP_EUR_A1BG_P04217_OID30771_v1.

The first three metrics are unchanged because chrX variants were already present and counted at those stages previously; they were just labelled as chromosome 23, so they did not make it into the subsequent stages.

tskir commented 1 month ago

:zap: Clumping & locus collection is done

Results

Code and parameters used

Import

from gentropy.common.session import Session
from gentropy.method.window_based_clumping import WindowBasedClumping
from gentropy.dataset.summary_statistics import SummaryStatistics
from pyspark.sql.functions import col
from gentropy.dataset.study_locus import StudyLocus

Set up

session = Session()

Clump

%%time
# Load the harmonised UKB PPP (EUR) summary statistics and run window-based clumping.
path_sumstats = "gs://ukb_ppp_eur_data/summary_stats"
L = SummaryStatistics.from_parquet(session=session, path=path_sumstats)
# Significance threshold of 1.7e-11 (roughly 5e-8 / 2,957 studies); 500 kb clumping window.
slt = WindowBasedClumping.clump(L, gwas_significance=1.7e-11, distance=5e5)
slt.df.write.parquet("gs://ukb_ppp_eur_data/clumped")

Collect

%%time
# Reload the clumped loci and exclude those in the extended MHC region (chr6:25-34 Mb).
sl = StudyLocus.from_parquet(session=session, path="gs://ukb_ppp_eur_data/clumped")
df = sl.df
df = df.filter(~((col("chromosome") == "6") & (col("position").between(25000000, 34000000))))
df.count()
sl.df = df
# Collect the full locus (variants within 1 Mb of each lead variant) from the summary statistics.
wbc = sl.annotate_locus_statistics(summary_statistics=L, collect_locus_distance=1000000)
wbc.df.write.parquet("gs://ukb_ppp_eur_data/collected")
addramir commented 1 month ago

Thank you! We just briefly discussed this with @d0choa: could you please make a parametrisable step based on your clumping + collection code?
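
Something along these lines, perhaps (just a sketch of what we mean, not actual gentropy step code; the function name and defaults are made up):

from gentropy.common.session import Session
from gentropy.dataset.summary_statistics import SummaryStatistics
from gentropy.method.window_based_clumping import WindowBasedClumping
from pyspark.sql.functions import col

def clump_and_collect(
    session: Session,
    sumstats_path: str,
    output_path: str,
    gwas_significance: float = 1.7e-11,
    clump_distance: int = 500_000,
    collect_distance: int = 1_000_000,
    exclude_mhc: bool = True,
) -> None:
    """Hypothetical parametrised wrapper around the clumping + locus collection code above."""
    sumstats = SummaryStatistics.from_parquet(session=session, path=sumstats_path)
    sl = WindowBasedClumping.clump(sumstats, gwas_significance=gwas_significance, distance=clump_distance)
    if exclude_mhc:
        # Drop loci in the extended MHC region (chr6:25-34 Mb), as in the code above.
        sl.df = sl.df.filter(~((col("chromosome") == "6") & col("position").between(25_000_000, 34_000_000)))
    collected = sl.annotate_locus_statistics(summary_statistics=sumstats, collect_locus_distance=collect_distance)
    collected.df.write.parquet(output_path)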

tskir commented 1 week ago

Overview: full UKB PPP (EUR) ingestion & harmonisation code

I've now finished uploading all code that I have related to the UKB PPP (EUR) datasource. Currently, three stages are required for the entire processing pipeline.

1. Mirror

2. Preprocess

3. Ingest & harmonise