Closed tskir closed 1 week ago
CC @addramir @d0choa @DSuveges
Just double checking, when you refer to variant index, it is the full GnomAD4 variant annotation table, right? These decisions sound solid, just confirming we are dropping variants if they are interpreted on the reverse strand? (I think this is a fine choice)
Hard filtering is fine, however as an exploratory step, it would be great to see some stats on:
These tests add some extra work; however, getting these numbers for at least one study would be very informative and would tell us what is actually happening.
After filtering, we expect to observe anywhere from 13 to 20 million SNPs
Just curious, what is the starting number?
Just double checking, when you refer to variant index, it is the full GnomAD4 variant annotation table, right?
Thank you for bringing this up, this is actually something I also wanted to confirm with @addramir. I didn't work with the variant index before, so I'd appreciate the link to which exact dataset I should use.
These decisions sound solid, just confirming we are dropping variants if they are interpreted on the reverse strand?
I think (@addramir also might be better suited to confirm/deny this) that we expect all variants to be reported on the forward strand; the main issue is that sometimes alleles will be flipped so that ALLELE1 is the reference and ALLELE0 is the alternate. So we don't expect a lot of cases where neither ID is found in the variant index and we have to drop the variant.
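To make the lookup logic concrete, here is a minimal plain-Python sketch (not the actual gentropy code) of checking both allele orientations against the variant index. The variant index is mocked as a set of `chrom_pos_ref_alt` identifiers; the function name and ID scheme are hypothetical.

```python
# Illustrative sketch: decide how a summary statistics row maps onto the
# variant index when ALLELE0/ALLELE1 may be swapped relative to REF/ALT.
def resolve_orientation(chrom, pos, allele0, allele1, variant_index):
    """Return 'direct', 'flipped', or None if the variant is not found."""
    direct = f"{chrom}_{pos}_{allele0}_{allele1}"
    flipped = f"{chrom}_{pos}_{allele1}_{allele0}"
    if direct in variant_index:
        return "direct"    # ALLELE0 is REF, ALLELE1 is ALT
    if flipped in variant_index:
        return "flipped"   # alleles swapped; effect direction needs flipping
    return None            # neither orientation is known: drop the variant

variant_index = {"1_1000_A_G", "2_2000_C_T"}
assert resolve_orientation("1", 1000, "A", "G", variant_index) == "direct"
assert resolve_orientation("2", 2000, "T", "C", variant_index) == "flipped"
assert resolve_orientation("3", 3000, "A", "C", variant_index) is None
```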
as an exploratory step, it would be great to see some stats on
I will definitely include the detailed statistics on how many variants dropped for which reason, not just for one study but generally in the harmonisation logs. It's going to be useful for debugging and for downstream analysis.
Just curious, what is the starting number?
I've chosen a few random studies and the results were: 15 933 877, 15 933 061, 15 952 258.
@tskir as for the gzipped files. This worked quite smoothly for the eqtl catalogue. You are working with 10 times more data, so we will see... https://github.com/opentargets/gentropy/blob/dev/src/airflow/dags/eqtl_preprocess.py
@d0choa Unfortunately that template won't work here: it expects a folder with a bunch of .gz files, while what we have is a folder with a bunch of TARs, each of which has a bunch of .gz files inside of it.
But those templates are based on Dataflow, and since I worked with it recently, I should be able to write a simple pipeline to do the decompression, so thank you for the pointer either way!
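The core of that decompression step could look like the following plain-Python sketch (the real pipeline would run this inside a Dataflow job; function and member names here are hypothetical): each input TAR contains many .gz members, which are extracted and decompressed individually.

```python
import gzip
import io
import tarfile

def extract_gz_members(tar_bytes):
    """Yield (member_name, decompressed_bytes) for every .gz member of a TAR."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".gz"):
                payload = tar.extractfile(member).read()
                yield member.name, gzip.decompress(payload)

# Round-trip demo: build an in-memory TAR with a single gzipped member
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    payload = gzip.compress(b"variant data")
    info = tarfile.TarInfo(name="block_1.gz")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
extracted = dict(extract_gz_members(buf.getvalue()))
assert extracted == {"block_1.gz": b"variant data"}
```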
What about the traitFromSource / traitFromSourceMappedIds fields? Do we just not populate them? Do we put some term for “protein expression / levels”? I distinctly remember there used to be a discussion about this but I don't think we decided on anything.
For the studyId field, do I just use the original ID and the UKB_PPP prefix? The result would look like this: UKB_PPP_AARSD1_Q9BTE6_OID21311v1.
@addramir
2, 3 — thank you, got it
- I thought it should be like the name of the file or the name of the GWAS from their metadata file, without adding anything else (like "levels" or similar). What is "traitFromSourceMappedIds" btw?
The problem is that there is no trait explicitly specified. Here are all of the metadata fields for one example study:
UKBPPP_ProteinID | AARSD1:Q9BTE6:OID21311:v1 |
---|---|
olink_target_fullname | Alanyl-tRNA editing protein Aarsd1 |
OlinkID | OID21311 |
UniProt | Q9BTE6 |
Assay | AARSD1 |
Panel | Oncology |
Panel_Lot_Nr | B04412 |
UniProt2 | Q9BTE6 |
HGNC.symbol | AARSD1 |
ensembl_id | ENSG00000266967 |
chr | 17 |
gene_start | 42950526 |
gene_end | 42964498 |
Strand | -1 |
Dilution_factor | 1:1 |
block | B |
expansion | no |
In theory, traitFromSource should contain the original description of a trait being measured (which it is lacking in this case), and traitFromSourceMappedIds would contain one (or many) mapped ontology IDs representing the same trait. So for regular GWASes, traitFromSource might be “Alzheimer's disease” and traitFromSourceMappedIds could be [EFO:1001870].
Here, the implied trait is “expression of a (given protein)”, which I'm not sure how to correctly represent. Especially given that we already have the separate "gene" column in the study index. Since the context should be clear from the type of study (pQTL) + gene, we might just drop those fields and it would be fine.
@d0choa Do you have any thoughts on this?
I see. Well, my guess is that traitFromSource should be "olink_target_fullname" and traitFromSourceMappedIds should be the same for all: http://www.ebi.ac.uk/efo/EFO_0007937. Don't forget we need a column that represents the target ID and links the protein and the gene (ensembl_id?); I don't know what the name for this column is in study_index.
Without going into detail, we want to be consistent with what we have for the eQTL Catalogue pQTLs (Sun et al.) code.
In [14]: df.filter(f.col("studyType") == "pqtl").show(3, vertical=True, truncate = False)
-RECORD 0------------------------------------------------------------------------------------------
studyId | Sun_2018_plasma_HLA.DQA2.7757.5.3..1
projectId | Sun_2018
studyType | pqtl
traitFromSource | HLA.DQA2.7757.5.3..1
geneId | ENSG00000237541
tissueFromSourceId | UBERON_0001969
nSamples | 3301
summarystatsLocation | https://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/QTS000035/QTD000584
hasSumstats | true
-RECORD 1------------------------------------------------------------------------------------------
studyId | Sun_2018_plasma_TAC1.9337.43.3..1
projectId | Sun_2018
studyType | pqtl
traitFromSource | TAC1.9337.43.3..1
geneId | ENSG00000006128
tissueFromSourceId | UBERON_0001969
nSamples | 3301
summarystatsLocation | https://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/QTS000035/QTD000584
hasSumstats | true
-RECORD 2------------------------------------------------------------------------------------------
studyId | Sun_2018_plasma_IL5RA.13686.2.3..1
projectId | Sun_2018
studyType | pqtl
traitFromSource | IL5RA.13686.2.3..1
geneId | ENSG00000091181
tissueFromSourceId | UBERON_0001969
nSamples | 3301
summarystatsLocation | https://ftp.ebi.ac.uk/pub/databases/spot/eQTL/sumstats/QTS000035/QTD000584
hasSumstats | true
only showing top 3 rows
The critical fields are:
- studyId needs to be unique
- studyType needs to be pqtl
- geneId will be used for L2G
- tissueFromSourceId, which in your case will always be UBERON_0001969
Knowing the protein isoforms (e.g. Olink ID or UniProt accession) is nice to have but not critical. You can freely play with traitFromSource or even concatenate several IDs in a way that they are readable.
As you can see, traitFromSourceMappedIds is not populated. That's fine because we don't use QTLs to derive Target-Disease/Phenotype evidence, only variant/locus to gene.
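Putting the critical fields together, a UKB PPP study index row could look like the sketch below. This is purely illustrative: the field names follow the Sun_2018 records shown above, and the study ID scheme is an assumption modelled on IDs used later in this thread.

```python
# Illustrative only: build a minimal pQTL study index row covering the
# critical fields (unique studyId, studyType, geneId, tissueFromSourceId).
def make_study_row(assay, uniprot, olink_id, version, gene_id, n_samples):
    return {
        "studyId": f"UKB_PPP_EUR_{assay}_{uniprot}_{olink_id}_{version}",  # unique
        "projectId": "UKB_PPP",
        "studyType": "pqtl",
        "geneId": gene_id,                       # used for L2G
        "tissueFromSourceId": "UBERON_0001969",  # blood plasma, fixed for UKB PPP
        "nSamples": n_samples,                   # placeholder value below
    }

row = make_study_row("AARSD1", "Q9BTE6", "OID21311", "v1", "ENSG00000266967", 0)
assert row["studyId"] == "UKB_PPP_EUR_AARSD1_Q9BTE6_OID21311_v1"
assert row["studyType"] == "pqtl"
```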
@d0choa @addramir Test runs revealed two further problems with the data. One is very easy and the other is quite complex, so I'll post it as a separate comment.
For all ancestries (I'm now ingesting just Europeans but checked for all), all but one entity in the original metadata table corresponds to a summary stats file. Likewise, all summary stats files have a corresponding metadata entity (some more than one — more on that below!)
However, the GLIPR1:P48060:OID31517:v1 entry in the metadata table does not correspond to any summary stats file. There's not much to do other than manually mark and skip it, which I will do.
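The manual skip could be as simple as the sketch below (the constant and function names are my own; only the entry ID and the UKBPPP_ProteinID column come from the data).

```python
# Sketch: drop metadata entries known to have no matching summary stats file.
ENTRIES_WITHOUT_SUMSTATS = {"GLIPR1:P48060:OID31517:v1"}

def filter_metadata(entries):
    """Keep only metadata entries that have a summary stats file."""
    return [e for e in entries
            if e["UKBPPP_ProteinID"] not in ENTRIES_WITHOUT_SUMSTATS]

meta = [
    {"UKBPPP_ProteinID": "AARSD1:Q9BTE6:OID21311:v1"},
    {"UKBPPP_ProteinID": "GLIPR1:P48060:OID31517:v1"},
]
kept = [e["UKBPPP_ProteinID"] for e in filter_metadata(meta)]
assert kept == ["AARSD1:Q9BTE6:OID21311:v1"]
```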
For some summary stats files (to be exact, 15 of them) the metadata table contains more than one entry.
UKBPPP_ProteinID | olink_target_fullname | OlinkID | UniProt | Assay | Panel | Panel_Lot_Nr | UniProt2 | HGNC.symbol | ensembl_id | chr | gene_start | gene_end | Strand | Dilution_factor | block | expansion |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AMY1A_AMY1B_AMY1C:P0DUB6_P0DTE7_P0DTE8:OID30707:v1 | Alpha-amylase 1A_Alpha-amylase 1B_Alpha-amylase 1C | OID30707 | P0DUB6_P0DTE7_P0DTE8 | AMY1A_AMY1B_AMY1C | Inflammation_II | B20704 | P0DUB6 | AMY1A | ENSG00000237763 | 1 | 103655760 | 103664554 | 1 | 17:40 | C | yes |
AMY1A_AMY1B_AMY1C:P0DUB6_P0DTE7_P0DTE8:OID30707:v1 | Alpha-amylase 1A_Alpha-amylase 1B_Alpha-amylase 1C | OID30707 | P0DUB6_P0DTE7_P0DTE8 | AMY1A_AMY1B_AMY1C | Inflammation_II | B20704 | P0DTE7 | AMY1B | ENSG00000174876 | 1 | 103687415 | 103696454 | -1 | 17:40 | C | yes |
AMY1A_AMY1B_AMY1C:P0DUB6_P0DTE7_P0DTE8:OID30707:v1 | Alpha-amylase 1A_Alpha-amylase 1B_Alpha-amylase 1C | OID30707 | P0DUB6_P0DTE7_P0DTE8 | AMY1A_AMY1B_AMY1C | Inflammation_II | B20704 | P0DTE8 | AMY1C | ENSG00000187733 | 1 | 103745323 | 103758726 | 1 | 17:40 | C | yes |
My understanding is that, in the assay, they are measuring total alpha amylase, which in reality consists of three subunits coded by three genes, and so they report it this way.
Note that, in the first example, all subunits are on the same chromosome and even roughly in the same region. This is true for all but one example. But for Interleukin-12, its two subunits are on different chromosomes:
UKBPPP_ProteinID | olink_target_fullname | OlinkID | UniProt | Assay | Panel | Panel_Lot_Nr | UniProt2 | HGNC.symbol | ensembl_id | chr | gene_start | gene_end | Strand | Dilution_factor | block | expansion |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IL12A_IL12B:P29459_P29460:OID21327:v1 | Interleukin-12 | OID21327 | P29459_P29460 | IL12A_IL12B | Oncology | B04412 | P29459 | IL12A | ENSG00000168811 | 3 | 159988835 | 159996019 | 1 | 1:1 | B | no |
IL12A_IL12B:P29459_P29460:OID21327:v1 | Interleukin-12 | OID21327 | P29459_P29460 | IL12A_IL12B | Oncology | B04412 | P29460 | IL12B | ENSG00000113302 | 5 | 159314780 | 159330487 | -1 | 1:1 | B | no |
- Alpha-amylase 1A_Alpha-amylase 1B_Alpha-amylase 1C
- Beta-defensin 103
- Beta-defensin 104
- Beta-defensin 4A
- BolA-like protein 2
- Cancer/testis antigen 1
- Choriogonadotropin subunit beta 3
- Creatine kinase U-type, mitochondrial
- Galactoside 3_5
- Galectin-7
- Interleukin-12
- Interleukin-27
- MHC class I polypeptide-related sequence B_A
- Neutrophil defensin 1
- Sperm acrosome-associated protein 5
TBH none of those solutions strike me as very good, so @d0choa @addramir I'd love to hear your opinions on this
I had a brief discussion with @DSuveges and we concluded a solution around #2 is probably the safest bet here.
The rationale is that we don't want to lose gene assignments for the main use of this data (colocalisation/MR). Having replicated signals for multiple genes producing the same gene product is probably safer than losing QTL assignments. The mistake we will be committing is that we don't know if each of these genes is the one being transcribed but, if not, it would be another gene producing the same gene product.
To ensure this doesn't break anything, the best option would be to use a non-redundant studyIndexId. In that way, you would ensure that they are treated as independent studies (even if they come from the same summary stats). That would be safe downstream, but let us know if this causes any issues with the harmonisation.
The data for all 2 957 studies can be found here: gs://gentropy-tmp/batch/output/ukb_ppp_eur. Each now has a unique study ID. The data is pre-partitioned by study and chromosome.
I am now working on the final action which is the actual data harmonisation pipeline. I expect to have test runs this week and some results next week.
A lot of the developments and discussions have been happening internally on this issue, so just to provide a status update:
The variant annotation dataset was also updated by @DSuveges on 2024-04-18. I'm currently investigating the data against the updated dataset, as well as looking into including chromosome X data in addition to the autosomes.
This is for comparison/baseline purposes only, I'll post the updated metrics with the new dataset version shortly.
For the v1 and v2 versions of the harmonised UKB PPP data, as discussed with @addramir and @Daniel-Considine, the decision was to harmonise all variant types except for [flip] [snp_c], which clearly display a lack of AF concordance.
@tskir for some reason the image plot doesn't work.
As we can see, the metrics are essentially unchanged, which is good. This means the new harmonisation pipeline handles indels from the new variant annotation dataset well.
For autosomes, it looks like we are good to go with the updated variant annotation dataset with the same decision re: variant types as before.
@addramir Thank you! Fixed the plots in both comments now
Thank you! All looks good to me.
UKB PPP data likes to spice things up a bit occasionally: while in the file name the chromosome is called chrX (for example, discovery_chrX_VWF:P04275:OID20256:v1:Cardiometabolic.gz), inside the file the chromosome name is actually recorded as 23. I didn't quite see this coming, so the original quickly prepared harmonised data (v1, v2) did not include chromosome X.
But this actually presented a good opportunity to plot it separately now and to verify that the allele frequencies behave in the same way as for the autosomes, which luckily they do. So for the upcoming v3 data release, I will add chrX and process it in the same way as the autosomes.
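The relabelling itself is tiny; a minimal sketch, assuming "23" is the only non-standard label in the UKB PPP files (function name is mine):

```python
# Map UKB PPP's in-file chromosome labels to standard names:
# the files record chromosome X as "23".
def normalise_chromosome(chrom: str) -> str:
    return "X" if chrom == "23" else chrom

assert normalise_chromosome("23") == "X"
assert normalise_chromosome("7") == "7"
```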
neglog_pvalue_to_mantissa_and_exponent on the LOG10P column. As for all previous metrics, I'm using UKB_PPP_EUR_A1BG_P04217_OID30771_v1.
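Numerically, the idea behind that conversion can be illustrated in plain Python (this is my own sketch of the concept, not the gentropy implementation): split -log10(p) into a mantissa in [1, 10) and an integer exponent, so that p = mantissa * 10**exponent can represent p-values that would underflow a float to 0.0.

```python
import math

def neglog10p_to_mantissa_exponent(neglog10_p: float):
    """Split -log10(p) into (mantissa, exponent) with 1 <= mantissa < 10."""
    exponent = math.floor(-neglog10_p)           # e.g. 8.45 -> exponent -9
    mantissa = 10 ** (-neglog10_p - exponent)    # e.g. 10**0.55, about 3.55
    return mantissa, exponent

m, e = neglog10p_to_mantissa_exponent(8.45)
assert e == -9 and 1 <= m < 10
assert abs(m * 10 ** e - 10 ** -8.45) < 1e-20
```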
The first three metrics are unchanged because previously, chrX variants were present at these stages and were counted, but they were labeled as chromosome 23, so they didn't go into subsequent stages.
```python
from gentropy.common.session import Session
from gentropy.method.window_based_clumping import WindowBasedClumping
from gentropy.dataset.summary_statistics import SummaryStatistics
from gentropy.dataset.study_locus import StudyLocus
from pyspark.sql.functions import col

session = Session()
```

```python
%%time
# Window-based clumping of the full summary statistics
path_sumstats = "gs://ukb_ppp_eur_data/summary_stats"
L = SummaryStatistics.from_parquet(session=session, path=path_sumstats)
slt = WindowBasedClumping.clump(L, gwas_significance=1.7e-11, distance=5e5)
slt.df.write.parquet("gs://ukb_ppp_eur_data/clumped")
```

```python
%%time
# Drop loci in the MHC region (chr6:25-34 Mb), then collect locus statistics
sl = StudyLocus.from_parquet(session=session, path="gs://ukb_ppp_eur_data/clumped")
df = sl.df
df = df.filter(~((col("chromosome") == "6") & (col("position").between(25000000, 34000000))))
df.count()
sl.df = df
wbc = sl.annotate_locus_statistics(summary_statistics=L, collect_locus_distance=1000000)
wbc.df.write.parquet("gs://ukb_ppp_eur_data/collected")
```
Thank you! We just briefly discussed with @d0choa - could you please make a parametrisable step based on your clumping+collection code?
I've now finished uploading all code that I have related to the UKB PPP (EUR) datasource. Currently, three stages are required for the entire processing pipeline.
Notes on data
Storage
Format
CHR:POS:ALLELE0:ALLELE1:imp:v1)

Processing approach
Scope
Harmonisation
ALLELE0 and ALLELE1 do not necessarily correspond to REF and ALT. Hence, we adopt the following algorithm:
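A hedged sketch of one such algorithm (the exact gentropy implementation may differ; column names follow the regenie-style summary stats, and the function is my own): if ALLELE0/ALLELE1 match REF/ALT directly, keep the row; if they match in the flipped orientation, swap the alleles and flip the effect direction and allele frequency; otherwise drop the row.

```python
def harmonise(row, ref, alt):
    """Return the harmonised row, or None if alleles match neither orientation."""
    if (row["ALLELE0"], row["ALLELE1"]) == (ref, alt):
        return row  # already in REF/ALT orientation
    if (row["ALLELE1"], row["ALLELE0"]) == (ref, alt):
        # Flipped orientation: swap alleles, negate beta, complement frequency
        return {
            **row,
            "ALLELE0": ref,
            "ALLELE1": alt,
            "BETA": -row["BETA"],
            "A1FREQ": 1.0 - row["A1FREQ"],
        }
    return None  # neither orientation matches the variant index: drop

row = {"ALLELE0": "G", "ALLELE1": "A", "BETA": 0.2, "A1FREQ": 0.3}
flipped = harmonise(row, "A", "G")
assert flipped["BETA"] == -0.2
assert abs(flipped["A1FREQ"] - 0.7) < 1e-12
assert harmonise(row, "C", "T") is None
```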
Filtering
We will use hard filtering as described below, meaning that variants that don't pass the filters are fully discarded and will not be present in the output.
After filtering, we expect to observe anywhere from 13 to 20 million SNPs
Special cases