opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Decide which deCODE pQTL data we need #3237

Closed tskir closed 3 months ago

tskir commented 3 months ago

As we just discussed, there are a number of files available from the deCODE pQTL datasets, some of which I have already downloaded. We need to decide whether these files are enough or whether we need anything else. In the comments below, I'll post details on the files I have already downloaded and on the ones that are additionally available.

tskir commented 3 months ago

Files from proteomics (2021 paper), already downloaded

It's this paper: https://pubmed.ncbi.nlm.nih.gov/34857953/ — Ferkingstad, E. et al. Large-scale integration of the plasma proteome with genetics and disease.

4909 files in total (data files only, not counting the *md5sum files), about 4 TB. Sample listing (size in bytes, timestamp, file name):

 953734152  2024-03-05T12:39:35Z  10000_28_CRYBB2_CRBB2.txt.gz
 952601626  2024-03-05T12:34:28Z  10001_7_RAF1_c_Raf.txt.gz
 952887592  2024-03-05T12:34:21Z  10003_15_ZNF41_ZNF41.txt.gz

Contents:

Chrom  Pos     Name             rsids        effectAllele  otherAllele  Beta    Pval      minus_log10_pval  SE        N      ImpMAF
chr1   23597   chr1:23597:G:A   NA           G             A            0.6616  0.157677  0.80223           0.468248  35275  0.00011
chr1   92875   chr1:92875:C:T   rs193157612  C             T            0.6019  0.183475  0.73642           0.452512  35275  0.00012
chr1   107682  chr1:107682:G:C  rs879827054  G             C            0.8227  0.176931  0.75220           0.609288  35275  0.00011
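
A minimal sketch of how one of these files could be read, in case it helps when deciding what we need. The column names come from the preview above; the tab delimiter and the treatment of `NA` as a missing rsID are assumptions based on how the preview looks, and `read_sumstats` / `open_sumstats` are hypothetical helper names.

```python
import csv
import gzip
import io

# Column names are taken from the preview above. The tab delimiter and the
# "NA" missing-value marker are assumptions, not documented facts.
FLOAT_COLUMNS = ("Beta", "Pval", "minus_log10_pval", "SE", "ImpMAF")
INT_COLUMNS = ("Pos", "N")


def read_sumstats(lines):
    """Yield one dict per variant from an iterable of tab-separated lines."""
    for row in csv.DictReader(lines, delimiter="\t"):
        for name in FLOAT_COLUMNS:
            row[name] = float(row[name])
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        if row["rsids"] == "NA":
            row["rsids"] = None
        yield row


def open_sumstats(path_or_fileobj):
    """Open a gzip-compressed file in text mode, decompressing on the fly."""
    return gzip.open(path_or_fileobj, "rt", newline="")


# Demo on two rows copied from the preview, compressed in memory so the
# example runs without access to the real files.
sample = (
    "Chrom\tPos\tName\trsids\teffectAllele\totherAllele\tBeta\tPval\t"
    "minus_log10_pval\tSE\tN\tImpMAF\n"
    "chr1\t23597\tchr1:23597:G:A\tNA\tG\tA\t0.6616\t0.157677\t0.80223\t"
    "0.468248\t35275\t0.00011\n"
    "chr1\t92875\tchr1:92875:C:T\trs193157612\tC\tT\t0.6019\t0.183475\t"
    "0.73642\t0.452512\t35275\t0.00012\n"
)
compressed = io.BytesIO(gzip.compress(sample.encode("utf-8")))
with open_sumstats(compressed) as handle:
    rows = list(read_sumstats(handle))
```

Reading line by line keeps memory use flat, which matters for files of roughly 950 MB each.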

tskir commented 3 months ago

Files from proteomics2023 (2023 paper) — not yet downloaded

It's this paper: https://www.nature.com/articles/s41586-023-06563-x#data-availability — Grímur Hjörleifsson Eldjarn, Egil Ferkingstad et al. Large-scale plasma proteomics comparisons through genetics and disease associations

Files are all stored in the same directory, but they have different prefixes, so the following sections are organised accordingly.

OLINK files

Filenames:

GBR_UKB_Africa_OLINK_OID20049_NPPB_Natriuretic_peptides_B_adjAgeSexBatPC_InvNorm.txt.gz
GBR_UKB_Africa_OLINK_OID20050_TNNI3_Troponin_I_adjAgeSexBatPC_InvNorm.txt.gz
GBR_UKB_Africa_OLINK_OID20051_HNRNPK_Heterogeneous_nuclear_ribonucleoprotein_K_adjAgeSexBatPC_InvNorm.txt.gz

Contents:

Chrom  Pos     Name             rsids        A1  A0  Beta  Pval      minus_log10_pval  SE   N     ImpFreqA1
chr1   586325  chr1:586325:G:T  rs907825527  G   T   NaN   NaN       NaN               NaN  1443  0.00067910431
chr1   586338  chr1:586338:G:T  rs879832726  G   T   NaN   NaN       NaN               NaN  1443  0.00052125026
chr1   586844  chr1:586844:A:G  rs564572333  A   G   NaN   0.472039  0.32602           NaN  1443  0.014481348

OLINK2 files

Filenames:

GBR_UKB_Africa_OLINK2_OID30049_RTKN2_Rhotekin_2_adjAgeSexPC_InvNorm.txt.gz
GBR_UKB_Africa_OLINK2_OID30050_DENND2B_DENN_domain_containing_protein_2B_adjAgeSexPC_InvNorm.txt.gz
GBR_UKB_Africa_OLINK2_OID30051_BHMT2_S_methylmethionine_homocysteine_S_methyltransferase_BHMT2_adjAgeSexPC_InvNorm.txt.gz

Contents:

Chrom  Pos     Name             rsids        A1  A0  Beta  Pval      minus_log10_pval  SE   N     ImpFreqA1
chr1   586325  chr1:586325:G:T  rs907825527  G   T   NaN   NaN       NaN               NaN  1059  0.00067910431
chr1   586338  chr1:586338:G:T  rs879832726  G   T   NaN   NaN       NaN               NaN  1059  0.00052125026
chr1   586844  chr1:586844:A:G  rs564572333  A   G   NaN   0.154800  0.81023           NaN  1059  0.014481348

Proteomics_PC0 files (5284 total)

Filenames:

Proteomics_PC0_10000_28_CRYBB2_CRBB2.txt.gz
Proteomics_PC0_10001_7_RAF1_c_Raf.txt.gz
Proteomics_PC0_10003_15_ZNF41_ZNF41.txt.gz

Contents:

Chrom  Pos     Name               rsids         A1   A0  Beta     Pval      minus_log10_pval  SE        N      ImpFreqA1
chr1   152835  chr1:152835:A:T    rs1446209547  A    T   0.4592   0.278012  0.55594           0.423305  35896  0.000101105
chr1   201430  chr1:201430:TTC:T  rs1178382200  TTC  T   0.0065   0.670179  0.17381           0.015262  35896  0.0766199
chr1   455948  chr1:455948:G:C    rs1363653182  G    C   -0.2066  0.511845  0.29086           0.314955  35896  0.00012111

Proteomics_SMP files (5284 total)

Filenames:

Proteomics_SMP_PC0_10000_28_CRYBB2_CRBB2.txt.gz
Proteomics_SMP_PC0_10001_7_RAF1_c_Raf.txt.gz
Proteomics_SMP_PC0_10003_15_ZNF41_ZNF41.txt.gz

Contents:

Chrom  Pos     Name               rsids         A1   A0  Beta     Pval      minus_log10_pval  SE        N      ImpFreqA1
chr1   152835  chr1:152835:A:T    rs1446209547  A    T   0.4448   0.286502  0.54287           0.417329  35652  0.000101105
chr1   201430  chr1:201430:TTC:T  rs1178382200  TTC  T   0.0211   0.162654  0.78874           0.015112  35652  0.0766199
chr1   455948  chr1:455948:G:C    rs1363653182  G    C   -0.3007  0.332963  0.47760           0.310588  35652  0.00012111
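
Since all proteomics2023 files sit in one directory and are told apart only by prefix, here is a rough sketch of how filenames could be grouped by category. It is based only on the filename samples in this comment. Treating everything before `_OID` as the OLINK category is an assumption, and `category` / `group_by_category` are hypothetical names.

```python
import re
from collections import defaultdict

# Prefixes as they appear in the filename samples in this comment.
FIXED_PREFIXES = ("Proteomics_SMP_PC0", "Proteomics_PC0")

# Assumption: for OLINK files, the category is the cohort label plus
# "OLINK" or "OLINK2", i.e. everything before the "_OID<digits>_" part.
OLINK_PATTERN = re.compile(r"^(?P<prefix>.+?_OLINK2?)_OID\d+_")


def category(filename):
    """Return the category prefix for a filename, or 'unclassified'."""
    for prefix in FIXED_PREFIXES:
        if filename.startswith(prefix + "_"):
            return prefix
    match = OLINK_PATTERN.match(filename)
    return match.group("prefix") if match else "unclassified"


def group_by_category(filenames):
    """Map each category prefix to the list of filenames that carry it."""
    groups = defaultdict(list)
    for name in filenames:
        groups[category(name)].append(name)
    return dict(groups)


examples = [
    "GBR_UKB_Africa_OLINK_OID20049_NPPB_Natriuretic_peptides_B_adjAgeSexBatPC_InvNorm.txt.gz",
    "GBR_UKB_Africa_OLINK2_OID30049_RTKN2_Rhotekin_2_adjAgeSexPC_InvNorm.txt.gz",
    "Proteomics_PC0_10000_28_CRYBB2_CRBB2.txt.gz",
    "Proteomics_SMP_PC0_10000_28_CRYBB2_CRBB2.txt.gz",
]
groups = group_by_category(examples)
```

Anything that matches none of the known patterns lands in `unclassified`, so unexpected filenames are surfaced rather than silently misfiled.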

tskir commented 3 months ago

Discussed with @addramir, decided that we want the entire proteomics2023 dataset, too. Will mirror.

tskir commented 3 months ago

@addramir @d0choa @DSuveges

Summary of completed work

✅ Original dataset (proteomics)

This was mirrored previously, but I have now amended the protocol to calculate and compare MD5 sums. All matched, so the dataset can be considered fully mirrored and checked.

✅ New dataset discussed in this issue (proteomics2023)

This dataset ran into many more problems than the previous one, as reflected in the protocol, but all of them are now fixed. It can also be considered fully mirrored and checked.

☢️ For posterity: list of issues that had to be solved

  1. For proteomics, the list of download links is malformed as it contains all links for proteomics + broken links for proteomics2023.
  2. aria2, which is recommended by deCODE to bulk download the files, plainly does not work because it's not compatible with the HTTP server version they use. Fixed by using curl + parallel + custom request retry logic.
  3. For proteomics2023, the more than 38,000 files fall into several categories (such as different cohorts and methods), but the links did not reflect this structure. I have separated the files into subdirectories based on their prefix, which is now reflected in the bucket structure.
  4. The files are compressed; however, the MD5 sums provided by deCODE are for the uncompressed data. This is accounted for in the MD5 checking process.
  5. For exactly 1,000 files of proteomics2023, MD5 sums provided by deCODE are all the same, corresponding to an empty byte stream. For those files, since we can't fully check MD5, I've checked that they represent valid GZIP archives.
  6. The deCODE server occasionally has outages where all requests are rejected for about an hour. Solved by improving custom retry logic.
  7. Very rarely, the deCODE server returns a successful reply but an empty response. Solved by manually retrieving and checking the single affected file.
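
For illustration, the checks from items 4, 5 and 7 could look roughly like the sketch below. This is not the actual protocol code: the function names and status strings are hypothetical, and it assumes the published sums are hex MD5 digests of the decompressed bytes, as described in item 4.

```python
import gzip
import hashlib
import io
import zlib

# MD5 of an empty byte stream: d41d8cd98f00b204e9800998ecf8427e.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()


def md5_of_uncompressed(fileobj, chunk_size=1 << 20):
    """Stream-decompress a gzip file object and hash the decompressed bytes."""
    digest = hashlib.md5()
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
        while chunk := gz.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def check_download(fileobj, published_md5):
    """Classify a downloaded file against its published MD5 sum.

    Returns one of: "ok", "mismatch", "corrupt", "empty", "valid_gzip_only".
    """
    try:
        actual = md5_of_uncompressed(fileobj)
    except (OSError, EOFError, zlib.error):
        # Not gzip at all, truncated, or damaged compressed data.
        return "corrupt"
    if actual == EMPTY_MD5:
        # Nothing came out of the stream, e.g. an empty server response.
        # A gzip archive of zero bytes of data is flagged here as well.
        return "empty"
    if published_md5 == EMPTY_MD5:
        # The published sum is unusable; we only know the archive is valid.
        return "valid_gzip_only"
    return "ok" if actual == published_md5 else "mismatch"


# Demo with in-memory data instead of real downloads.
payload = b"Chrom\tPos\n"
good = gzip.compress(payload)
published = hashlib.md5(payload).hexdigest()
status_ok = check_download(io.BytesIO(good), published)
status_unverifiable = check_download(io.BytesIO(good), EMPTY_MD5)
status_empty = check_download(io.BytesIO(b""), published)
```

Hashing in chunks avoids loading a whole decompressed file into memory, and reading the stream to the end doubles as the GZIP validity check used for the 1,000 files from item 5.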

Phew.