Closed (tskir closed 3 months ago)
proteomics (2021 paper), already downloaded
It's this paper: https://pubmed.ncbi.nlm.nih.gov/34857953/ (Ferkingstad, E. et al. Large-scale integration of the plasma proteome with genetics and disease.)
Files, total 4909 (only data files, not counting *md5sum files), about 4 TB:
953734152 2024-03-05T12:39:35Z 10000_28_CRYBB2_CRBB2.txt.gz
952601626 2024-03-05T12:34:28Z 10001_7_RAF1_c_Raf.txt.gz
952887592 2024-03-05T12:34:21Z 10003_15_ZNF41_ZNF41.txt.gz
Contents:
Chrom Pos Name rsids effectAllele otherAllele Beta Pval minus_log10_pval SE N ImpMAF
chr1 23597 chr1:23597:G:A NA G A 0.6616 0.157677 0.80223 0.468248 35275 0.00011
chr1 92875 chr1:92875:C:T rs193157612 C T 0.6019 0.183475 0.73642 0.452512 35275 0.00012
chr1 107682 chr1:107682:G:C rs879827054 G C 0.8227 0.176931 0.75220 0.609288 35275 0.00011
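For a quick sanity check of this layout, here is a minimal Python sketch (not part of the mirroring protocol) that parses the header and rows shown above and confirms that `minus_log10_pval` agrees with `-log10(Pval)`. The function names `parse_sumstats` and `log10_pval_consistent` are hypothetical, and whitespace-delimited columns are an assumption based on the excerpt.

```python
import math

# Rows copied from the excerpt above; the real files are gzipped and much larger.
SAMPLE = """\
Chrom Pos Name rsids effectAllele otherAllele Beta Pval minus_log10_pval SE N ImpMAF
chr1 23597 chr1:23597:G:A NA G A 0.6616 0.157677 0.80223 0.468248 35275 0.00011
chr1 92875 chr1:92875:C:T rs193157612 C T 0.6019 0.183475 0.73642 0.452512 35275 0.00012
chr1 107682 chr1:107682:G:C rs879827054 G C 0.8227 0.176931 0.75220 0.609288 35275 0.00011
"""


def parse_sumstats(text):
    """Split whitespace-delimited text into a list of {column: value} dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def log10_pval_consistent(row, tol=1e-3):
    """Return True if minus_log10_pval matches -log10(Pval) within tol.

    Rows where both values are NaN count as consistent; a non-positive
    Pval cannot be log-transformed and counts as inconsistent.
    """
    pval = float(row["Pval"])
    reported = float(row["minus_log10_pval"])
    if math.isnan(pval) or math.isnan(reported):
        return math.isnan(pval) and math.isnan(reported)
    if pval <= 0:
        return False
    return abs(-math.log10(pval) - reported) < tol


rows = parse_sumstats(SAMPLE)
print(all(log10_pval_consistent(row) for row in rows))
```

The same check should apply to the proteomics2023 files, since they use the same `Pval` and `minus_log10_pval` column names.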
proteomics2023 (2023 paper), not yet downloaded
It's this paper: https://www.nature.com/articles/s41586-023-06563-x#data-availability (Grímur Hjörleifsson Eldjarn, Egil Ferkingstad et al. Large-scale plasma proteomics comparisons through genetics and disease associations.)
Files are all stored in the same directory, but they have different prefixes, so the following sections are organised accordingly.
Filenames:
GBR_UKB_Africa_OLINK_OID20049_NPPB_Natriuretic_peptides_B_adjAgeSexBatPC_InvNorm.txt.gz
GBR_UKB_Africa_OLINK_OID20050_TNNI3_Troponin_I_adjAgeSexBatPC_InvNorm.txt.gz
GBR_UKB_Africa_OLINK_OID20051_HNRNPK_Heterogeneous_nuclear_ribonucleoprotein_K_adjAgeSexBatPC_InvNorm.txt.gz
Contents:
Chrom Pos Name rsids A1 A0 Beta Pval minus_log10_pval SE N ImpFreqA1
chr1 586325 chr1:586325:G:T rs907825527 G T NaN NaN NaN NaN 1443 0.00067910431
chr1 586338 chr1:586338:G:T rs879832726 G T NaN NaN NaN NaN 1443 0.00052125026
chr1 586844 chr1:586844:A:G rs564572333 A G NaN 0.472039 0.32602 NaN 1443 0.014481348
Filenames:
GBR_UKB_Africa_OLINK2_OID30049_RTKN2_Rhotekin_2_adjAgeSexPC_InvNorm.txt.gz
GBR_UKB_Africa_OLINK2_OID30050_DENND2B_DENN_domain_containing_protein_2B_adjAgeSexPC_InvNorm.txt.gz
GBR_UKB_Africa_OLINK2_OID30051_BHMT2_S_methylmethionine_homocysteine_S_methyltransferase_BHMT2_adjAgeSexPC_InvNorm.txt.gz
Contents:
Chrom Pos Name rsids A1 A0 Beta Pval minus_log10_pval SE N ImpFreqA1
chr1 586325 chr1:586325:G:T rs907825527 G T NaN NaN NaN NaN 1059 0.00067910431
chr1 586338 chr1:586338:G:T rs879832726 G T NaN NaN NaN NaN 1059 0.00052125026
chr1 586844 chr1:586844:A:G rs564572333 A G NaN 0.154800 0.81023 NaN 1059 0.014481348
Filenames:
Proteomics_PC0_10000_28_CRYBB2_CRBB2.txt.gz
Proteomics_PC0_10001_7_RAF1_c_Raf.txt.gz
Proteomics_PC0_10003_15_ZNF41_ZNF41.txt.gz
Contents:
Chrom Pos Name rsids A1 A0 Beta Pval minus_log10_pval SE N ImpFreqA1
chr1 152835 chr1:152835:A:T rs1446209547 A T 0.4592 0.278012 0.55594 0.423305 35896 0.000101105
chr1 201430 chr1:201430:TTC:T rs1178382200 TTC T 0.0065 0.670179 0.17381 0.015262 35896 0.0766199
chr1 455948 chr1:455948:G:C rs1363653182 G C -0.2066 0.511845 0.29086 0.314955 35896 0.00012111
Filenames:
Proteomics_SMP_PC0_10000_28_CRYBB2_CRBB2.txt.gz
Proteomics_SMP_PC0_10001_7_RAF1_c_Raf.txt.gz
Proteomics_SMP_PC0_10003_15_ZNF41_ZNF41.txt.gz
Contents:
Chrom Pos Name rsids A1 A0 Beta Pval minus_log10_pval SE N ImpFreqA1
chr1 152835 chr1:152835:A:T rs1446209547 A T 0.4448 0.286502 0.54287 0.417329 35652 0.000101105
chr1 201430 chr1:201430:TTC:T rs1178382200 TTC T 0.0211 0.162654 0.78874 0.015112 35652 0.0766199
chr1 455948 chr1:455948:G:C rs1363653182 G C -0.3007 0.332963 0.47760 0.310588 35652 0.00012111
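Since all these files sit in one directory and differ only by prefix, grouping them can be sketched roughly as below. This is an illustration only: the prefix list contains just the four prefixes visible in the listings above (the full dataset may have more), and the helper names `prefix_of` and `group_by_prefix` as well as the `unclassified` label are hypothetical.

```python
from collections import defaultdict

# Only the prefixes visible in the excerpts above; the full dataset may contain others.
KNOWN_PREFIXES = (
    "GBR_UKB_Africa_OLINK_",
    "GBR_UKB_Africa_OLINK2_",
    "Proteomics_PC0_",
    "Proteomics_SMP_PC0_",
)


def prefix_of(filename):
    """Return the matching known prefix (without trailing underscore)."""
    # Try longer prefixes first so a more specific prefix always wins.
    for prefix in sorted(KNOWN_PREFIXES, key=len, reverse=True):
        if filename.startswith(prefix):
            return prefix.rstrip("_")
    return "unclassified"


def group_by_prefix(filenames):
    """Map each prefix to the list of filenames that start with it."""
    groups = defaultdict(list)
    for name in filenames:
        groups[prefix_of(name)].append(name)
    return dict(groups)


example = [
    "GBR_UKB_Africa_OLINK_OID20049_NPPB_Natriuretic_peptides_B_adjAgeSexBatPC_InvNorm.txt.gz",
    "GBR_UKB_Africa_OLINK2_OID30049_RTKN2_Rhotekin_2_adjAgeSexPC_InvNorm.txt.gz",
    "Proteomics_PC0_10000_28_CRYBB2_CRBB2.txt.gz",
    "Proteomics_SMP_PC0_10000_28_CRYBB2_CRBB2.txt.gz",
]
for prefix, names in group_by_prefix(example).items():
    print(prefix, len(names))
```

Anything that lands in `unclassified` would signal a prefix not covered by the list.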
Discussed with @addramir; we decided that we want the entire proteomics2023 dataset, too. Will mirror.
@addramir @d0choa @DSuveges
proteomics: This was mirrored previously, but I have now amended the protocol to calculate and compare MD5 sums. All matched, so the dataset can be considered fully mirrored and checked.
proteomics2023: This ran into many more problems than the previous one, as reflected in the protocol, but all of them are now fixed. This dataset can also be considered fully mirrored and checked.
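For reference, the "calculate and compare MD5 sums" step could look roughly like the sketch below. It is not the actual protocol: the helper names are hypothetical, and it assumes each `*md5sum` file holds a hex digest as its first whitespace-separated token, which I have not verified against the deCODE files.

```python
import hashlib


def md5_of_stream(stream, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a binary stream, reading it in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path):
    """Compute the MD5 hex digest of a file without loading it fully into memory."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle)


def matches_expected(actual_hex, md5sum_line):
    """Compare a computed digest with the first token of an md5sum-style line.

    Assumption: the expected digest is the first whitespace-separated token.
    """
    tokens = md5sum_line.split()
    return bool(tokens) and tokens[0].lower() == actual_hex.lower()
```

Reading in chunks matters here because each data file is close to 1 GB.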
For proteomics, the list of download links is malformed, as it contains all links for proteomics plus broken links for proteomics2023.
For proteomics2023, the more than 38,000 files come from several categories (such as different cohorts and methods), but the links didn't reflect this structure. I've separated the files into subdirectories based on their prefix, which is now reflected in the bucket structure.
For some proteomics2023 files, the MD5 sums provided by deCODE are all the same, corresponding to an empty byte stream. Since we can't fully check MD5 for those files, I've checked that they are valid GZIP archives.
Phew.
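The fallback for files with unusable MD5 sums can be sketched as follows. This is only one possible way to do it, not necessarily how the check was actually run: it spots a checksum equal to the MD5 of zero bytes, then tests GZIP validity by decompressing the whole file (similar in spirit to `gzip -t`). The function names are hypothetical.

```python
import gzip
import hashlib
import os
import zlib

# MD5 of an empty byte stream: a provided checksum equal to this is a placeholder.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()


def is_placeholder_md5(expected_hex):
    """Return True if the provided checksum is the MD5 of zero bytes."""
    return expected_hex.strip().lower() == EMPTY_MD5


def is_valid_gzip(path, chunk_size=1024 * 1024):
    """Return True if the file is non-empty and decompresses cleanly to the end."""
    # A zero-byte file reads as an empty stream without raising, so reject it explicitly.
    if os.path.getsize(path) == 0:
        return False
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # Bad magic bytes or CRC mismatch, truncated stream, or corrupt deflate data.
        return False
    return True
```

A clean decompression only shows that the archive is internally consistent (including its CRC32), not that it is byte-identical to the source file.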
As we just discussed, there are a number of files available from the deCODE pQTL datasets, some of which I have already downloaded. We need to decide whether these files are enough or whether we need anything in addition. In the comments I'll post details on the files I have already downloaded and on the ones additionally available.