Making changes in the summary statistics data model and ingestion logic

DSuveges commented 1 year ago

This ticket collects requested modifications to the summary statistics data model. As multiple functions for different data sources generate summary statistics, it is very important that any update is implemented in every affected place:

[ ] Data class schema
[ ] GWAS Catalog ingestion
[ ] Finngen ingestion
[ ] UKBB ingestion

Update No1: Adding variant level sample size whenever this detail is available.

[ ] GWAS Catalog - datasets might contain sample size, but as this column is not harmonized, we have to explore available column names.
[x] ~~Finngen~~ - No sample count
[x] ~~UKBB - Neale~~ - Sample count not variant specific
[x] ~~UKBB - Saige~~ - Sample count not variant specific

Update No2: Removing confidence intervals.

As these fields are computable from effect size and p-value, and are rarely used in bulk, there's no point in calculating them and storing them with the summary stats. We need to drop them from the schema and remove the corresponding logic from the ingestion.
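
For reference, a minimal sketch of how the intervals could be recomputed on demand from effect size and p-value. It assumes a two-sided p-value derived from a normally distributed Wald statistic (beta / se), a non-zero beta and 0 < p-value < 1. The function name is hypothetical and not part of the codebase.

```python
from statistics import NormalDist
from typing import Tuple


def beta_confidence_interval(
    beta: float, p_value: float, level: float = 0.95
) -> Tuple[float, float]:
    """Recompute a confidence interval for beta from beta and a two-sided p-value.

    Assumes a normal (Wald) test statistic, beta != 0 and 0 < p_value < 1.
    """
    std_normal = NormalDist()
    # |z| implied by the two-sided p-value; the lower tail is used so that
    # very small p-values are not lost when computing 1 - p/2.
    z_observed = -std_normal.inv_cdf(p_value / 2)
    standard_error = abs(beta) / z_observed
    # Critical value for the requested confidence level (about 1.96 for 95%).
    z_critical = std_normal.inv_cdf(0.5 + level / 2)
    return beta - z_critical * standard_error, beta + z_critical * standard_error
```

With beta = 0.5 and p-value = 0.05, the 95% interval has its lower bound at approximately zero, as expected for an association sitting exactly at the significance threshold.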

Update No3: new filters on single point associations and update in harmonization logic

[x] Drop the associations with beta == 0 (we really can’t do anything).
[ ] For the rest of the SNPs, if se==0 and p-value!=1 and p-value!=0 -> infer se from p-value and beta, otherwise drop
[ ] For the rest of the SNPs, if p-value==1 or p-value==0 -> infer it from beta and se
[ ] If p-value is still too small (0) even after inferring -> replace zero with 2e-308
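
The rules above can be sketched for a single association as follows. This is only an illustration in plain Python, not the ingestion code: it assumes a two-sided p-value from a normally distributed statistic beta / se, and the function and constant names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Optional, Tuple

# Floor for p-values that underflow to zero (value taken from the list above).
MIN_P_VALUE = 2e-308


def harmonise_association(
    beta: float, standard_error: float, p_value: float
) -> Optional[Tuple[float, float, float]]:
    """Return (beta, se, p-value) after applying the filters, or None to drop."""
    # Rule 1: nothing can be done with a zero effect size.
    if beta == 0:
        return None
    if standard_error == 0:
        # Rule 2: se can only be recovered from an informative p-value.
        if p_value in (0, 1):
            return None
        z = -NormalDist().inv_cdf(p_value / 2)  # |z| for a two-sided test
        standard_error = abs(beta) / z
    elif p_value in (0, 1):
        # Rule 3: recompute the two-sided p-value from beta and se.
        z = abs(beta) / standard_error
        p_value = math.erfc(z / math.sqrt(2))
    # Rule 4: the p-value can still underflow to zero for very large |z|.
    if p_value == 0:
        p_value = MIN_P_VALUE
    return beta, standard_error, p_value
```

For example, `harmonise_association(10.0, 0.1, 0.0)` gives |z| = 100, whose p-value underflows in double precision, so the floor is applied.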

DSuveges commented 1 year ago

Adding variant level sample size whenever this detail is available

GWAS Catalog

Out of the 29.9k summary statistics harmonized by the GWAS Catalog, the following columns contain sample sizes (with the number of studies in which each column was used):

   4458 n
      2 N
      4 n_analyzed
      3 n_samples
      1 n_total
      1 samplesize
      1 totalsamplesize

List of all columns used by all summary statistics: sumstats_columns.txt
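
Given the column names found above, picking the sample size column from a file header could look like the sketch below. The matching is case-sensitive because both `n` and `N` occur. The lookup order and the function name are assumptions made for illustration.

```python
from typing import Optional, Sequence

# Column names observed in the harmonized GWAS Catalog summary statistics.
SAMPLE_SIZE_COLUMNS = (
    "n",
    "N",
    "n_analyzed",
    "n_samples",
    "n_total",
    "samplesize",
    "totalsamplesize",
)


def find_sample_size_column(header: Sequence[str]) -> Optional[str]:
    """Return the first known sample size column present in the header, if any."""
    for candidate in SAMPLE_SIZE_COLUMNS:
        if candidate in header:
            return candidate
    return None
```

A header without any of these names returns `None`, in which case the variant level sample size would stay null.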

Finngen

It seems Finngen doesn't provide variant level sample sizes:

╰─ gsutil cat gs://finngen-public-data-r9/summary_stats/finngen_R9_AB1_ANOGENITAL_HERPES_SIMPLEX.gz | gzcat | head -5 | column -t
#chrom  pos    ref  alt  rsids         nearest_genes  pval      mlogp     beta      sebeta    af_alt       af_alt_cases  af_alt_controls
1       13668  G    A    rs2691328     OR4F5          0.527885  0.277461  -0.25741  0.407785  0.00583101   0.00550145    0.00583264
1       14773  C    T    rs878915777   OR4F5          0.297251  0.526876  0.267764  0.256886  0.0135226    0.0146561     0.013517
1       15585  G    A    rs533630043   OR4F5          0.112603  0.94845   -1.26699  0.798556  0.00111401   0.000560533   0.00111674
1       16549  T    C    rs1262014613  OR4F5          0.304912  0.515825  -1.32876  1.29515   0.000562811  0.000334372   0.000563941

UKBB Neale

Raw datasets do contain a sample size column, called n_complete_samples:

╰─ gsutil cat gs://genetics-portal-dev-analysis/hn9/neale_sumstats/100001_raw.neale2.gwas.imputed_v3.both_sexes.GRCh38.tsv.gz | gzcat | head | cut -f-3,9 | column -t
chromosome  base_pair_location  other_allele  n_complete_samples
1           758351              A             51453
1           909894              G             51453
1           933024              C             51453
1           973673              G             51453
1           1037956             A             51453
1           1055604             G             51453
1           1065477             G             51453
1           1065797             G             51453
1           1148447             T             51453

However, it seems this column contains a constant value:

╰─ gsutil cat gs://genetics-portal-dev-analysis/hn9/neale_sumstats/100001_raw.neale2.gwas.imputed_v3.both_sexes.GRCh38.tsv.gz | gzcat | head -10000 | cut -f9 | sort | uniq -c
9999 51453
   1 n_complete_samples

UKBB - Saige

The same is true:

╰─ gsutil cat gs://genetics-portal-raw/uk_biobank_sumstats/saige_nov2017/raw/PheCode_008_SAIGE_MACge20.txt.gz | gzcat | head | cut -f-3,8-9 |column -t #| less -S
chrom  pos    snpid        num_cases  num_controls
1      16071  rs541172944  8991       399970
1      16280  rs866639523  8991       399970
1      49298  rs10399793   8991       399970
1      54353  rs140052487  8991       399970
1      54564  rs558796213  8991       399970
1      54591  rs561234294  8991       399970
1      54676  rs2462492    8991       399970
1      55326  rs3107975    8991       399970
1      55351  rs531766459  8991       399970

There is a single pair of sample counts in the entire summary statistics file:

╰─ gsutil cat gs://genetics-portal-raw/uk_biobank_sumstats/saige_nov2017/raw/PheCode_008_SAIGE_MACge20.txt.gz | gzcat |  cut -f8-9 | sort -ur | tail -n+2
8991    399970

addramir commented 1 year ago

Since FinnGen and UKBB-* are not meta-analyses but studies with good imputation, it is expected that they would have a consistent N across SNPs. I suggest adding a sample size column to all studies, even if the sample size remains consistent, in order to prevent format disharmonization. This approach will facilitate uniform usage of this column across all studies in the future.
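
As a rough sketch of this suggestion, a study-level sample size could be attached to every variant row. The column name `sampleSize`, the function name, and the use of cases plus controls as the total for case-control studies are all assumptions, not decisions made in this thread.

```python
from typing import Any, Dict, Iterable, Iterator


def add_constant_sample_size(
    rows: Iterable[Dict[str, Any]],
    study_sample_size: int,
    column: str = "sampleSize",  # hypothetical column name
) -> Iterator[Dict[str, Any]]:
    """Yield copies of the rows with the study-level sample size attached."""
    for row in rows:
        yield {**row, column: study_sample_size}


# Example with the Saige counts shown earlier (cases + controls, an assumption).
saige_rows = [{"snpid": "rs541172944"}, {"snpid": "rs866639523"}]
with_n = list(add_constant_sample_size(saige_rows, 8991 + 399970))
```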

addramir commented 1 year ago

Additional comments:
1) I think we can remove the |-- betaConfidenceIntervalLower: double (nullable = true) and |-- betaConfidenceIntervalUpper: double (nullable = true) columns.
2) standardError should be obligatory (nullable = false).

So the final schema is:

ireneisdoomed commented 10 months ago

@DSuveges @addramir If we remove the confidence intervals from the summary stats, should we drop them from the StudyLocus as well? https://github.com/opentargets/genetics_etl_python/blob/09dd2bc355185d8d2fae5999b0ad8413c22a8735/src/otg/assets/schemas/study_locus.json#L47

DSuveges commented 10 months ago

I think we should. I'll submit a PR.

addramir commented 6 months ago

Can we close this issue?
