Closed DSuveges closed 6 months ago
Out of the 29.9k summary statistics harmonized by the GWAS Catalog, the following columns contain sample sizes (with the number of studies in which each column was used):
4458 n
2 N
4 n_analyzed
3 n_samples
1 n_total
1 samplesize
1 totalsamplesize
List of all columns used by all summary statistics: sumstats_columns.txt
It seems FinnGen doesn't provide variant-level sample sizes:
╰─ gsutil cat gs://finngen-public-data-r9/summary_stats/finngen_R9_AB1_ANOGENITAL_HERPES_SIMPLEX.gz | gzcat | head -5 | column -t
#chrom pos ref alt rsids nearest_genes pval mlogp beta sebeta af_alt af_alt_cases af_alt_controls
1 13668 G A rs2691328 OR4F5 0.527885 0.277461 -0.25741 0.407785 0.00583101 0.00550145 0.00583264
1 14773 C T rs878915777 OR4F5 0.297251 0.526876 0.267764 0.256886 0.0135226 0.0146561 0.013517
1 15585 G A rs533630043 OR4F5 0.112603 0.94845 -1.26699 0.798556 0.00111401 0.000560533 0.00111674
1 16549 T C rs1262014613 OR4F5 0.304912 0.515825 -1.32876 1.29515 0.000562811 0.000334372 0.000563941
Raw datasets do contain a sample size column, called n_complete_samples:
╰─ gsutil cat gs://genetics-portal-dev-analysis/hn9/neale_sumstats/100001_raw.neale2.gwas.imputed_v3.both_sexes.GRCh38.tsv.gz | gzcat | head | cut -f-3,9 | column -t
chromosome base_pair_location other_allele n_complete_samples
1 758351 A 51453
1 909894 G 51453
1 933024 C 51453
1 973673 G 51453
1 1037956 A 51453
1 1055604 G 51453
1 1065477 G 51453
1 1065797 G 51453
1 1148447 T 51453
However, it seems this column contains a constant value:
╰─ gsutil cat gs://genetics-portal-dev-analysis/hn9/neale_sumstats/100001_raw.neale2.gwas.imputed_v3.both_sexes.GRCh38.tsv.gz | gzcat | head -10000 | cut -f9 | sort | uniq -c
9999 51453
1 n_complete_samples
The same is true for the UKBB SAIGE summary statistics:
╰─ gsutil cat gs://genetics-portal-raw/uk_biobank_sumstats/saige_nov2017/raw/PheCode_008_SAIGE_MACge20.txt.gz | gzcat | head | cut -f-3,8-9 |column -t #| less -S
chrom pos snpid num_cases num_controls
1 16071 rs541172944 8991 399970
1 16280 rs866639523 8991 399970
1 49298 rs10399793 8991 399970
1 54353 rs140052487 8991 399970
1 54564 rs558796213 8991 399970
1 54591 rs561234294 8991 399970
1 54676 rs2462492 8991 399970
1 55326 rs3107975 8991 399970
1 55351 rs531766459 8991 399970
There is a single pair of sample counts in the entire summary statistics file:
╰─ gsutil cat gs://genetics-portal-raw/uk_biobank_sumstats/saige_nov2017/raw/PheCode_008_SAIGE_MACge20.txt.gz | gzcat | cut -f8-9 | sort -ur | tail -n+2
8991 399970
Since FinnGen and UKBB-* are not meta-analyses but studies with good imputation, it is expected that they would have a consistent N across SNPs. I suggest adding a sample size column to all studies, even if the sample size remains consistent, in order to prevent format disharmonization. This approach will facilitate uniform usage of this column across all studies in the future.
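A minimal sketch of this suggestion, assuming each summary statistics row has already been parsed into a dictionary and that a study-level sample size is available from the study metadata. The function name `add_sample_size` and its parameters are hypothetical and not part of the existing codebase; the output column is called `N` only to mirror the proposal.

```python
from typing import Dict, Iterable, Iterator, Optional


def add_sample_size(
    rows: Iterable[Dict[str, object]],
    study_n: int,
    source_column: Optional[str] = None,
) -> Iterator[Dict[str, object]]:
    """Yield copies of the rows with a mandatory integer `N` column.

    If `source_column` is given and a row has a value for it, that value is
    used as the variant-level sample size. Otherwise the study-level
    `study_n` is used, so every study ends up with the same column.
    """
    for row in rows:
        out = dict(row)  # copy, so the caller's row is not modified
        raw = row.get(source_column) if source_column else None
        if raw is None or raw == "":
            out["N"] = study_n
        else:
            # Values read from text files arrive as strings such as "51453".
            out["N"] = int(float(raw))
        yield out
```

For a source with a per-row count, such as `n_complete_samples` above, the column name would be passed as `source_column`; for a source without one, only `study_n` would be supplied.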
Additional comments:
1) I think we can remove the |-- betaConfidenceIntervalLower: double (nullable = true) and |-- betaConfidenceIntervalUpper: double (nullable = true) columns.
2) standardError should be obligatory (nullable = false).
So the final schema is:
root
 |-- studyId: string (nullable = false)
 |-- variantId: string (nullable = false)
 |-- chromosome: string (nullable = false)
 |-- position: integer (nullable = false)
 |-- beta: double (nullable = false)
 |-- N: integer (nullable = false)
 |-- effectAlleleFrequencyFromSource: float (nullable = true)
 |-- standardError: double (nullable = false)
 |-- pValueMantissa: float (nullable = false)
 |-- pValueExponent: integer (nullable = false)
@DSuveges @addramir If we remove the confidence intervals from the summary stats, should we drop them from the StudyLocus as well? https://github.com/opentargets/genetics_etl_python/blob/09dd2bc355185d8d2fae5999b0ad8413c22a8735/src/otg/assets/schemas/study_locus.json#L47
I think we should. I'll submit a PR.
Can we close this issue?
This ticket collects the requested modifications to the summary statistics data module. As multiple functions, one per data source, generate summary statistics, it is very important that every update is implemented in every affected place:
Update No1: Adding variant level sample size whenever this detail is available.
FinnGen - no sample count
UKBB - Neale - sample count not variant specific
UKBB - Saige - sample count not variant specific
Update No2: Removing confidence intervals.
As these fields are computable from the effect size and p-value, and are rarely used in bulk, there's no point in calculating them and storing them with the summary stats. We need to drop them from the schema and remove the corresponding logic from the ingestion.
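To illustrate why the stored intervals are redundant, here is a minimal sketch that recomputes them on demand. It assumes a Wald-type test with a two-sided p-value under a standard normal approximation, which may not hold for every source. The function names are hypothetical and not part of the ingestion code.

```python
from statistics import NormalDist
from typing import Tuple

_STANDARD_NORMAL = NormalDist()


def confidence_interval(
    beta: float, standard_error: float, level: float = 0.95
) -> Tuple[float, float]:
    """Return the (lower, upper) confidence interval of the effect size."""
    # Two-sided critical value, roughly 1.96 for a 95% interval.
    z = _STANDARD_NORMAL.inv_cdf(0.5 + level / 2)
    half_width = z * standard_error
    return beta - half_width, beta + half_width


def standard_error_from_p_value(beta: float, p_value: float) -> float:
    """Recover the standard error from beta and a two-sided p-value.

    Assumes p = 2 * (1 - Phi(|beta| / se)), so |z| = -Phi^-1(p / 2).
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must be strictly between 0 and 1")
    z = -_STANDARD_NORMAL.inv_cdf(p_value / 2)
    return abs(beta) / z


# First FinnGen row shown earlier in this thread: beta, sebeta and pval.
beta, sebeta, pval = -0.25741, 0.407785, 0.527885
lower, upper = confidence_interval(beta, sebeta)
recovered_se = standard_error_from_p_value(beta, pval)
```

Where `standardError` is present, the interval follows directly from `beta` and `standardError`; the p-value route is only needed when the standard error is missing.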
Update No3: new filters on single point associations and update in harmonization logic