Decisions around summary stats ingestion

DSuveges commented 2 years ago

We have just realised that the rules for ingesting summary stats in the pipeline have not been implemented. The README of the sumstats ingestion says:

# Edit metadata table manually (!) to
# 1. Flag which studies should be imported in the "to_ingest" column (currently, only studies that are European)
# 2. Check total sample size, case count for case-control studies
# 3. Save as configs/gwas_metadata_curated.latest.tsv

To reduce the manual intervention needed to operate the release pipeline, the classification of studies needs to be fully automated. However, the rules are not yet settled. We need decisions on the following points:

[ ] Which ancestries are we accepting for ingestion? Do we have an exclude list or an include list?
[ ] What about studies with mixed ancestries? If 90% of the samples are European, are we still dropping the study?
[ ] What is the minimum case count? Minimum control count? What about sample sizes of continuous-trait studies? What about seriously imbalanced studies, where the case count is hundreds of times smaller than the control count? Do we still ingest them?
[ ] For studies that don't qualify for fine-mapping because of non-European ancestries, should the summary stats be ingested or not? Even if top loci are not identified from the summary stats, they can still be used as a source for the PheWAS plot.

@MayaGhoussaini, what are your views on these questions?

Jeremy37 commented 2 years ago

Commenting in case it's helpful (I described some of the above steps for manually checking study ancestry)...

My heuristic was that if more than 5% of the study sample was non-EUR, I manually excluded the study from sumstat ingestion. Defining the ancestry composition is not completely trivial because the info from the GWAS Catalog on this is just a text field, e.g. it might say "170,911 European ancestry individuals", or it might say "99 Italian ancestry cases, 359 Italian ancestry smoker controls". A few rules could capture most cases, but there will probably be examples that fail to map in an automated pipeline. In addition, I think that applying fine-mapping as we do now, with a UKB ref panel for general EUR samples, is questionable, and my impression is that the field is moving away from this, i.e. towards the view that fine-mapping with out-of-sample LD is dangerous (it will give you false associations).
We should probably have a minimum total sample size limit, but it's hard to know what it should be. An alternative would be to compute a genomic inflation factor and exclude studies based on that, which would be quite straightforward. This could automatically exclude some problematic studies, such as those with a highly imbalanced case/control ratio that don't adjust for it properly. But if the study has adjusted for case/control imbalance, then there's no reason in principle why it should be excluded.
This is a good point. Our pipelines currently take all sumstats that are ingested to run fine-mapping on. I.e. we don't distinguish sumstat studies that should only go into PheWAS from those that should have fine-mapping/coloc. It would be good to ingest sumstats from studies with different ancestries, but not to run fine-mapping on those.
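
For reference, the genomic inflation factor mentioned above could be computed from the p-values alone. The snippet below is a minimal sketch, not the pipeline's code: the function names and the `max_lambda` cut-off are hypothetical, and it assumes two-sided p-values from 1-degree-of-freedom tests.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-squared distribution with 1 degree of freedom (about 0.4549).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Lambda GC: median observed chi-squared divided by the expected median under the null."""
    chi2 = []
    for p in p_values:
        if not 0.0 < p <= 1.0:
            continue  # skip p-values outside (0, 1], including NaN
        z = _STD_NORMAL.inv_cdf(p / 2.0)  # two-sided p-value -> z-score in the lower tail
        chi2.append(z * z)
    if not chi2:
        raise ValueError("no valid p-values")
    return median(chi2) / CHI2_1DF_MEDIAN


def is_inflated(p_values, max_lambda=1.1):
    # max_lambda is a hypothetical cut-off, not a value agreed in this thread.
    return genomic_inflation(p_values) > max_lambda
```

A well-calibrated study should give a value close to 1. Highly polygenic traits in large samples can also push the value above 1, so any cut-off would need to be chosen with that in mind.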

MayaGhoussaini commented 2 years ago

Hi @DSuveges, I can see that Jeremy already responded to the questions and I agree with all his reasoning. Thanks Jeremy! This could only change if we suddenly get hold of non-European reference panels that enable us to perform downstream work on GWAS (i.e. FM). Given that you raised comment number 3, I think we should aim to ingest all sumstats (including mixed ancestries and non-Europeans). These will go into PheWAS but not down the fine-mapping/coloc path. Happy to discuss this more in person if needed.

MayaGhoussaini commented 2 years ago

Another important note, not directly related but good to have in the new layout of the portal: the curated studies from the GWAS Catalog should also feed into PheWAS.

DSuveges commented 2 years ago

Hi @Jeremy37, thanks for the insights! This is indeed a hard nut to crack.

I was planning to use the downloadable ancestry file from the GWAS Catalog, which annotates ancestries in a more structured way (when the information is available from the publication for curation). The data looks like this:

+---------------+-------+--------------------+-----------------------------------------------------------------+---------------------------------------------------------------------------+
|STUDY ACCESSION|STAGE  |NUMBER OF INDIVDUALS|BROAD ANCESTRAL CATEGORY                                         |COUNTRY OF RECRUITMENT                                                     |
+---------------+-------+--------------------+-----------------------------------------------------------------+---------------------------------------------------------------------------+
|GCST90017131   |initial|14306               |European                                                         |Canada, Netherlands, Sweden, U.S., Belgium, Finland, Denmark, U.K., Germany|
|GCST90017131   |initial|811                 |East Asian                                                       |Republic of Korea                                                          |
|GCST90017131   |initial|1097                |Hispanic or Latin American                                       |U.S.                                                                       |
|GCST90017131   |initial|114                 |African American or Afro-Caribbean                               |U.S.                                                                       |
|GCST90017131   |initial|481                 |Greater Middle Eastern (Middle Eastern, North African or Persian)|Israel                                                                     |
|GCST90017131   |initial|1531                |NR, Other admixed ancestry                                       |Canada, Netherlands                                                        |
+---------------+-------+--------------------+-----------------------------------------------------------------+---------------------------------------------------------------------------+

While the initial sample description of the same study looks like this:

14,306 European ancestry individuals, 811 Korean ancestry individuals, 1,097 Hispanic individuals, 114 African American individuals, 481 Middle Eastern ancestry individuals, 1,531 unknown and other admixed ancestry individuals

So curators assign the broader ancestry category irrespective of the country of recruitment. This value is always filled. The caveat is that these categories are only split if the authors provide a granular description and sample sizes. Otherwise, cases like this can happen:

-RECORD 0----------------------------------------------------------------------------------------------------------
 STUDY ACCESSION                | GCST003500                                                                       
 PUBMEDID                       | 27114598                                                                         
 FIRST AUTHOR                   | Liu C                                                                            
 DATE                           | 2016-04-25                                                                       
 INITIAL SAMPLE DESCRIPTION     | 71 cases with pancreatitis, 142 cases without pancreatitis                       
 REPLICATION SAMPLE DESCRIPTION | NA                                                                               
 STAGE                          | initial                                                                          
 NUMBER OF INDIVDUALS           | 213                                                                              
 BROAD ANCESTRAL CATEGORY       | NR, European, Hispanic or Latin American, African unspecified, Asian unspecified 
 COUNTRY OF ORIGIN              | NR                                                                               
 COUNTRY OF RECRUITMENT         | U.S.                                                                             
 ADDITONAL ANCESTRY DESCRIPTION | null

DSuveges commented 2 years ago

If we treat the BROAD ANCESTRAL CATEGORY as a comma-separated list of ancestries, the situation is manageable.
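
One detail to watch when splitting: some category labels contain commas inside parentheses (the Greater Middle Eastern row above). The following is a minimal sketch of the split, plus a European fraction that could feed a rule like the 5% heuristic discussed earlier. The function names and the choice to count multi-category rows as non-European are assumptions for illustration, not agreed rules.

```python
import re

# Split on commas that are not inside parentheses, so that
# "Greater Middle Eastern (Middle Eastern, North African or Persian)" stays whole.
_CATEGORY_SEP = re.compile(r",\s*(?![^()]*\))")


def split_ancestries(broad_category):
    """Turn a BROAD ANCESTRAL CATEGORY value into a list of ancestry labels."""
    return [part.strip() for part in _CATEGORY_SEP.split(broad_category) if part.strip()]


def european_fraction(rows):
    """rows: (number_of_individuals, broad_ancestral_category) pairs for one study and stage."""
    total = 0
    european = 0
    for n, category in rows:
        total += n
        # Assumption: only rows annotated with European alone count as European.
        # Rows listing several categories cannot be apportioned, so they count as non-European.
        if split_ancestries(category) == ["European"]:
            european += n
    return european / total if total else 0.0


# Rows copied from the two examples shown in this thread.
gcst90017131 = [
    (14306, "European"),
    (811, "East Asian"),
    (1097, "Hispanic or Latin American"),
    (114, "African American or Afro-Caribbean"),
    (481, "Greater Middle Eastern (Middle Eastern, North African or Persian)"),
    (1531, "NR, Other admixed ancestry"),
]
gcst003500 = [
    (213, "NR, European, Hispanic or Latin American, African unspecified, Asian unspecified"),
]
```

With these rows, `european_fraction(gcst90017131)` is 14306 / 18340, roughly 0.78, and `european_fraction(gcst003500)` is 0.0 because the single row mixes several categories.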

DSuveges commented 2 years ago

So the plan is:

When processing study-level metadata, capture sample descriptions and case/control numbers as granularly as possible to enable downstream decision making at the relevant step of the pipeline.
This metadata can be used to drop sumstats based on low sample sizes or other parameters (defined in the config).
Otherwise, if there's a harmonized summary statistics dataset on the GWAS Catalog FTP, we'll try to ingest it. No exclusion based on ancestry at this stage.
However, some filters can also be applied upon ingestion that render a summary statistics dataset ineligible, e.g. if certain fields are missing (such as effect). We can also decide to implement some quality metrics to detect uncontrolled inflation in the data.
Upon ingestion, some missing values, e.g. SE, can be computed from the p-value and effect size (this is already being done in v2d).
Downstream of the summary statistics ingestion, the fine-mapping step should be responsible for deciding which studies are fine-mapped. The logic should take into account the ancestry, required statistical fields etc. These criteria will be defined later.

d0choa commented 1 year ago

@DSuveges could you store this info somewhere else and close the ticket? I think the discussion phase is over

DSuveges commented 1 year ago

We can consider this effort closed, implementation is covered under #2892.

Decisions around summary stats ingestion #2733