Closed DSuveges closed 1 year ago
Commenting in case it's helpful (I described some of the above steps for manually checking study ancestry)...
My heuristic was that if more than 5% of the study sample was non-EUR, I manually excluded the study from sumstat ingestion. Defining the ancestry composition is not trivial, because the GWAS Catalog provides it only as a free-text field: it might say "170,911 European ancestry individuals", or it might say "99 Italian ancestry cases, 359 Italian ancestry smoker controls". A few rules could capture most cases, but some examples will probably fail to map in an automated pipeline. In addition, I think that applying fine-mapping as we do now, with a UKB reference panel for general EUR samples, is questionable. My impression is that the field is moving away from this, on the grounds that fine-mapping with out-of-sample LD is dangerous (it will give you false associations).
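To illustrate the "few rules" idea for the free-text field, here is a minimal sketch of one such rule. The function name and the regular expression are hypothetical and not part of any existing pipeline; the pattern only recognises phrases of the form "<count> <label> ancestry".

```python
import re

# Hypothetical rule: "<count> <label> ancestry ...", where the count may
# contain thousands separators (e.g. "170,911").
_ANCESTRY_PATTERN = re.compile(r"(\d[\d,]*)\s+([A-Za-z ]+?)\s+ancestry\b")


def parse_sample_description(text):
    """Return (count, ancestry label) pairs found in a sample description."""
    return [
        (int(match.group(1).replace(",", "")), match.group(2))
        for match in _ANCESTRY_PATTERN.finditer(text)
    ]


print(parse_sample_description("170,911 European ancestry individuals"))
print(parse_sample_description(
    "99 Italian ancestry cases, 359 Italian ancestry smoker controls"
))
```

A phrase such as "1,097 Hispanic individuals" does not contain the word "ancestry", so this rule skips it. That is the kind of description that would fail to map automatically and would need either more rules or manual review.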
We should probably have a minimum total sample size limit, but it's hard to know what it should be. An alternative would be to compute a genomic inflation factor and exclude studies based on that, which would be quite straightforward. This could automatically exclude some problematic studies, such as those with a highly imbalanced case/control ratio that don't adjust for it properly. But if the study has adjusted for case/control imbalance, then there's no reason in principle why it should be excluded.
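For reference, a minimal sketch of the genomic inflation factor computed from a study's p-values, using the standard definition (median observed chi-square statistic divided by the median of a 1-df chi-square distribution, about 0.4549). The function name is hypothetical, and no exclusion threshold is implied here; choosing that cutoff is still an open decision.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation_factor(p_values):
    """Lambda GC: median observed chi-square (1 df) over the expected median."""
    normal = NormalDist()
    # A two-sided p-value maps back to a z-score via the normal quantile
    # function; the squared z-score is a 1-df chi-square statistic.
    chi2 = [normal.inv_cdf(p / 2.0) ** 2 for p in p_values if 0.0 < p <= 1.0]
    if not chi2:
        raise ValueError("no valid p-values in (0, 1]")
    return median(chi2) / CHI2_1DF_MEDIAN


# Uniformly spread p-values (the null expectation) give a lambda close to 1.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
print(round(genomic_inflation_factor(null_p), 3))
```

In practice this would be computed over all variants in the summary statistics (or from `beta / se` z-scores directly), and values well above 1 would flag a study for exclusion or review.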
This is a good point. Our pipelines currently run fine-mapping on all ingested sumstats, i.e. we don't distinguish sumstat studies that should only go into PheWAS from those that should also have fine-mapping/coloc. It would be good to ingest sumstats from studies with different ancestries, but not to run fine-mapping on those.
Hi @DSuveges, I can see that Jeremy already responded to the questions and I agree with all his reasoning. Thanks Jeremy! This could only change if we get hold of non-European reference panels that enable us to perform downstream work on GWAS (i.e. FM). Given that you raised comment number 3, I think we should aim to ingest all sumstats (including mixed ancestries and non-Europeans). These will go into PheWAS but not down the fine-mapping/coloc path. Happy to discuss this more in person if needed.
Another important note, not directly related but good to have in the new layout of the portal: the curated studies from the GWAS Catalog should also feed into PheWAS.
Hi @Jeremy37, thanks for the insights! This is indeed a hard nut to crack.
I was planning to use the downloadable ancestry file from the GWAS Catalog, which annotates ancestries in a more structured way (when available from the publication to curate). The data looks like this:
+---------------+-------+--------------------+-----------------------------------------------------------------+---------------------------------------------------------------------------+
|STUDY ACCESSION|STAGE |NUMBER OF INDIVDUALS|BROAD ANCESTRAL CATEGORY |COUNTRY OF RECRUITMENT |
+---------------+-------+--------------------+-----------------------------------------------------------------+---------------------------------------------------------------------------+
|GCST90017131 |initial|14306 |European |Canada, Netherlands, Sweden, U.S., Belgium, Finland, Denmark, U.K., Germany|
|GCST90017131 |initial|811 |East Asian |Republic of Korea |
|GCST90017131 |initial|1097 |Hispanic or Latin American |U.S. |
|GCST90017131 |initial|114 |African American or Afro-Caribbean |U.S. |
|GCST90017131 |initial|481 |Greater Middle Eastern (Middle Eastern, North African or Persian)|Israel |
|GCST90017131 |initial|1531 |NR, Other admixed ancestry |Canada, Netherlands |
+---------------+-------+--------------------+-----------------------------------------------------------------+---------------------------------------------------------------------------+
While the initial sample description of the same study looks like this:
14,306 European ancestry individuals, 811 Korean ancestry individuals, 1,097 Hispanic individuals, 114 African American individuals, 481 Middle Eastern ancestry individuals, 1,531 unknown and other admixed ancestry individuals
So curators assign the broader ancestry category irrespective of the country of recruitment. This value is always filled. The caveat is that these categories are only split if the authors provide a granular description and sample sizes. Otherwise, cases like this can happen:
-RECORD 0----------------------------------------------------------------------------------------------------------
STUDY ACCESSION | GCST003500
PUBMEDID | 27114598
FIRST AUTHOR | Liu C
DATE | 2016-04-25
INITIAL SAMPLE DESCRIPTION | 71 cases with pancreatitis, 142 cases without pancreatitis
REPLICATION SAMPLE DESCRIPTION | NA
STAGE | initial
NUMBER OF INDIVDUALS | 213
BROAD ANCESTRAL CATEGORY | NR, European, Hispanic or Latin American, African unspecified, Asian unspecified
COUNTRY OF ORIGIN | NR
COUNTRY OF RECRUITMENT | U.S.
ADDITONAL ANCESTRY DESCRIPTION | null
If we treat the BROAD ANCESTRAL CATEGORY as a comma-separated list of ancestries, the situation is manageable.
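A minimal sketch of that parsing, combined with the 5% non-EUR heuristic mentioned earlier in the thread. The function names and the input layout (a list of `(number of individuals, broad ancestral category)` pairs per study) are assumptions for illustration. Two details are worth noting: some category names contain commas inside parentheses, so the split has to ignore those, and a row listing several categories cannot be apportioned, so here it is conservatively counted as non-European.

```python
MAX_NON_EUR_FRACTION = 0.05  # the 5% heuristic from earlier in the thread


def split_categories(field):
    """Split a BROAD ANCESTRAL CATEGORY value on commas outside parentheses."""
    parts, current, depth = [], [], 0
    for char in field:
        if char == "(":
            depth += 1
        elif char == ")":
            depth = max(depth - 1, 0)
        if char == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())
    return [part for part in parts if part]


def non_european_fraction(rows):
    """Fraction of individuals in rows not annotated as exclusively European."""
    total = sum(count for count, _ in rows)
    european = sum(
        count for count, field in rows if split_categories(field) == ["European"]
    )
    return (total - european) / total


gcst90017131 = [
    (14306, "European"),
    (811, "East Asian"),
    (1097, "Hispanic or Latin American"),
    (114, "African American or Afro-Caribbean"),
    (481, "Greater Middle Eastern (Middle Eastern, North African or Persian)"),
    (1531, "NR, Other admixed ancestry"),
]
fraction = non_european_fraction(gcst90017131)
print(round(fraction, 3), fraction > MAX_NON_EUR_FRACTION)
```

With this rule, GCST003500 (a single row of 213 individuals listing five categories) would also be treated as entirely non-European, because the European share of that row is unknown.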
So the plan is:
@DSuveges could you store this info somewhere else and close the ticket? I think the discussion phase is over
We can consider this effort closed; implementation is covered under #2892.
We have just realised that the rules to ingest summary stats in the pipeline have not been implemented. The README of the sumstats ingestion says:
To decrease manual intervention needed for operating the release pipeline, the classification of studies needs to be fully automated. However the rules are not yet solidified. We need decisions in the following points:
@MayaGhoussaini, what are your views on these questions?