Ethnicity - Githubissues

heikkil commented 4 years ago

As discussed in last night's GA4GH call, I'd like to propose opening up discussion on how to include ethnicity in phenopackets 2.0.

My impression from the discussion was that a simple enumeration of values would not be enough. The problem is too complex. My opinion is that enumeration is great for simple, clear-cut cases, but for complex issues the task of finding the solution should be moved outside phenopackets, i.e. by allowing the use of ontology IDs.

heikkil commented 4 years ago

I dug up a 20-year-old historical reference to ethnicity in databases: https://web.archive.org/web/20000830140751/http://www.ebi.ac.uk/mutations/recommendations/population.html

Hopefully we can do better now. ;)

julesjacobsen commented 4 years ago

This was a topic of the GA4GH Pedigree working group and, alas, we didn't make great progress apart from acknowledging that nothing really worked. There was general agreement that the HANCESTRO and H3 Africa terminologies were the best suited, depending on your use case.

pnrobinson commented 3 years ago

Would

repeated OntologyTerm population_group

be appropriate? There, users could add ontology terms, e.g. from here (https://www.ebi.ac.uk/ols/ontologies/ncit/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FNCIT_C17005&viewMode=All&siblings=false), representing race/ethnicity.
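To make the proposal concrete, here is a minimal sketch of what a subject record carrying a repeated population_group field might look like once serialized. The field name comes from the proposal above; the record layout, the helper `ontology_term`, and the `EXAMPLE:` identifiers are assumptions for illustration only and are not real ontology terms or part of any released schema.

```python
import json


def ontology_term(term_id, label):
    """Build a minimal ontology term as an id/label pair (hypothetical helper)."""
    return {"id": term_id, "label": label}


# Hypothetical subject record. "population_group" is the proposed repeated
# field; the EXAMPLE: identifiers are placeholders, not real ontology IDs.
subject = {
    "id": "patient-1",
    "population_group": [
        ontology_term("EXAMPLE:0000001", "placeholder group A"),
        ontology_term("EXAMPLE:0000002", "placeholder group B"),
    ],
}

print(json.dumps(subject, indent=2))
```

Because the field is repeated, a record can carry zero, one, or several terms, possibly from different ontologies.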

heikkil commented 3 years ago

It could be one of the recommended ontologies to use, although there is a danger of confusing people, as it really is a mixed bag of all kinds of groupings.

heikkil commented 3 years ago

There is very little legal basis or international agreement for defining population groups in wider society outside genetics. However, I found this:

Understanding the Indigenous and Tribal People Convention, 1989 (No. 169). Handbook for ILO Tripartite Constituents / International Labour standards Department. International Labour Organization. – Geneva, 2013. ISBN 978-92-2-126243-5 https://www.ilo.org/wcmsp5/groups/public/---ed_norm/---normes/documents/publication/wcms_205225.pdf

It seems to be the only reasonable international take on ethnicity. In this convention, tribal peoples are defined by two kinds of criteria, objective and subjective, but the defining principle for belonging to one is self-identification alone.

Since genetics does not have anything to do with this most widely applicable definition of ethnicity, we need two concepts:

A subjective, self-declared criterion
A combination of all objective criteria that can be used to estimate genetic ancestry

The terms could come from the same ontologies, but the criteria for choosing them need to be different, in analogy to sex and gender. The words used could be, for example, "ethnicity" and "population", but the distinction between them should be clearly spelled out in the definition.
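As a rough sketch of the two-concept idea, the following keeps the self-declared value and the estimated value in separate slots that share one term type. The class names `OntologyTerm` and `PopulationInfo`, the field names, and the `EXAMPLE:` identifier are all assumptions made for illustration; nothing here is taken from the phenopacket schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class OntologyTerm:
    """An ontology term as an id/label pair (hypothetical type)."""
    id: str
    label: str


@dataclass
class PopulationInfo:
    # Self-declared by the individual; never inferred from data.
    ethnicity: List[OntologyTerm] = field(default_factory=list)
    # Estimated from objective criteria, e.g. genetic markers.
    population: List[OntologyTerm] = field(default_factory=list)


# Only the self-declared slot is filled; the estimate stays empty until
# an analysis actually produces one.
info = PopulationInfo(
    ethnicity=[OntologyTerm("EXAMPLE:0000001", "placeholder self-declared group")],
)
print(info)
```

Keeping the slots separate means a consumer never has to guess whether a term was stated by the person or derived from an analysis.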

pnrobinson commented 3 years ago

@heikkil the NCIT terms are only intended to be an example of a typical use case. The phenopacket standard does not require that any particular ontology be used for any of the slots, but for the development and documentation, it is good to show at least one example that would work for at least some of the likely use cases. The topic of ethnicity and genetic ancestry is of course very complicated, and I think a detailed treatment is outside the scope of phenopackets. It is important for some genetic analyses, so perhaps this is something for a future GA4GH workgroup.

julesjacobsen commented 3 years ago

Can we defer this to the pedigree working group? They already had a spirited discussion about this and have some guidance in their document https://docs.google.com/document/d/1UAtSLBEQ_7ePRLvDPRpoFpiXnl6VQEJXL2eQByEmfGY

Several REA ontologies exist, each designed with a specific need and philosophy. Data collection tools should allow for multiple values. Many sites will be bound by local requirements. (See more on this topic in A.3 below.)

The Human Ancestry Ontology (HANCESTRO) provides a systematic description of the ancestry concepts used in the NHGRI-EBI Catalog of published genome-wide association studies.

The ClinGen Ancestry & Diversity Working Group is developing standards and guidelines for clinical genetics about the interpretation, collection, and use of REA.

To quote the current state of appendix A.3

A.3. Issues on capturing Race, Ethnicity, or Ancestry

We cannot make a recommendation on how to use Race, Ethnicity, or Ancestry at this time. After speaking with several organizations and experts on this question, we learned there exist differing needs and use cases that have resulted in multiple REA ontologies. At least one ontology uses an ethnicity-language combination in identifying ethnolinguistic tribal affiliation. Identification codes for the various ontologies should be created. Another helpful FHIR resource could be Extension.extension:Source, which could capture in what way the ancestry was reported (e.g. Patient reported, Genetic test/markers, Other, Unknown).
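The four reporting sources listed above could be captured as a small closed set of values. The sketch below is illustrative only: the enum `AncestrySource` and the helper `parse_source` are hypothetical names, and this is not a FHIR API or value set.

```python
from enum import Enum


class AncestrySource(Enum):
    """How the ancestry value was obtained (hypothetical enum)."""
    PATIENT_REPORTED = "Patient reported"
    GENETIC_TEST_MARKERS = "Genetic test/markers"
    OTHER = "Other"
    UNKNOWN = "Unknown"


def parse_source(text):
    """Map free text onto AncestrySource, falling back to UNKNOWN."""
    normalized = text.strip().lower()
    for source in AncestrySource:
        if source.value.lower() == normalized:
            return source
    return AncestrySource.UNKNOWN


print(parse_source(" patient reported "))
print(parse_source("n/a"))
```

Falling back to UNKNOWN rather than raising is a design choice for lenient data intake; a stricter pipeline might prefer to reject unrecognized values.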

A recent paper published in The American Journal of Human Genetics titled "Clinical genetics lacks standard definitions and protocols for the collection and use of diversity measures" states “... there was no consensus on the relevance of REA, including how each of these measures should be used in different scenarios and what information they can convey in the context of human genetics. A lack of common definitions and applications of REA across the precision medicine pipeline may contribute to inconsistencies in data collection, missing or inaccurate classifications, and misleading or inconclusive results.”

Many in the research world use gnomAD and ancestry informative markers (AIM) to infer ancestry. The Genome Aggregation Database (gnomAD) Consortium sponsored by the publication Nature, and the Broad Institute gnomAD browser includes exomes and genomes from European, Latino African and African American, South Asian, East Asian, Ashkenazi Jewish and other populations. They assigned ancestry to all samples for which the probability of that ancestry is > 90% according to the random forest model. All other samples were assigned the other ancestry (oth).
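The assignment rule described above (keep an ancestry label only when its probability exceeds 90%, otherwise use "oth") can be sketched as follows. This is not gnomAD's code: the function name `assign_ancestry`, the input format (a mapping from label to classifier probability), and the labels `pop_a` and `pop_b` are assumptions for illustration.

```python
def assign_ancestry(probabilities, threshold=0.9):
    """Return the most probable label if its probability is strictly above
    the threshold; otherwise return "oth".

    `probabilities` maps a population label to a classifier probability.
    """
    if not probabilities:
        return "oth"
    label, p = max(probabilities.items(), key=lambda kv: kv[1])
    return label if p > threshold else "oth"


# Hypothetical classifier outputs for two samples.
print(assign_ancestry({"pop_a": 0.95, "pop_b": 0.05}))
print(assign_ancestry({"pop_a": 0.60, "pop_b": 0.40}))
```

Under this rule, samples with mixed or uncertain classifier output all fall into the single catch-all label.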

Concerning the use of AIM, according to the National Human Genome Research Institute “Ancestry informative markers refers to locations in the genome that have varied sequences at that location and the relative abundance of those markers differs based on the continent from which individuals can trace their ancestry. So by using a series of these ancestry informative markers, sometimes 20 or 30 or more, and genotyping an individual you can determine from the frequency of those markers where their great, great, great, great ancestors may have come from. These are generally resolved to the three major continents: Africa, Asia, and Europe.”

Also see -

US Food and Drug Administration (FDA; contains nonbinding recommendations): Collection of Race and Ethnicity Data in Clinical Trials

US Centers For Disease Control and Prevention (CDC) Race Category Value Set

US Core Implementation Guide (v3.1.1: STU 3) based on FHIR Release 4 Detailed Race Value Set and Race & Ethnicity - CDC (mostly a list of countries and hundreds of US native indian tribes).

julesjacobsen commented 3 years ago

@heikkil I'm going to close this request for now. Having discussed this with the group this evening, we agreed that there are too many issues around this to know how to usefully incorporate it into the schema right now. I'd suggest we revisit this once the ClinGen Ancestry & Diversity Working Group has come up with some recommendations on how to represent this for use in genomic analysis.

julesjacobsen commented 2 years ago

Re-opening given discussion with the Pedigree group on 2022-03-03. This field is useful/required for many test ordering systems. The codes used will likely be nation-specific and not necessarily universally applicable.

peupeubangbang commented 2 years ago

Hi @julesjacobsen, just picking up this thread with some use cases in the Australian context for having this info in a phenopacket.

We use ancestry as a term in our research questionnaires in studies of 'normal' phenotypic facial variation (i.e. not related to a syndrome) in 3D images, so that we can generate phenotypic reference ranges that account for normal phenotypic differences between diverse people. We are working on a phenopacket export option for our 3D analysis software, since this information is better stored in a phenopacket anyway: measurements outside the normal range will produce HPO terms. I would also like to be able to add ancestry/ethnicity information to this export.

Identifying Indigenous Australians in genomic datasets is very important: we lack GWAS data for Indigenous Australians as well as suitable genomic reference ranges, so identifying pathogenic variants/genes is more difficult. Also, ancestry/ethnicity information is not always available to our research partners, nor are we always permitted to share it with the lab if it is research-based and outside our hospital networks/health system. Being able to push ancestry/ethnicity from the clinic to the research lab or external sequencing provider, along with everything else in a phenopacket, would therefore be very useful in our goal of improving genomic healthcare for Indigenous Australians.

phenopackets / phenopacket-schema

Ethnicity #231