greenelab / iscb-diversity-manuscript

Analysis of ISCB Fellows and Keynotes Reveals Disparities
https://greenelab.github.io/iscb-diversity-manuscript/

Confusing nationality with citizenship and religion #27

Open idoerg opened 4 years ago

idoerg commented 4 years ago

Seems like Figure 2 confounds citizenship with religion and nationality. Citizenship is a pretty clear term: there is a fairly straightforward legal definition of what citizenship is in each country.

Nationality is more vague: in the US, it is often confused with citizenship. But actually in the US, a US national may not be a US citizen.

In other countries, there are legal or common-law definitions of nationality. They vary, and they may not be post-enlightenment textbook history definitions. Many people identify themselves with their nationality first, and their citizenship second. In countries where a nationality equals a minority or majority equity issue, you may be missing out on a lot of equity issues this paper is supposed to highlight.

Celtic English: an ancestry, at best.

European: regional definition, losing considerable nuances of ethnicity, race, and nationality.

Hispanic: In the US this has discriminated-minority connotations, but it can include a variety of people, including those with Hispanic names that are common in former Spanish colonies in Africa?

East Asian: again, like European, a grab-all bag that does not really

Muslim: a religion that overlaps with all of the above (and below).

South Asian: again: Muslim names from this region, that includes the largest Muslim population in the world, would go to “Muslim”.

African: Sub-Saharan Africa is probably the most diverse region on earth -- genetically as well as ethnically -- lumped in one category.

Israeli: names in the example are all of Israeli Jews, mostly of certain diaspora origins. Israelis named Muhammad, Sergey, Adisu would go to the Muslim, European, and African categories, respectively.

Bottom line: not sure what to do, but don’t call it “nationality”. Perhaps “Rough historical name groupings”.

trangdata commented 4 years ago

Thank you for opening the issue! I agree that some regional names we used here to mean a group of countries may lead to confusion (definitely in the case of Muslim), but I do want to clarify that we used similar terminology as described in this paper. Figure 5 shows their 39-leaf nationality taxonomy similar to what we used as our categories. Our specific country to region mapping can be found in this online file country_to_region.tsv.

@arielah Should we write "Jewish" here instead of "Israeli"?

Nonetheless, I agree with you that "nationality" may be a confusing term. Perhaps "Estimation of name origins" would be more proper?

dhimmel commented 4 years ago

Thanks @idoerg for the feedback. I wanted to embed the figure from http://www.name-prism.com/about that we based these categories on:

image

Here is the portion of the manuscript describing how we extracted a country for each living person on Wikipedia:

To generate a training dataset for nationality prediction, we scraped the English Wikipedia’s category of Living People, which contained approximately 930,000 pages at the time of processing in November 2019. This category reflects a modern naming landscape. It is regularly curated and allowed us to avoid pages related to non-persons. For each Wikipedia page, we used two strategies to find a full birth name and nationality for that person. First, we used information from the personal details sidebar; the information in this sidebar varied widely but often contained a full name and a place of birth. Second, in the body of the text of most English-language biographical Wikipedia pages, the first sentence usually begins with, for example, “John Edward Smith (born 1 January 1970) is an American novelist known for …” We used regular expressions to parse out the person’s name from this structure and checked that the expression after “is a” matched a list of possible nationalities.

So I bolded the two strategies. The first detects place of birth for a name. The second seems to detect nationality/citizenship (or whatever Wikipedia curators consider to be the person's primary country). @arielah is that correct?
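For readers following along, here is a minimal sketch of what the second strategy might look like. It is not the code used in the manuscript: the pattern, the `parse_first_sentence` helper, and the tiny `NATIONALITIES` list are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny list; the list actually used for the
# manuscript is not shown in this thread.
NATIONALITIES = {"American", "Israeli", "Danish", "Indian"}

# Matches e.g. "John Edward Smith (born 1 January 1970) is an American novelist ..."
FIRST_SENTENCE = re.compile(
    r"^(?P<name>[^()]+?)\s*\(born [^)]*\)\s+is an? (?P<adjective>[A-Z][a-z]+)\b"
)


def parse_first_sentence(sentence):
    """Return (name, adjective), or None if the pattern or the list check fails."""
    match = FIRST_SENTENCE.match(sentence)
    if match is None:
        return None
    adjective = match.group("adjective")
    # Keep the hit only if the word after "is a"/"is an" is a known nationality.
    if adjective not in NATIONALITIES:
        return None
    return match.group("name").strip(), adjective
```

Note that a pattern like this captures whatever adjective Wikipedia editors chose for the opening sentence, which is why it is closer to "nationality as interpreted by editors" than to place of birth.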

We are not tied to using the Name-Prism region hierarchy. So if there is a better way to group countries, we could consider that. We can also consider changing our terminology. @cgreene maybe we can collect additional feedback and take a bit of time to research this topic further. And @idoerg, of course, any additional feedback you provide is greatly appreciated!

I think that we want to stay with an approach that is based on inferring countries from names, because that is what the Wikipedia dataset supports. So we should update our language and analyses as needed to reflect this.

Nonetheless, I agree with you that "nationality" may be a confusing term. Perhaps "Estimation of name origins" would be more proper?

I like "Estimation of name origins", but am not really an expert on whether that would also have misleading connotations.

idoerg commented 4 years ago

I think that we want to stay with an approach that is based on inferring countries from names, because that is what the Wikipedia dataset supports.

The problem here is threefold: (1) confusing citizenship (a legal concept) with nationality (a mostly social concept, and overloaded with different, sometimes contradictory meanings in different, um, nationalities); (2) inferring citizenship from name, which is problematic at best, especially in countries with large immigrant populations and/or large ethnic diversity (India has the third-largest Muslim population in the world, yet it is not a Muslim-majority country); and (3) doing so with scientists, who tend to have a large representation of immigrants/expats. E.g., many of the Muslim and Israeli names you put up there belong to American citizens/residents. I don't think the dataset supports that.

@arielah Should we write "Jewish" here instead of "Israeli"?

Probably not. The names you gave are (mostly modern) Hebrew names, if anything. Hebrew names are a subset of Jewish names (again, many of which can be misclassified as European, Muslim, African, etc.). Diasporic ethnic minorities are a problem to classify geographically, due to being, well, dispersed.

idoerg commented 4 years ago

Nonetheless, I agree with you that "nationality" may be a confusing term. Perhaps "Estimation of name origins" would be more proper?

I would say "name etymology".

idoerg commented 4 years ago

BTW, if it is geographical information you want, just use the geographic information in the MeSH headings or in the author affiliation.

cgreene commented 4 years ago

@idoerg : The goal of this first effort is to measure honor and authorship rates. I agree that what we are observing are differences by name etymology, which is a more precise phrasing.

The long-term objective would be to understand the reasons behind disparities in invitation rates. In this case, we might want to know if scientists within certain geographic regions (say, the US and Europe) by affiliations but with predictions denoting a high confidence of East Asian name etymology are also honored at lower rates or if the disparities arise from geographic bias from the organizations doing the honoring.

Thank you for your comment - it has been really helpful in clarifying my thinking on this. I propose that we switch to the "name etymology" term now and more fully lay out potential future avenues of research that would get at the underlying disparities more precisely towards the end of the results or the start of the conclusions.

dhimmel commented 4 years ago

Looking some more into the term "nationality", I am starting to think that it is the least inaccurate word for what we're extracting from Wikipedia (a mix of place of birth and country adjectives).

From https://www.merriam-webster.com/dictionary/nationality

image

From https://en.wikipedia.org/wiki/Nationality

Nationality is a legal relationship between an individual person and a state. Nationality affords the state jurisdiction over the person and affords the person the protection of the state. What these rights and duties are varies from state to state.

I see how collapsing nations into the Name-Prism categories, which are labeled by things such as religion, creates confusion and is a leap from nationality.

Nationality is more vague: in the US, it is often confused with citizenship. But actually in the US, a US national may not be a US citizen.

I think we need to be clear that we're using the Wikipedia country extraction as a proxy for nationality. It's not an exact match of nationality, but it seems like we are assigning the correct nationality to the overwhelming majority of Wikipedia names, if we are to go off of the definitions above. Thoughts?

cgreene commented 4 years ago

@dhimmel : I am not sure that we want to look specifically at nationality with our analysis. If there is a bias against honoring scientists with a family history in a country within a grouping I think we would want to detect that, even if it is not due to current nationality.

I agree that the Name Prism categories are a large leap from nationality.

cgreene commented 4 years ago

From reading more of the Wikipedia documentation, I agree with the comments that what we have at our disposal is what Wikipedia editors interpret to be nationality. We need to increase the specificity of how we describe this.

idoerg commented 4 years ago

I disagree with @dhimmel. Nationality is probably the most inaccurate wording you can use, given that there are 5 definitions in MW, some contradictory.

The image @dhimmel has shown is the exact confusion of nationality as synonymous with citizenship. Go with 5:

image

So I looked a bit deeper into the labeling table you were using, country_to_region.tsv

This made me chuckle:

image

Not sure why "Italian" is there?

So I looked a bit more into this table, and many things there seem to be patently wrong:

1) Israel is a geographic region of its own. We are talking about a country smaller than New Jersey, and it constitutes a whole geographical region? No other single country in that table comprises a region. I believe that is one reason the Israel overrepresentation bias came up in Figure 4. If you had New Jersey as its own geographical region, you would probably have New Jersey over-represented as well (and people would wonder why you selected New Jersey as a geographical region of its own).

2) Kazakhstan and Afghanistan are part of a "Muslim" region? But Indonesia (the largest Muslim country) and Malaysia are not? India is not, even though it has the third largest Muslim population in the world?

3) Also, Kazakhstan and Afghanistan are part of Greater Africa. Not sure how they got there. Those are Central Asian countries.

These are not anomalies: many, probably most, entries in country_to_region.tsv are wrong, some are completely arbitrary, and I am not sure what this table is trying to represent. Which, finally, explains to me why the categories in Table 1 are so confusing.
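A claim like "no other single country comprises a region" is easy to check mechanically. Below is a rough sketch of such a check. The column names (`country`, `region`) and the sample rows are assumptions for illustration and are not copied from country_to_region.tsv.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt: column names and rows are made up for illustration
# and are NOT taken from country_to_region.tsv.
SAMPLE_TSV = (
    "country\tregion\n"
    "Denmark\tEuropean\n"
    "France\tEuropean\n"
    "Israel\tIsraeli\n"
    "Japan\tEast Asian\n"
    "China\tEast Asian\n"
)


def single_country_regions(tsv_text, country_col="country", region_col="region"):
    """Map each region that contains exactly one country to that country."""
    countries_by_region = defaultdict(set)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        countries_by_region[row[region_col]].add(row[country_col])
    return {
        region: next(iter(countries))
        for region, countries in countries_by_region.items()
        if len(countries) == 1
    }
```

On the made-up sample above, `single_country_regions(SAMPLE_TSV)` returns only the `Israeli` region. Running the same function on the real file would need the actual column names passed in.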

cgreene commented 4 years ago

There are a few things to address here. We are continuing to make revisions to both the figures and the text. The categories on the rightmost column that you are referring to are not used, so we will remove those for clarity.

I did want to briefly address:

Israel is a geographic region of its own. We are talking about a country smaller than New Jersey, and it constitutes a whole geographical region? No other single country in that table comprises a region. I believe that is one reason the Israel overrepresentation bias came up in Figure 4. If you had New Jersey as its own geographical region, you would probably have New Jersey over-represented as well (and people would wonder why you selected New Jersey as a geographical region of its own).

We performed the PubMed analysis with the exact concern about what would happen if a classification call was particularly inaccurate. We don't see the "New Jersey" effect that you propose there.

idoerg commented 4 years ago

We performed the PubMed analysis with the exact concern about what would happen if a classification call was particularly inaccurate. We don't see the "New Jersey" effect that you propose there.

Probably because you did not train on New Jersey names :)

Do this:

  1. Fold Israel into Europe (or Asia, or Africa, or Muslim).

  2. Make Denmark its own region, separate from Europe.

  3. Train on Danish names

Two things will happen:

  1. Israel will not be over-represented anymore, because it is not a region of its own.

  2. Denmark may be over-represented.

Repeat with the Netherlands, Germany, etc.

Then take all Nordic countries together. Use Nordic names.

The specific issue is that a regional classification where any single country, especially a small one, comprises its own region, while all the other countries get rolled into multi-country regions, doesn't make sense to me.

The bigger picture here is that any regional division will probably yield different results. What if you separated Europe into Nordic and everything else? Germanic and everything else? EMBO? Why is there Hispanic and "Celtic English" but no Francophone (which would include, in addition to France and Switzerland, Quebec, New Orleans, and countries that are now lumped in Africa or East Asia)? How about separating Japan from the rest of East Asia?
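The regrouping part of this experiment could be sketched roughly as follows. This covers only the relabeling and counting steps, not retraining a name classifier, and every name and data value here (`regroup`, `region_counts`, `BASELINE`, `SAMPLE`) is a hypothetical placeholder rather than anything from the manuscript's code.

```python
from collections import Counter

# Toy mapping and toy per-person country labels, invented for illustration only.
BASELINE = {"Israel": "Israeli", "Denmark": "European", "France": "European"}
SAMPLE = ["Israel", "Denmark", "France", "France"]


def regroup(country_to_region, overrides):
    """Return a copy of the mapping with some countries moved to other regions."""
    regrouped = dict(country_to_region)
    regrouped.update(overrides)
    return regrouped


def region_counts(countries, country_to_region):
    """Count how many entries fall into each region under a given mapping."""
    return Counter(country_to_region[country] for country in countries)


# Fold Israel into Europe and make Denmark its own region.
ALTERNATIVE = regroup(BASELINE, {"Israel": "European", "Denmark": "Danish"})
```

Comparing `region_counts(SAMPLE, BASELINE)` with `region_counts(SAMPLE, ALTERNATIVE)` shows how the same people land in different buckets depending on where the region boundaries are drawn, which is the sensitivity being questioned here.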

trangdata commented 4 years ago

Acknowledging that the names of the groups of countries were not appropriate, as some were by country, some were by region, and others were by religion, we have retained the data-driven groupings but selected more appropriate names for the name origin groups. We also performed an analysis of author affiliations as suggested by @idoerg and the reviewers to detect the affiliated countries for authors and honorees (see #35, #36 and #87). This is a major improvement of the study because, in the past work, an author's affiliation and name origin have some chance of being interlinked. Now we can directly examine geographic discrepancies. Within the most represented country (the US), we also examined differences by name origin to remove the geographic confounder. We found that both components (geography and name origin) still play a role.