greenelab / iscb-diversity-manuscript

Analysis of ISCB Fellows and Keynotes Reveals Disparities
https://greenelab.github.io/iscb-diversity-manuscript/

R4: nameprism groupings problematic #68

Closed cgreene closed 4 years ago

cgreene commented 4 years ago
  • Figure 4 and the category selection for the analysis are highly problematic. I understand that the categories were selected according to NamePrism, but it is ultimately the responsibility of the authors to justify choices. What is compared is a continent (Europe, Africa), a religion (Muslim), a country (Israel), a part of the continent (East Asian, South Asian), a group of different races or ancestries that spans continents (Hispanic), and then there are Celtic English folk. This analysis needs a substantial revision to the broader and more parallel categories, even compared to previous Figures in the manuscript. The genetics community has been successful discussing continental and subcontinental populations though the problem here is to infer those from names that reflect many things. Singling out Israel as overrepresented sends a tricky message when listed right next to Muslim as underrepresented. It surprised me to see this insensitivity in a study that is meant to assess where we are as a community and to promote sensitivities.
trangdata commented 4 years ago

I know that we switched to letter grouping, but @arielah and I also discussed the alternative of using a different grouping (e.g., the World Bank analytical grouping from rnaturalearth), which is more geography-based but still rather arbitrary. Do you think it would help? @cgreene @dhimmel I'm also okay with leaving it as is.

trangdata commented 4 years ago

relates to #38, #45

cgreene commented 4 years ago

From my read of the NamePrism paper, the methodology that they used makes more sense than geographic groupings. They construct embeddings based on contact chains. They use these embeddings to find similarities at the country level (see 4.3.2). This evidence seems to support the taxonomy.
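To make the idea of country-level similarity concrete, here is a minimal sketch of grouping countries by the similarity of their name embeddings. It is not NamePrism's actual pipeline: the toy vectors, the placeholder names (`name_1`, `country_A`, ...), the centroid averaging, the single-linkage grouping, and the `threshold` value are all assumptions made purely for illustration.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def country_centroids(name_vectors, name_country):
    """Average the embeddings of all names assigned to each country."""
    by_country = defaultdict(list)
    for name, vec in name_vectors.items():
        by_country[name_country[name]].append(vec)
    return {
        country: [sum(col) / len(vecs) for col in zip(*vecs)]
        for country, vecs in by_country.items()
    }


def group_countries(centroids, threshold=0.8):
    """Single-linkage grouping: countries whose centroids have cosine
    similarity >= threshold end up in the same group (union-find)."""
    countries = sorted(centroids)
    parent = {c: c for c in countries}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path compression
            c = parent[c]
        return c

    for i, a in enumerate(countries):
        for b in countries[i + 1:]:
            if cosine(centroids[a], centroids[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for c in countries:
        groups[find(c)].append(c)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy data: placeholder names and 2-d embeddings, not real results.
name_vectors = {
    "name_1": [1.0, 0.0],
    "name_2": [0.9, 0.1],
    "name_3": [0.0, 1.0],
    "name_4": [0.1, 0.9],
}
name_country = {
    "name_1": "country_A",
    "name_2": "country_B",
    "name_3": "country_C",
    "name_4": "country_D",
}

groups = group_countries(country_centroids(name_vectors, name_country))
print(groups)
```

The point of the sketch is that the groups fall out of similarity in the embedding space rather than from a map, which is why a data-driven taxonomy can disagree with geographic groupings.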

Some of the results that I initially found odd because they contrast with geography (e.g., Bangladesh not being in SE Asia but instead among those countries with Arabic naming traditions) seem to be backed up, at least in part, by other information about naming traditions: https://en.wikipedia.org/wiki/Bengali_name

Name origins are fundamentally an individual-level property. Because of the limitations of the source data in Wikipedia, the best resolution we can get for classifying modern naming traditions is the country level. For a grouping of countries by naming traditions, I haven't seen anything better than NamePrism. I wouldn't use their names for the groupings, which appear to be arbitrary (e.g., what should probably be called Arabic is called Muslim). However, I haven't seen anything better than the groupings themselves from the point of view of the research question around name origins.