iuni-cadre / Fellow4-measuring-dynamics

For the CADRE fellowship group Measuring and Modeling the Dynamics of Science Using the CADRE Platform
Gender identification of Asian names #2

Open XiaoranYan opened 4 years ago

XiaoranYan commented 4 years ago

On 1/15/20 3:20 PM, Jina Lee wrote:

Dear Dr. Yan,

It's nice to meet you virtually!
I am working on a project (PI Russell Funk and Co-PI Erin Leahey) about impacts of scientific articles and our team has been trying to code gender by using first names in the Web of Science data. We've utilized an R package for gender coding based on SSA, but found that it is very challenging to identify the gender of Asian names. Our team thus has been reaching out to people who might know of relevant data sources. If you have any suggestions regarding this issue, that would be very helpful for us.

Best,

Jina

On Wed, Jan 15, 2020 at 11:44 AM Hutchinson, Matthew Alexander maahutch@iu.edu wrote:

Hi Jina,

Could you send me your IU username? That way I can check whether your Carbonate account is active.

I’m not sure about a dataset for the gender coding of Asian names. I’m not aware of anything like that in our existing MAG and WoS data sets. I’ve cc’d one of our PIs, Xiaoran Yan, a research scientist working with the CADRE data, in case he is aware of a data set that would provide this information.

Thanks,

Matthew Hutchinson | INDIANA UNIVERSITY
Data Manager
IU Network Science Institute (IUNI)
1001 E SR 45/46 Bypass | Bloomington, IN 47408-1415
Email: maahutch@iu.edu | Phone: (812) 855-1404
Fax: (812) 856-1192

From: Jina Lee jinal@email.arizona.edu
Sent: Wednesday, January 15, 2020 11:46 AM
To: Hutchinson, Matthew Alexander maahutch@iu.edu
Subject: [External] Re: Access to Cadre Data Server

This message was sent from a non-IU address. Please exercise caution when clicking links or opening attachments from external sources.

Hi Matthew,

Hope you have started a great New Year.

I have a question about my Carbonate account. I created the account but haven't received an approval email, and I couldn't find relevant information on the website (https://kb.iu.edu/d/aolp) either. Would you please let me know what I should do to check the status?

I am also looking for a dataset for coding the gender of Asian names but don't know whom to contact. Is the Carbonate account necessary to ask data questions? Any help would be much appreciated.

Best,

Jina

XiaoranYan commented 4 years ago

Hi Jina,

> We've utilized an R package for gender coding based on SSA, but found that it is very challenging to identify the gender of Asian names.

I used that package a couple of years ago myself and found the same limitations. We had no choice but to limit our scope to English names in the end.

> Our team thus has been reaching out to people who might know of relevant data sources. If you have any suggestions regarding this issue, that would be very helpful for us.

Asian names are in general very challenging, even for disambiguation. Many people have resorted to manual labeling; here is an example: https://www.nature.com/news/1.14321#/supplementary-information

We are also very interested in any labeled data for disambiguation or gender detection purposes. If your team is interested, I am happy to organize a Zoom meeting with Vincent Larivière, who said data sharing might be possible. He "promised" me this a few years ago, but perhaps this time, with the CADRE community, we can make it happen. We would also appreciate any information you gather in your outreach effort.

Thanks!

Xiaoran

XiaoranYan commented 4 years ago

Hi Xiaoran,

Thank you again for your suggestion! I cc'd the team members here -- Russell Funk, Erin Leahey, Eugene Paik, Michael Park; we are all interested in the Zoom meeting with Vincent Larivière to learn more about gender coding. We would appreciate it if you could arrange the meeting.

Best,

Jina

---------- Forwarded message ---------
From: Jina Lee jinal@email.arizona.edu
Date: Wed, Jan 15, 2020 at 9:39 PM
Subject: Re: [External] Re: Access to Cadre Data Server
To: Xiaoran Yan yan30@iu.edu

Hi Xiaoran,

Thank you so much for your response and kind suggestion! That would be wonderful. I'll ask our team if they are interested.

Mary Kaltenberg told us about some dictionaries and articles her team drew on to code Asian names.

Hope this helps. I'll get back to you as soon as I get replies from our team.

Thanks again!

Jina

XiaoranYan commented 4 years ago

Hi all, Sure, I’d be happy to. I’m available this week Wed and Thursday AM. Cheers, V.

From: Yan, Xiaoran yan30@iu.edu
Sent: January 24, 2020 12:58 PM
To: Larivière Vincent vincent.lariviere@umontreal.ca
Cc: Russell Funk rfunk@umn.edu; Leahey, Erin E - (leahey) leahey@email.arizona.edu; Eugene Paik epaik@bus.olemiss.edu; Michael Park park1892@umn.edu; Jina Lee jinal@email.arizona.edu
Subject: Exploring collaboration and data sharing possibilities

Hi Vincent,

This is Xiaoran Yan from IU and our collaborators from the CADRE project.

https://cadre.iu.edu/fellows/measuring-and-modeling-the-dynamics-of-science-using-the-cadre-platform

Last time we met, I told you that we were building a data-sharing platform for science of science research. It is now called CADRE. If I remember correctly, you mentioned that it might be possible to share some of your prized author disambiguation dataset. As it turns out, there is also a lot of interest in your gender-labeled data from the following study: https://www.nature.com/news/1.14321#/supplementary-information

For more details of our previous discussion, you can refer to this GitHub issue: https://github.com/iuni-cadre/Fellow4-measuring-dynamics/issues/2

Besides our personal interest in these datasets, sharing them is also perfectly aligned with our data-sharing mission. If you are willing to explore the collaboration possibilities, the CADRE team will be happy to provide data infrastructure support and make this into a use case of open/private data sharing with reproducible results. Considering the technical details, it would be great if we could set up a Zoom meeting to start the discussion. Feel free to reply to this email if you have any questions.

Thank you and we look forward to hearing back from you!

Xiaoran Yan

XiaoranYan commented 4 years ago

Hi Erin,

That is great news! We can certainly help you get the affiliation data.

I can meet tomorrow at 12pm-1pm EST (2pm-3pm MST) to discuss the details.

Let me know if that works for you. I can set up a Zoom meeting.

Xiaoran

On 2/12/20 5:24 PM, Leahey, Erin E - (leahey) wrote:

Dear Xiaoran,

Thank you for connecting us with Vincent! We spoke last week, and he has kindly (already!) shared his data with us, so we're very grateful and excited.

In order to apply his refined and country-specific gender classification scheme, we need data with author affiliation (specifically country) for each author. I think this is most likely to be available for the WoS papers published 2008-present. My co-I Russ Funk (Minnesota) has not been working with the author affiliation field at all and doesn't have country parsed out, so he encouraged us to connect with you. Jina and I have no experience with SQL, but are very interested in working with you to obtain the data we need for this joint project with Russ.
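To illustrate why the country field matters here, the following is a minimal sketch of a country-aware first-name lookup. It is not Vincent's actual scheme, whose details are not described in this thread. The table contents, the `classify` function, and the fallback order are all assumptions made for illustration, and the entries are invented placeholders rather than values from any real dataset.

```python
from typing import Optional

# Hypothetical lookup table: (lowercased first name, country) -> gender label.
# These entries are invented placeholders, not values from a real dataset.
COUNTRY_SPECIFIC = {
    ("andrea", "Italy"): "M",
    ("andrea", "USA"): "F",
}

# Hypothetical fallback for names whose label does not depend on country.
GLOBAL = {"maria": "F"}


def classify(first_name: Optional[str], country: Optional[str]) -> str:
    """Return 'M', 'F', or 'unknown' for one author record."""
    if not first_name:
        return "unknown"
    key = first_name.strip().lower()
    # Empty strings and bare initials such as "J." carry no gender signal.
    if len(key.rstrip(".")) <= 1:
        return "unknown"
    # Prefer the country-specific label when the country is known.
    if country is not None:
        label = COUNTRY_SPECIFIC.get((key, country.strip()))
        if label is not None:
            return label
    # Otherwise fall back to the country-independent table.
    return GLOBAL.get(key, "unknown")


if __name__ == "__main__":
    for name, country in [("Andrea", "Italy"), ("Andrea", "USA"), ("Andrea", None)]:
        print(name, country, classify(name, country))
```

In this sketch the same first name receives a different label depending on the country, and it stays unlabeled when the country is missing, which is why a per-author country column is a prerequisite for a scheme of this kind.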

If it would be helpful to talk, I am free Thursday MST 7:30-10:45 and 12-2:45 (and I think you are 2 hours ahead, in EST?), and Jina could join us 12:15-2. Friday Jina and I are available 1:30-3:30 MST.

Thank you! Erin


Erin Leahey
Professor & Director, School of Sociology
University of Arizona
located on Tohono O'odham Nation homelands and the lands of the Pascua Yaqui Tribe
http://sociology.arizona.edu/leahey
pronouns: she/her/hers

XiaoranYan commented 4 years ago

Dear Xiaoran,

Thank you for talking last week, and sorry for the delay in getting back to you. We're very grateful you could pull some data for us for the project with Russ! To apply Vincent's gender classification scheme, and then merge it with paper-level measures like the CD index that Russ has computed, I think we'd need the following variables for WoS papers published in 2008 or later:

From the wos_address_names table:
  id
  addr_id
  name_id
  lang_id
  first_name
  last_name
  full_name

From the wos_addresses table:
  id
  addr_id
  country
  country_lang_id

I've put the most critical variables in blue. This is my best guess based on the chart we reviewed last week. However, if you have variable labels or other information, or can send a small sample of 10-100 observations for us to review, I can check and then modify the request if need be.

Jina and I meet tomorrow 1:15-3pm or so MST if you need to ring in with questions or clarifications.

Thank you again for all your help! Erin

XiaoranYan commented 4 years ago

Hi Erin and Jina,

The tables (in compressed gz format) are ready for you to download.

I have kept the 2 tables separate for smaller file sizes. If I were to join them, the result would be much bigger. The columns have also changed slightly, as follows:

From the address_names table:
  id
  addr_no_raw (the original position in which an address appears in a paper)
  seq_no (the original position in which an author appears in a paper; an address can be associated with multiple authors, and an author can have multiple addresses)
  role (I think you are looking for “real person authors”, but some groups or corporations do show up here)
  reprint (“Y” indicates that an author is the corresponding author of a paper)
  lang_id (a lot of missing data)
  first_name
  last_name
  full_name
  year

From the wos_addresses table:
  id
  addr_no_raw (the original position in which an address appears in a paper)
  full_address (added for verification; there are some noisy data for country)
  country
  country_lang_id (a lot of missing data)
  year
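
Since the two tables are delivered separately, one way to recombine them locally is sketched below. This is a hedged illustration, not a description of how the files were produced. It assumes each file is a gzip-compressed CSV with a header row, and that the pair (id, addr_no_raw) identifies one address on one paper in both tables. The function names and the example file names in the comments are hypothetical.

```python
import csv
import gzip
from collections import defaultdict


def read_gz_csv(path):
    """Read a gzip-compressed CSV with a header row into a list of dicts.

    Assumption: the delivered files are comma-separated with a header.
    """
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def join_names_to_addresses(names, addresses):
    """Left-join author rows to address rows on (id, addr_no_raw).

    Author rows without a matching address are kept, with empty
    country and full_address fields.
    """
    by_key = defaultdict(list)
    for addr in addresses:
        by_key[(addr["id"], addr["addr_no_raw"])].append(addr)

    joined = []
    for name in names:
        matches = by_key.get((name["id"], name["addr_no_raw"]), [])
        if not matches:
            matches = [{"country": "", "full_address": ""}]
        for addr in matches:
            joined.append(
                {
                    **name,
                    "country": addr["country"],
                    "full_address": addr["full_address"],
                }
            )
    return joined


if __name__ == "__main__":
    # Hypothetical file names; substitute the names of the downloaded files.
    # names = read_gz_csv("address_names.csv.gz")
    # addresses = read_gz_csv("wos_addresses.csv.gz")
    names = [{"id": "P1", "addr_no_raw": "1", "full_name": "Doe, Jane"}]
    addresses = [
        {"id": "P1", "addr_no_raw": "1", "country": "Canada",
         "full_address": "Montreal, Canada"}
    ]
    for row in join_names_to_addresses(names, addresses):
        print(row)
```

Keeping full_address in the joined rows makes it possible to spot-check the noisy country values mentioned above. For the full tables, loading everything into memory may not be practical, so a database or a chunked reader may be a better fit.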

Let me know if you have any questions regarding the datasets.

Thanks!

Xiaoran