iuni-cadre / Collaborative-projects

For non-fellow collaborative projects on CADRE

WoS citation data integration with NIH grant data (Chenwei Zhang) #8

Open XiaoranYan opened 5 years ago

XiaoranYan commented 5 years ago

Dear Xiaoran and Patricia,

Hi. This is Chenwei. It was so nice to meet you today and discuss my dissertation. Thanks for your kind suggestions.

Currently my work focuses on measuring the diversity of teams. I extracted five basic features for the measurement, including:

the scientific age of an author in each team (the current publication year - the first publication year + 1)

an author's impact (citation/h-index)

an author's productivity (number of publications in the corpus)

an author's research topic

an author's country
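The first two features can be sketched in a few lines. This is only an illustration: the scientific age follows the formula given in the list, while the h-index uses the standard definition (the largest h such that h papers each have at least h citations). The function names are hypothetical and are not taken from Chenwei's code.

```python
def scientific_age(current_year, first_year):
    """Scientific age as defined above: current year - first year + 1."""
    return current_year - first_year + 1


def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical author: first paper in 2010, current paper in 2019.
age = scientific_age(2019, 2010)
impact = h_index([10, 8, 5, 4, 3])
```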

I really hope to expand my work from the ACM dataset alone to another domain, such as the bio domain with the PubMed dataset. I have discussed some possibilities with Xiaoran. It would be great if we could have a PI/Co-PI dataset with their publication records. We first need to define a team by the co-investigation relation between these individuals. Then, for each member of the team, we want to extract the features I listed above. Ideally we want all of their publications from a broad dataset (for example, PubMed or WoS). If we can only extract publications associated with the grants, as in Katz' report, we will describe these features (except for the country) as reflecting the grant perspective.

Kindly let me know if you have any questions. Thank you so much for your great help!

Best regards,

Chenwei Zhang
PhD Candidate in Information Science / Adjunct Lecturer
School of Informatics, Computing, and Engineering
Indiana University Bloomington

XiaoranYan commented 5 years ago

Hi Chenwei,

Please redirect all follow-up conversations to our GitHub repo and use email only if privacy is a concern. Please also invite me as a collaborator to your own GitHub repo if there is one for this particular project.

Here is the preliminary data from Katz' report. The data consists of three CSV tables, which you can download from this link (valid for one week): https://iunimag.blob.core.windows.net/mag-2019-01-25/KatzData2.tar?st=2019-04-01T19%3A40%3A56Z&se=2019-04-12T19%3A40%3A00Z&sp=rl&sv=2017-07-29&sr=b&sig=XgXPgRH6jvTqPY5EHnWHVcAuKxK1Nd4JvYP8A2K%2BAGc%3D

Authors.csv contains basic information about the PIs, with the following columns: PI_NAMEs|pi_id|FULL_PROJECT_NUM|ORG_DUNS|ORG_NAME|ORG_DEPT|ORG_CITY|ORG_STATE|ORG_FIPS, where "ORG_DUNS" is the unique organization ID and "ORG_FIPS" is the country code (mostly US).

Teams.csv contains grant team information from NIH ExPORTER, with the following columns: PI_IDS |APPLICATION_ID|FULL_PROJECT_NUM |CORE_PROJECT_NUM|PROJECT_TITLE |PI_count|BUDGET_START|BUDGET_END|TOTAL_COST, where "PI_IDS" maps to Authors.csv and defines the grant teams. Many NIH grants are renewed each year, so a single "CORE_PROJECT_NUM" can span multiple rows in this table. "PI_count" is the team size, and the dataset is dominated by single-PI grants: only 48,276 of the 2,207,977 rows contain more than one PI.
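As a rough sketch of how grant teams could be read out of Teams.csv, the snippet below parses a small made-up sample and keeps only the multi-PI rows. It rests on several assumptions that should be checked against the real file: that the file is pipe-delimited (as the column list above suggests), that multiple IDs in "PI_IDS" are separated by semicolons, and that header names may carry stray spaces. The function names and sample rows are hypothetical.

```python
import csv
import io


def load_teams(handle, id_separator=";"):
    """Read pipe-delimited team rows; the ID separator is an assumption."""
    reader = csv.DictReader(handle, delimiter="|")
    teams = []
    for row in reader:
        # Strip stray spaces from header names and values.
        clean = {k.strip(): (v or "").strip() for k, v in row.items() if k}
        members = frozenset(
            p.strip() for p in clean["PI_IDS"].split(id_separator) if p.strip()
        )
        teams.append({"project": clean["CORE_PROJECT_NUM"], "members": members})
    return teams


def multi_pi_teams(teams):
    """Keep only rows whose team has more than one PI."""
    return [t for t in teams if len(t["members"]) > 1]


# Made-up sample with a reduced set of columns, for illustration only.
sample = io.StringIO(
    "PI_IDS |APPLICATION_ID|CORE_PROJECT_NUM\n"
    "101;102|1|R01AA000001\n"
    "103|2|R01AA000002\n"
)
teams = load_teams(sample)
collaborative = multi_pi_teams(teams)
```

Because renewals repeat the same "CORE_PROJECT_NUM" across rows, grouping the resulting teams by project before counting them is probably a sensible next step.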

Papers.csv contains paper-level information from Katz' data, with the following columns: AUTHOR_LIST| PMID|CORE_PROJECT| pi_id2|pi_lastname| Journal| Title|Year|citations| citingWoS, where "pi_id2" maps to Authors.csv and "citingWoS" is the total citation count from the PubMed subset of our WoS data (the papers that can be matched by DOI, which cover about 60% of all PubMed papers). "citations" was provided by the original authors, who gathered their data from the Elsevier Developer's API. I have not compared these two citation counts in detail, but there appear to be some differences. This table is also very messy, with duplicate PMIDs: the authors created one row per "pi_id2" match, and I have kept it as is for easier mapping to Authors.csv.
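One possible way to handle the repeated PMIDs is to collapse the rows into one record per paper while keeping the set of matched PIs, and then look at the gap between the two citation counts. The sketch below uses made-up rows already loaded as dictionaries; it assumes both citation columns hold integer-like values and that rows sharing a PMID carry the same counts, neither of which has been verified on the real data. The function names are hypothetical.

```python
def collapse_by_pmid(rows):
    """Merge rows that share a PMID, collecting every matched pi_id2."""
    papers = {}
    for row in rows:
        paper = papers.setdefault(
            row["PMID"],
            {
                "pi_ids": set(),
                "citations": row["citations"],
                "citingWoS": row["citingWoS"],
            },
        )
        paper["pi_ids"].add(row["pi_id2"])
    return papers


def citation_gaps(papers):
    """Difference between the Katz count and the WoS count per paper."""
    return {
        pmid: int(p["citations"]) - int(p["citingWoS"])
        for pmid, p in papers.items()
    }


# Made-up rows for illustration only.
rows = [
    {"PMID": "111", "pi_id2": "101", "citations": 12, "citingWoS": 9},
    {"PMID": "111", "pi_id2": "102", "citations": 12, "citingWoS": 9},
    {"PMID": "222", "pi_id2": "101", "citations": 3, "citingWoS": 3},
]
papers = collapse_by_pmid(rows)
gaps = citation_gaps(papers)
```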

Please note that the data provided by the Katz study is not as "clean" as the authors claim. I have identified many duplicate records, and many may still remain after my cleaning. Please use unique identifiers such as "pi_id", "FULL_PROJECT_NUM" and "PMID" when doing statistical analysis.
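A quick check for leftover duplicates is to count how often each identifier value occurs and list those that appear more than once. This is a minimal sketch on made-up rows; the helper name is hypothetical, and a repeated value is not always an error (for example, a "pi_id" legitimately appears once per project in Authors.csv).

```python
from collections import Counter


def repeated_keys(rows, key):
    """Return identifier values that occur in more than one row."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Made-up rows for illustration only.
authors = [{"pi_id": "1"}, {"pi_id": "1"}, {"pi_id": "2"}]
repeats = repeated_keys(authors, "pi_id")
```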

Please feel free to ask if you have any questions about the data. In my experience, the final dataset will take several rounds of updates based on your feedback.

Thanks! Xiaoran

zhang334 commented 5 years ago

Thank you so much, Xiaoran! I have downloaded the dataset. I will explore it after I am back from the iConference and will let you know when I have updates.