ropensci / refsplitr

R package for processing, organizing, and visualizing reference records downloaded from the Web of Science.
https://docs.ropensci.org/refsplitr

I found that when two authors share the same name, they are incorrectly grouped together as the same author. #101

Open. codetsang opened this issue 3 weeks ago.

codetsang commented 3 weeks ago

While processing my data, I found that when two authors have the same name, for example Yang, Xia (University A) and Yang, Xia (University B), they are grouped as the same author even though they are at different universities. These are distinct individuals who happen to share a name. Could this be considered a significant issue for the project?

Here is an example dataset. These authors are different individuals, but they are all assigned the same groupID (4595):

|   | authorID | groupID | author_name | author_order | address | university | department | postal_code | city | state | country | RP_address |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 429 | 429 | 4595 | Yang, Xia | 6 | Univ Malaya, Kuala Lumpur, Malaysia. | univ malaya | NA | NA | kuala lumpur | NA | malaysia | NA |
| 1211 | 1211 | 4595 | Yang, Xia | 1 | Cent South Univ, Xiangya Hosp 3, Dept Pediat, 138 Tongzipo Rd, Changsha 410013, Hunan, Peoples R China. | cent south univ | xiangya hosp 3 | 41001 | changsha | hunan | peoples r china | NA |
| 1294 | 1294 | 4595 | Yang, Xia | 5 | Air Force Med Ctr, Dept Anesthesiol, Beijing, Peoples R China. | air force med ctr | NA | NA | dept anesthesiol | beijing | peoples r china | NA |
| 1505 | 1505 | 4595 | Yang, Xia | 6 | Shenzhen Univ, Shenzhen Peoples Hosp 2, Affiliated Hosp 1, Dept Traumat Orthoped,Shenzhen Translat Med Inst, Shenzhen 518028, Peoples R China. | shenzhen univ | shenzhen peoples hosp 2 | 51802 | shenzhen | NA | peoples r china | NA |
| 1723 | 1723 | 4595 | Yang, Xia | 2 | Jiangsu Univ Sci & Technol, Sch Comp, Zhenjiang 212003, Jiangsu, Peoples R China. | jiangsu univ sci & technol | sch comp | 21200 | zhenjiang | jiangsu | peoples r china | NA |
| 3647 | 3647 | 4595 | Yang, Xia | 1 | Nanjing Univ Chinese Med, Affiliated Hosp Integrated Tradit Chinese & Wester, Dept Endocrinol, Nanjing, Peoples R China. | nanjing univ chinese med | affiliated hosp integrated tradit chinese & wester | NA | dept endocrinol | nanjing | peoples r china | NA |
| 4072 | 4072 | 4595 | Yang, Xia | 1 | Dalian Maritime Univ, Nav Coll, Dalian, Peoples R China. | dalian maritime univ | NA | NA | nav coll | dalian | peoples r china | NA |
| 4479 | 4479 | 4595 | Yang, Xia | 1 | Shandong Univ, Qilu Hosp, Cheeloo Coll Med, Dept Neurosurg, Jinan, Peoples R China. | shandong univ | qilu hosp | NA | dept neurosurg | jinan | peoples r china | NA |
| 4541 | 4541 | 4595 | Yang, Xia | 4 | Hunan Univ Chinese Med, Coll Chinese Med, Changsha 410208, Hunan Province, Peoples R China. | hunan univ chinese med | coll chinese med | 41020 | changsha | hunan province | peoples r china | NA |
| 4595 | 4595 | 4595 | Yang, Xia | 1 | Beijing Univ Chinese Med, Grad Sch, Beijing, Peoples R China. | beijing univ chinese med | NA | NA | grad sch | beijing | peoples r china | NA |

https://docs.ropensci.org/refsplitr/articles/refsplitr.html#author-address-parsing-and-name-disambiguation

Once we have our subset of possible similar entries, we match the existing info of row 1 against the subset. The entry only needs to match one extra piece of information - either address, email, or middle name. If it matches we assume it is the same person, and change the groupID numbers to reflect this.
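
Read literally, the rule quoted above keeps same-name records apart unless they share at least one more attribute. The following is a minimal Python sketch of that reading, written only to make the rule concrete. It is not refsplitr's implementation: the function names, the record layout, and the choice of `university`, `email`, and `middle_name` as the "extra" fields are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical record layout: the field names are assumptions for this sketch,
# not refsplitr's internal column names.
Record = dict[str, Optional[str]]

# Assumed stand-ins for "address, email, or middle name" in the quoted rule.
EXTRA_FIELDS = ("university", "email", "middle_name")


def shares_extra_info(a: Record, b: Record) -> bool:
    """Return True if some extra field is present in both records and equal."""
    for field in EXTRA_FIELDS:
        va, vb = a.get(field), b.get(field)
        if va is not None and vb is not None and va == vb:
            return True
    return False


def assign_groups(records: list[Record]) -> list[int]:
    """Assign a group ID per record.

    A record joins the group of the first earlier record that has the same
    name AND shares one extra field; otherwise it keeps its own ID.
    """
    group_ids = list(range(len(records)))
    for i, rec in enumerate(records):
        for j in range(i):
            same_name = records[j]["author_name"] == rec["author_name"]
            if same_name and shares_extra_info(records[j], rec):
                group_ids[i] = group_ids[j]
                break
    return group_ids


records = [
    # First two rows of the example table (email and middle name are not given there).
    {"author_name": "Yang, Xia", "university": "univ malaya",
     "email": None, "middle_name": None},
    {"author_name": "Yang, Xia", "university": "cent south univ",
     "email": None, "middle_name": None},
    # Hypothetical third record, invented here to show a positive match.
    {"author_name": "Yang, Xia", "university": "univ malaya",
     "email": None, "middle_name": None},
]

print(assign_groups(records))  # [0, 1, 0]
```

Under this reading, the Univ Malaya and Cent South Univ records stay in separate groups because the name is the only thing they have in common, while the invented third record joins the first through its matching affiliation. Missing values are deliberately never treated as a match, so two records with no email do not count as sharing one.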