schochastics / CRAN_collaboration

Analysing the collaboration graph of R package developers on CRAN

CRAN Collaboration Graph and Hadley number suffering from heterogeneous spelling issues #2

Open joseffrank opened 10 months ago

joseffrank commented 10 months ago

Thank you very much for this amazing idea and doing the calculations.

What immediately came to mind was the everyday problem of heterogeneous spellings of authors' names. In the example below, the four entries appear to name four different people but almost surely refer to the same person.

Clearly, cleaning that up is not trivial. Nonetheless, the ranking would benefit greatly from some further effort in this direction.

| name | Hadley number | centrality | centrality ranking |
| --- | --- | --- | --- |
| Frank E. Harrell Jr | 1 | 3.51163974049103 | 157 |
| Frank E Harrell Jr | 1 | 3.59140058516728 | 245 |
| Frank Harrell | 2 | 4.01450197175932 | 1307 |
| Frank Harrell Jr | 2 | 4.76173514819997 | 6318.5 |

schochastics commented 10 months ago

Yes, I'm afraid there are many such cases, and they need to be detected and cleaned manually.
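
A rough first pass could at least flag candidates for that manual review. The following is only a minimal sketch in Python, not code from this repository: `name_key` and `candidate_duplicates` are hypothetical helpers, and the rule of grouping on first and last name token after dropping suffixes such as "Jr" is an assumption chosen for illustration.

```python
import re
import unicodedata
from collections import defaultdict

# Assumed list of generational suffixes to ignore; extend as needed.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def name_key(name):
    """Reduce a name to a crude (first token, last token) key."""
    # Strip accents so that e.g. "é" and "e" compare equal.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase and keep alphabetic tokens only (drops dots and commas).
    tokens = re.findall(r"[a-z]+", ascii_name.lower())
    tokens = [t for t in tokens if t not in SUFFIXES]
    if not tokens:
        return ("", "")
    return (tokens[0], tokens[-1])


def candidate_duplicates(names):
    """Group spellings that share a key; keep groups with more than one spelling."""
    groups = defaultdict(set)
    for n in names:
        groups[name_key(n)].add(n)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


if __name__ == "__main__":
    names = [
        "Frank E. Harrell Jr",
        "Frank E Harrell Jr",
        "Frank Harrell",
        "Frank Harrell Jr",
        "Hadley Wickham",
    ]
    for key, spellings in candidate_duplicates(names).items():
        print(key, spellings)
```

A key this coarse will also group distinct people who happen to share a first and last name, so the output is a list of candidates to check by hand rather than an automatic merge.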