tpoisot / esa2014twitter

Analysis of #ESA2014 tweets
MIT License

Name errors in co-author network analysis #2

Open noamross opened 9 years ago

noamross commented 9 years ago

Generating a network of co-authors from names in the ESA program is problematic because an individual's name may vary from year to year, or even within a single year. There are two main reasons: co-authors' names are entered inconsistently ("Rich" instead of "Richard", or "Simon Levin" vs. "Simon A. Levin"), and people's names actually change. The latter primarily affects early-career women who are likely to change their names due to marriage.

This introduces some systematic bias into any measure of the network, so we have to ask (1) how can we correct these errors, and (2) which measures are robust to them? Network analysis isn't my forte, so I'd like some feedback on this.

Regarding (1):

Regarding (2):

Other thoughts?

tpoisot commented 9 years ago

Re. the abbreviation of names, perhaps we can take the first n (= 3 or 4) letters of each first name; that should solve some of it.
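A minimal sketch of that truncation idea, in Python purely for illustration (nothing in this thread fixes a language). The function name `name_key` is hypothetical, and lowercasing plus dropping middle names and initials are extra assumptions on top of the suggestion, added so that "Simon Levin" and "Simon A. Levin" collapse to the same key.

```python
import re

def name_key(full_name, n=3):
    """Return (first n letters of the first name, surname), lowercased.

    Hypothetical helper: middle names and initials are ignored, which is
    an assumption beyond the "first n letters" suggestion.
    """
    # Split on whitespace and periods so "A." becomes the token "a".
    parts = [p for p in re.split(r"[\s.]+", full_name.lower()) if p]
    if not parts:
        return ("", "")
    if len(parts) == 1:
        # A single token is treated as a surname with no first name.
        return ("", parts[0])
    return (parts[0][:n], parts[-1])

# Both spellings of each author map to a single node key.
for name in ["Simon Levin", "Simon A. Levin", "Rich Smith", "Richard Smith"]:
    print(name, "->", name_key(name))
```

This would not catch surname changes or nicknames such as "Bob"/"Robert", and a small n can merge different people (for example "Richard" and "Ricardo" with the same surname), so it probably only addresses part of the problem.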

My guess is that unless some really well-connected people are systematically affected by this, the network metrics should be relatively robust to it, primarily because the number of people involved is really large, so the nodes that contribute a lot to overall properties should be few.
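One way to test that guess rather than assume it would be a small sensitivity check: corrupt a fraction of author names and see how much a metric moves. The sketch below is only an illustration in Python. It uses a synthetic random set of papers (not the ESA program data), and the helper names, the 5% error rate, and the choice of mean degree as the metric are all assumptions.

```python
import random
from itertools import combinations

def coauthor_edges(papers):
    """Return the set of undirected co-author edges implied by author lists."""
    edges = set()
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return edges

def mean_degree(edges):
    """Average number of distinct co-authors per node."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return sum(degree.values()) / len(degree) if degree else 0.0

def corrupt(papers, rate, rng):
    """Replace each author occurrence by an alias with probability `rate`,
    mimicking a name that is entered differently on some abstracts."""
    return [[a + "_alias" if rng.random() < rate else a for a in authors]
            for authors in papers]

# Synthetic toy data: 1000 papers drawn from 500 hypothetical authors.
rng = random.Random(42)
people = [f"author{i}" for i in range(500)]
papers = [rng.sample(people, rng.randint(2, 5)) for _ in range(1000)]

clean = mean_degree(coauthor_edges(papers))
noisy = mean_degree(coauthor_edges(corrupt(papers, 0.05, rng)))
print(f"mean degree: clean={clean:.2f}, with 5% name errors={noisy:.2f}")
```

Errors here are applied uniformly at random. To probe the caveat about well-connected people, `corrupt` could be changed to target only the highest-degree authors, and the same comparison repeated for other metrics.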