ufvivotech / ufDataQualityImprovement

Project to improve UF vivo data quality and accuracy

Merge additional UF author stubs #97

Open ankitbaderiya opened 11 years ago

ankitbaderiya commented 11 years ago

There are currently 29,673 additional stubs. Merge each stub into its author's main profile.

ankitbaderiya commented 11 years ago

Merged stubs for 61 UF author entries having publications in the range 86-370.

mconlon17 commented 11 years ago

Hi. What does this mean -- publications in the range 86-370?

ankitbaderiya commented 11 years ago

I am merging stubs in decreasing order of publication count. I started with the UF author who has 370 publications, and so far I have completed the stub merges for all authors with 86 or more publications.

mconlon17 commented 11 years ago

Sounds good. How long does it take you to do these kinds of merges? For example, if an author has 10 stubs plus their primary profile, how long does it take to merge all these to the primary?

ankitbaderiya commented 11 years ago

If the author has a main profile, it takes 3-4 minutes to merge 10 stubs. The time goes to: (1) establishing whether those publications belong to that author (many authors have diverse research areas); (2) searching for and establishing identity when an author uses multiple first names (e.g. Michael/Mike, Douglas/Doug), has multiple last names, or has first and last names interchanged; (3) verifying that the stub has merged correctly; and (4) waiting when VIVO responds slowly (which happens most of the time). If VIVO runs smoothly, I am able to merge profiles for around 100 authors on average in 7-8 hours.
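The name checks in step (2) could be partly automated. Below is a minimal Python sketch, not the process actually used here: the `NICKNAMES` table and the function names are hypothetical, and a real nickname list would be much longer. It only flags candidate pairs; a person still has to confirm that the publications belong to the same author.

```python
# Hypothetical nickname table (assumption); extend it with real variants.
NICKNAMES = {"mike": "michael", "doug": "douglas"}


def canonical(part: str) -> str:
    """Lowercase one name part and map a known nickname to its full form."""
    key = part.strip().lower()
    return NICKNAMES.get(key, key)


def name_key(first: str, last: str) -> frozenset:
    """Build an order-independent key so interchanged first/last names match."""
    return frozenset({canonical(first), canonical(last)})


def likely_same_person(a: tuple, b: tuple) -> bool:
    """Return True when two (first, last) pairs are candidates for one author."""
    return name_key(*a) == name_key(*b)
```

For example, `likely_same_person(("Mike", "Smith"), ("Smith", "Michael"))` treats the two entries as a candidate pair, because the nickname is expanded and the order of the parts is ignored. Because the nickname table is applied to both parts, a surname that happens to equal a nickname would also be expanded, which is acceptable for flagging candidates but not for automatic merging.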

ankitbaderiya commented 11 years ago

Merged stubs for 111 UF author entries having publications in the range 56-85.

ankitbaderiya commented 11 years ago

Merged stubs for 91 UF author entries having publications in the range 44-55.

ankitbaderiya commented 11 years ago

Merged stubs for 59 UF author entries having publications in the range 40-43.

ankitbaderiya commented 11 years ago

Merged stubs for 127 UF author entries having publications in the range 33-40.

ankitbaderiya commented 11 years ago

Merged stubs for 125 UF author entries having publications in the range 28-32.

ankitbaderiya commented 11 years ago

Merged stubs for 68 UF author entries having publications in the range 26-28.

ankitbaderiya commented 11 years ago

Merged stubs for 104 UF author entries having publications in the range 24-26.

ankitbaderiya commented 11 years ago

Merged stubs for 76 UF author entries having publications in the range 22-24.

ankitbaderiya commented 11 years ago

Merged stubs for 60 UF author entries having publications in the range 21-22.

ankitbaderiya commented 11 years ago

Merged stubs for 52 UF author entries having publications in the range 20-21.

ankitbaderiya commented 11 years ago

Merged stubs for 66 UF author entries having publications in the range 19-20.

ankitbaderiya commented 11 years ago

Merged stubs for 41 UF author entries having publications in the range 18-19.

ankitbaderiya commented 11 years ago

Merged stubs for 106 UF author entries having publications in the range 17-18.

ankitbaderiya commented 11 years ago

Merged stubs for 112 UF author entries having publications in the range 15-17.

ankitbaderiya commented 11 years ago

Merged stubs for 77 UF author entries having publications in the range 14-15.

ankitbaderiya commented 11 years ago

Merged stubs for 115 UF author entries having publications in the range 13-14.

ankitbaderiya commented 11 years ago

Merged stubs for 120 UF author entries having publications in the range 12-13.

ankitbaderiya commented 11 years ago

Merged stubs for 79 UF author entries having 12 publications each.

ankitbaderiya commented 11 years ago

Merged stubs for 87 UF author entries having publications in the range 11-12.

ankitbaderiya commented 11 years ago

Merged stubs for 114 UF author entries having publications in the range 10-11.

ankitbaderiya commented 11 years ago

Merged stubs for 57 UF author entries having 10 publications each.

mconlon17 commented 10 years ago

Current count is 16,348. The query below defines a stub as a UF Person with at least one paper who does not have a UFID.

SELECT ?uri (COUNT (DISTINCT ?a) AS ?npapers)
WHERE {
    ?uri a ufVivo:UFEntity .
    ?uri a foaf:Person .
    ?uri vivo:authorInAuthorship ?a .
    FILTER NOT EXISTS { ?uri ufVivo:ufid ?ufid . }
}
GROUP BY ?uri
ORDER BY DESC(?npapers)

Some "stubs" are complete profiles for people who were never associated with their UFID. This will also need to be addressed. We can add a few more selectors to the above query to separate these profiles from true stubs.
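As a rough illustration of that separation, here is a Python sketch that partitions the query results after they are fetched, rather than inside the query. It assumes each person has been loaded as a dict, and the field names in `profile_fields` (`positions`, `email`, `overview`) are hypothetical placeholders for whatever selectors are eventually chosen.

```python
def split_stubs(people, profile_fields=("positions", "email", "overview")):
    """Partition UFID-less people into true stubs and unlinked full profiles.

    A person counts as an unlinked profile if any of the (hypothetical)
    profile fields is present and non-empty; otherwise it is a true stub.
    """
    true_stubs, unlinked_profiles = [], []
    for person in people:
        has_profile_data = any(person.get(field) for field in profile_fields)
        if has_profile_data:
            unlinked_profiles.append(person)
        else:
            true_stubs.append(person)
    return true_stubs, unlinked_profiles
```

True stubs are candidates for merging into a primary profile, while unlinked profiles need a UFID attached instead.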

mconlon17 commented 10 years ago

18,000 stubs. I worked the top five: some had parallel profiles, and some had left the university. I was able to associate a UFID with the primary profile and merge in the stubs. This should be an ongoing activity, but it will require new tools for merging and other automation.

We are capturing many publications, but many of them are not associated with fully identified people, that is, people who have a UFID. This is not surprising. We have many authors who are not normally added to VIVO, such as OPS, graduate students, and others. These people enter VIVO through the publication process without a UFID, and subsequent effort is required to match each person to a UFID at UF. We could consider making the publication process smarter about adding people. If it can find candidates in the contact data (with UFID), it might be able to match from the start.
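One way this matching could look is sketched below in Python. It is only an illustration under stated assumptions: the contact data is assumed to be a list of dicts with `first`, `last`, and `ufid` keys, the function names are hypothetical, and the UFID values in any example are made up. The lookup key is the last name plus first initial, and a UFID is returned only when exactly one contact matches, so ambiguous names still fall through to manual review.

```python
from collections import defaultdict


def build_index(contacts):
    """Index contact records by (last name, first initial), lowercased."""
    index = defaultdict(list)
    for contact in contacts:
        key = (contact["last"].strip().lower(), contact["first"].strip()[:1].lower())
        index[key].append(contact["ufid"])
    return index


def match_ufid(index, first, last):
    """Return a UFID only when exactly one contact matches; otherwise None."""
    key = (last.strip().lower(), first.strip()[:1].lower())
    candidates = index.get(key, [])
    return candidates[0] if len(candidates) == 1 else None
```

Returning `None` on zero or multiple candidates is a deliberately conservative choice: attaching the wrong UFID to an author is harder to undo than leaving a stub for later review.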