ropensci / ozunconf17

Website for 2017 rOpenSci Ozunconf
http://ozunconf17.ropensci.org/

Extend the nhmrcData package (Australian National Health & Medical Research Council funding outcomes) #27

Open timchurches opened 6 years ago

timchurches commented 6 years ago

The nhmrcData package for R by Neil Saunders (see https://nsaunders.wordpress.com/2017/03/15/an-update-to-the-nhmrcdata-r-package/ and https://github.com/neilfws/politics/tree/master/nhmrcData ) nicely packages most of the data published by the Australian NH&MRC (National Health & Medical Research Council) as tidy (in the tidyverse sense) R data frames.

However, what's missing are the names of all the CIs (chief investigators) on each successful Project Grant: currently only the name of the CIA (chief investigator A) is available. These data are available from the current round back as far as 2003 (see https://www.nhmrc.gov.au/grants-funding/outcomes-funding-rounds/previous-outcomes-project-grants-funding-rounds ). However, the data are contained in PDF documents, and the layout of those documents changes every few years, so some work is required to extract the full set of CIs for each Project Grant. This could be done manually, given that it is a once-only task, but a better approach might be to use the open-source CrossRef pdf-extract tool (see https://github.com/CrossRef/pdfextract ) to extract all the data in each PDF as regionalised XML, and then use R to transform that into nice tidy data frames (or tibbles) that work well with the existing tibbles provided by the nhmrcData package. The approach would be rather similar to that used by the Parse-CV package for R by K. del Rosso (see https://github.com/kdelrosso/Parse-CV ).
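To make the PDF-wrangling side a bit more concrete, here is a rough, purely illustrative sketch using the pdftools R package instead of CrossRef pdf-extract; the file name and the "APP" application-ID pattern are assumptions about the layout, which would need per-year tweaking:

```r
# Rough sketch only: pull the text out of one NHMRC outcomes PDF and keep the
# lines that look like grant rows. The file name and the "APP" ID pattern are
# assumptions; the real layouts change every few years and need per-year rules.
library(pdftools)
library(stringr)
library(dplyr)
library(tibble)

extract_grant_lines <- function(pdf_path) {
  pages <- pdf_text(pdf_path)                    # one character string per page
  lines <- str_squish(unlist(str_split(pages, "\n")))
  tibble(raw = lines) %>%
    filter(str_detect(raw, "^APP\\d{6,7}")) %>%  # hypothetical application-ID prefix
    mutate(app_id = str_extract(raw, "^APP\\d{6,7}"))
}

# grants_2015 <- extract_grant_lines("project_grants_2015.pdf")
```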

Why bother doing this? Well, access to the full set of CI names permits several additional analyses which might be of interest, in particular the construction of co-investigator (collaboration) graphs.

Analysis (and visualisation) of these graphs would be the fun part of this project, but the first day may be needed to prepare and extract all the data. The resulting analyses would be of broad interest and are probably publishable (e.g. in the Medical Journal of Australia or PLOS ONE), so a publication might result for participants in this project.

There are also some data on other NH&MRC funding streams that haven't been incorporated into the nhmrcData package yet. These could also be considered.

If there is interest in this proposal, we would attempt to contact Neil Saunders, as the original data package author, to involve him or at least inform him of the project and its intentions.

A similar project could be undertaken for ARC funding, if such a thing is not already available.

dicook commented 6 years ago

I have code for ARC funding that could be used as a base. The data for successful proposals are available for download, without demographic information.

dicook commented 6 years ago

Oh, there is an interesting Shiny app at https://aushsi.shinyapps.io/orcid/ (code at https://github.com/agbarnett/helping.funders ) for academic publication record support. It seems to have some bugs if you request early-year publications, but it's pretty impressive.

timchurches commented 6 years ago

@dicook

Oh, there is an interesting Shiny app at https://aushsi.shinyapps.io/orcid/ (code at https://github.com/agbarnett/helping.funders ) for academic publication record support. It seems to have some bugs if you request early-year publications, but it's pretty impressive.

Agreed, that's an impressive and potentially useful app (I had to do by hand exactly what it automates three times in the last two weeks, and each time it took maybe an hour of fiddly manual editing).

Extending the NH&MRC (and/or ARC) grant data to permit analysis of the publications of each CI on each successful grant would be fascinating, although quite a bit more challenging to implement. It would permit all manner of additional analyses, including examination of the distribution of publication metrics across all CIs, aggregated by project grant or sub-discipline, and of course examination of the effect of a project grant on subsequent publications (allocating specific papers to specific grants might be tricky to automate, but something like doc2vec embedding might even work, given enough data).

The challenge is in going from just the CI's name to their publication record. If they are dutiful, they will have all their publications listed and maintained in ORCID, in which case it is pretty easy once the correct ORCID ID is found (although even finding the correct ORCID ID is not so easy for investigators with common names, and some investigators manage to create several ORCID IDs for themselves, none of which have a canonical list of their publications...). But without ORCID, it is a much trickier problem.
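For the easy (ORCID) case, something like the following sketch using the rorcid package might be a starting point; it needs an ORCID API token, the name is a placeholder, and blindly taking the first search hit is exactly the disambiguation problem described above:

```r
# Minimal sketch of the "easy" ORCID path, using the rorcid package (needs an
# ORCID API token; see ?rorcid::orcid_auth). The name is a placeholder, and
# picking the first search hit glosses over the common-name problem.
library(rorcid)

candidates <- orcid_search(given_name = "Jane", family_name = "Citizen")

if (nrow(candidates) > 0) {
  # works() returns the publication metadata attached to that ORCID record,
  # which is only as complete as the researcher has bothered to make it
  pubs <- works(candidates$orcid[1])
}
```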

One approach would be to harvest publications from investigator CVs (assuming such CVs are available; most investigators seem more willing to maintain a list of their publications in their CV than on ORCID). As mentioned above, there is an existing CV parser library for R (see https://github.com/kdelrosso/Parse-CV ) which works tolerably well (although it could do with a few tweaks).

Reference harvesting from CVs could then be used to train a model for each researcher which can distinguish that researcher's papers and publications from all the other papers and publications by researchers with the same name (or the same names, in the case of researchers who change their name at some point in their career). Some feature engineering of each researcher's known-good publications, to extract institution, city, country, possibly subject-matter area, and of course co-authors, might be needed to train a useful model.
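As a purely illustrative sketch of that feature engineering, assuming the CV parser emits a tibble with authors, journal and year columns (an assumption, not the actual Parse-CV output format):

```r
# Sketch only: turn known-good references parsed out of a CV into a small set
# of features for a per-researcher model. The input columns (authors, journal,
# year) are an assumed output format of whatever CV parser ends up being used.
library(dplyr)
library(stringr)

make_features <- function(refs, researcher_surname) {
  refs %>%
    mutate(
      n_authors   = str_count(authors, ";") + 1,  # assumes ";"-separated author strings
      has_surname = str_detect(authors, fixed(researcher_surname)),
      label       = TRUE                          # CV references are positive examples
    ) %>%
    select(journal, year, n_authors, has_surname, label)
}
```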

Maybe such publication harvesting and analysis might be considered a stretch goal?

robjhyndman commented 6 years ago

CV parsing seems highly likely to fail. A better solution might be to use Google Scholar. The scholar package (https://cran.r-project.org/package=scholar) can be used.
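For example, a minimal sketch assuming the researcher's Google Scholar ID is already known (the ID below is a placeholder, not a real profile):

```r
library(scholar)

id <- "XXXXXXXXXXXX"          # placeholder Scholar ID
prof <- get_profile(id)       # name, affiliation, h-index, ...
pubs <- get_publications(id)  # one row per publication: title, journal, year, citations
head(pubs)
```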

timchurches commented 6 years ago

Here's a gist illustrating the sorts of things that can be done with co-authorship graphs, using ORCID as the source (but ORCID completeness tends to be very poor): https://gist.github.com/timchurches/442a7aaaa31f03d918cbef3c46f1d88a
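The gist itself isn't reproduced here, but as a toy sketch of the co-authorship-graph idea with igraph (the edge list below is invented; in practice it would come from the extracted CI lists or from ORCID works):

```r
# Toy sketch of a co-investigator graph with igraph; the edge list is made up.
library(igraph)

edges <- data.frame(
  from = c("CI Smith", "CI Smith", "CI Jones"),
  to   = c("CI Jones", "CI Lee",   "CI Lee")
)

g <- graph_from_data_frame(edges, directed = FALSE)
degree(g)                   # number of collaborators per CI
plot(g, vertex.size = 25)
```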

timchurches commented 6 years ago

@robjhyndman

CV parsing seems highly likely to fail. A better solution might be to use Google Scholar. The scholar package (https://cran.r-project.org/package=scholar) can be used.

The author of the Parse-CV package claims an 85% success rate in parsing 45k PDF CVs harvested from the internet. However, it failed miserably on mine; that's my fault for using a fancy template with drop caps for each section heading (as my partner wryly observed, CVs are not supposed to look like they were illuminated by a mediaeval monk who last worked on the Book of Kells).

Google Scholar works really well for people who have created a Scholar ID for themselves, and the R scholar package relies on those Scholar IDs. But only a minority of researchers have Scholar IDs, alas, and if they don't, a Google Scholar search on just their name returns a potentially large superset of that researcher's publications, particularly if they have a common name.

Training a model to winnow that superset down to the set of actual publications for that researcher is what I had in mind, as outlined above. I suspect that is what Google and CrossRef do: once you define your profile of definitely-your-publications in Google Scholar, it seems to use the attributes of those publications to automatically add new publications to your profile. It seems to be fairly smart about this, but then, my name is quite rare, and perhaps it's not so smart for someone with a more common name.

My idea was to use known-good references culled from a CV to train a model which is specific to that researcher, which could then be used to refine all the results returned for their name by Google Scholar when they don't have a Scholar ID. Given the categorical attributes (features) for each publication, some sort of tree model would probably work best, and it could just be one big tree model, with a branch for each researcher of interest.
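A toy sketch of that one-big-tree idea, using rpart on an invented training set (known-good CV references as positives, other same-name Scholar hits as negatives; all names and values below are made up):

```r
# Sketch only: one classification tree over all researchers of interest, with
# the researcher as a predictor ("a branch for each researcher").
library(rpart)

train <- data.frame(
  researcher = factor(c("A", "A", "A", "B", "B", "B")),
  journal    = factor(c("MJA", "MJA", "Nature", "PLOS ONE", "PLOS ONE", "MJA")),
  n_authors  = c(5, 6, 40, 3, 4, 12),
  label      = factor(c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE))
)

fit <- rpart(label ~ researcher + journal + n_authors, data = train,
             method = "class", control = rpart.control(minsplit = 2))

# predict(fit, newdata = scholar_hits, type = "class") would then flag which of
# a researcher's name-matched Scholar results are plausibly really theirs.
```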