Closed by embruna 6 years ago
By the `separate` function, do you mean this set of code?
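```r
library(tidyr)    # separate()
library(dplyr)    # mutate(), select(), %>%
library(stringr)  # str_extract()
library(stringi)  # stri_extract_last_words()

# Split each address into university, department, and the remainder
dat <- separate(data = eb_refined, col = address,
                into = c("university", "department", "short_address"),
                sep = ",", extra = "merge", remove = FALSE) %>%
  # Pull the country, a 5-digit zip, and "City, ST" out of the remainder
  mutate(country = stri_extract_last_words(short_address),
         zip = str_extract(string = short_address, pattern = "[:digit:]{5}"),
         city_state = str_extract(string = short_address,
                                  pattern = "[:alnum:]{1,20}[,][ ][A-Z][A-Z]")) %>%
  select(address, short_address, city_state, zip, country, university, department)
```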
If so, we can certainly wrap that in a refnet-specific function to parse the institutions.
For 1 we could make an institution parsing function. Since
For 2, we could either make that part of the institution parsing function or provide guidance in the vignette on how users can do it themselves. At that point the data are all in a data frame, so it won't take anything fancy to subset them; all the function or user would be doing is running some standard R subsetting code.
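For what it's worth, here is a minimal sketch of the kind of standard subsetting meant here. The toy `dat` and its values are invented for illustration; only the column names follow the `select()` call above.

```r
# Toy stand-in for the parsed output; all values are made up for illustration
dat <- data.frame(
  university = c("Univ Florida", "Univ Florida", "Univ Sao Paulo"),
  department = c("Dept Wildlife Ecol", "Dept Biol", "Dept Ecol"),
  country    = c("USA", "USA", "Brazil"),
  stringsAsFactors = FALSE
)

# Keep only the rows for one institution
uf <- dat[dat$university == "Univ Florida", ]

# Or keep a couple of columns for records from one country
brazil <- subset(dat, country == "Brazil", select = c(university, department))

# Count records per institution
inst_counts <- table(dat$university)
```

Nothing here is refnet-specific, which is why a short vignette section might be enough.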
Yes, that's the code I meant. I was thinking about it in the step-by-step way for people less familiar with R.
I know what I would want: a data frame of each unique author with their ID number and the information we can parse out of the address (institution, country... maybe city or zip?). But people move, so some people will have multiple entries (for institution, country, city, zip).
One way to make it easier might be to do it as part of `refine_authors`, which already lists each author of each article. All we'd be doing is adding columns for their country, institution, etc. **for that reference record** to the end of that.
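A rough sketch of that idea in base R, assuming the author table and the parsed addresses can both be keyed by author and reference. The table names, column names (`author_id`, `ref_id`), and values are hypothetical, not the actual refnet output.

```r
# Hypothetical author table: one row per author per article
authors <- data.frame(
  author_id = c(1, 1, 2),
  ref_id    = c(10, 11, 10),
  stringsAsFactors = FALSE
)

# Hypothetical parsed addresses for the same author/reference records
parsed <- data.frame(
  author_id  = c(1, 1, 2),
  ref_id     = c(10, 11, 10),
  university = c("Univ A", "Univ B", "Univ A"),
  country    = c("USA", "Brazil", "USA"),
  stringsAsFactors = FALSE
)

# Add the address columns to the end of each author-by-reference row
authors_geo <- merge(authors, parsed, by = c("author_id", "ref_id"), all.x = TRUE)

# One row per unique author/institution combination;
# an author who moved keeps one row per institution
author_inst <- unique(authors_geo[, c("author_id", "university", "country")])
```

In this toy data, author 1 ends up with two rows in `author_inst`, which is the "people move" case.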
Address parsing is now part of `read_authors`. It pulls out the university, department, postal code, country name, and address.
I'll work next on the `address_lat_long()` function, so that it just does the georeferencing.
Cool. Does it append it to an existing output or does it spit out a separate one? Just want to update the vignette before I forget.
Found it, this is awesome.
The `separate` function was huge because it finally made it possible to do analyses based on institution. Can we polish this and its output and add it to the flow chart? E.g., can we link individual authors and the output in `dat`? Perhaps there are two things to do:

1) Allow users to run `separate` after `refine_authors` and add a column with the parsed institution name to `authors` in the `foo_authors` list
2) Allow people to spit out a csv file of `dat` with all the institutions, departments, etc. for users who want this (with the caveat that it is messy output in some fields)
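For the second item, a minimal sketch of the csv export using base R. The toy `dat`, its values, and the file name are placeholders; the `NA` stands in for a field the parser could not fill.

```r
# Toy stand-in for the parsed institutions table; values are invented
dat <- data.frame(
  university = c("Univ A", "Univ B"),
  department = c("Dept Biol", NA),   # NA mimics a messy or unparsed field
  country    = c("USA", "Brazil"),
  stringsAsFactors = FALSE
)

# Write to a temporary location; a real user would pick their own path
out_file <- file.path(tempdir(), "institutions.csv")
write.csv(dat, out_file, row.names = FALSE)

# Read it back to confirm the round trip
check <- read.csv(out_file, stringsAsFactors = FALSE)
```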