Address Parsing and Georeferencing as 2 steps

embruna commented 6 years ago

The separate function was huge because it finally made it possible to do analyses based on institution. Can we polish this and its output and add it to the flow chart? E.g., can we link individual authors and the output in dat? Perhaps there are two things to do:

1) Allow users to run separate after refine_authors and add a column with the parsed institution name to authors in the foo_authors list

2) Allow people to spit out a csv file of dat with all the institutions, departments, etc. for users who want this (with the caveat that it is messy output in some fields)

aurielfournier commented 6 years ago

By the separate function, do you mean this set of code?


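library(dplyr)    # mutate(), select(), %>%
library(tidyr)    # separate()
library(stringr)  # str_extract()
library(stringi)  # stri_extract_last_words()

# Split each address at its first two commas, then pull country, zip, and city/state
# out of whatever remains in short_address
dat <- separate(data = eb_refined, col = address,
                into = c("university", "department", "short_address"),
                sep = ",", extra = "merge", remove = FALSE) %>%
  mutate(country = stri_extract_last_words(short_address),
         zip = str_extract(string = short_address, pattern = "[:digit:]{5}"),
         city_state = str_extract(string = short_address,
                                  pattern = "[:alnum:]{1,20}[,][ ][A-Z][A-Z]")) %>%
  select(address, short_address, city_state, zip, country, university, department)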

If so, we can certainly wrap that in a refnet-specific function to parse the institutions.

For 1 we could make an institution parsing function. Since

For 2, we could either make that a part of the institution parsing function, or we could provide guidance in the vignette on how users can do it themselves. At that point the data are all in a data frame, so it won't take anything fancy to subset the file; all the function or user would be doing is using some standard R subsetting code.
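
As a rough illustration of what that "standard R subsetting code" might look like for point 2, here is a minimal base R sketch. It assumes dat has columns like the ones selected in the snippet above; the toy rows and the output file name are invented for illustration and are not from refnet.

```r
# Toy stand-in for `dat`; these rows are made up for illustration only.
dat <- data.frame(
  university = c("Univ Florida", "Univ Florida", "Univ Sao Paulo"),
  department = c("Dept Wildlife Ecol & Conservat", "Dept Biol", "Inst Biociencias"),
  country    = c("USA", "USA", "Brazil"),
  zip        = c("32611", "32611", NA),
  stringsAsFactors = FALSE
)

# Keep only the institution-level columns and drop duplicate rows.
institutions <- unique(dat[, c("university", "department", "country")])

# Subset to a single country with ordinary logical indexing.
usa_only <- institutions[institutions$country == "USA", ]

# Write the table out for users who want a csv (hypothetical file name).
out_file <- file.path(tempdir(), "institutions.csv")
write.csv(institutions, out_file, row.names = FALSE)
```

Nothing here depends on a package, which is the argument for covering it in the vignette rather than adding another function.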

embruna commented 6 years ago

Yes, that's the code I meant. I was thinking about it in the step-by-step way for people less familiar with R.

I know what I would want: a data frame of each unique author with their ID number, and the information we can parse out of the address (institution, country...maybe city or zip?). But people move, so some people will have multiple entries for institution, country, city, and zip.

One way to make it easier might be to do it as part of refine_authors, which already lists each author of each article. All we'd be doing is adding columns for their country, institution, etc. **for that reference record** to the end of that.
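
A minimal sketch of that idea in base R, under stated assumptions: the column names authorID and refID, the addresses, and the simple comma and last-word parsing are all hypothetical stand-ins, not the actual refine_authors output or parsing rules.

```r
# Hypothetical per-record author table: one row per author per reference.
# Column names and addresses are assumptions made up for illustration.
authors <- data.frame(
  authorID = c(1, 1, 2),
  refID    = c(10, 11, 10),
  address  = c("Univ Florida, Dept Biol, Gainesville, FL 32611 USA",
               "Louisiana State Univ, Dept Biol Sci, Baton Rouge, LA 70803 USA",
               "Univ Florida, Dept Biol, Gainesville, FL 32611 USA"),
  stringsAsFactors = FALSE
)

# Add parsed columns for that reference record to the end of the table.
# Institution: everything before the first comma.
authors$university <- trimws(sub(",.*$", "", authors$address))
# Country: the last space-delimited word of the address.
authors$country <- sub("^.* ", "", authors$address)

# One row per unique author/institution combination, so an author who
# moved keeps one row for each institution.
author_institutions <- unique(authors[, c("authorID", "university", "country")])
```

In this toy table, author 1 ends up with two rows (one per institution) and author 2 with one, which is the "people move" case described above.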

embruna commented 6 years ago

Would rOpenSci's geoparser be useful?

aurielfournier commented 6 years ago

Address parsing is now a part of read_authors.

It pulls out the university, department, postal code, country name, and address.

I'll work next on the address_lat_long() function, so that it does just the georeferencing.

embruna commented 6 years ago

Cool. Does it append it to an existing output or does it spit out a separate one? Just want to update the vignette before I forget.

embruna commented 6 years ago

Found it, this is awesome.

ropensci / refsplitr

Address Parsing and Georeferencing as 2 steps #29