ropensci / refsplitr

R package for processing, organizing, and visualizing reference records downloaded from the Web of Science.
https://docs.ropensci.org/refsplitr

Author Country after Parsing address #36

Closed embruna closed 6 years ago

embruna commented 6 years ago

I looked over the results of the address parsing to see how often it gets the country right. I will send details via Slack, but in short I will propose an alternate method for filling in the "country" column to minimize the (relatively few) errors.

Suggestion: use Emilio's "back to last comma" method.
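
For readers unfamiliar with the idea, here is a minimal sketch of what a "back to last comma" extraction could look like in base R. It assumes the method means "keep whatever follows the final comma"; the helper name `country_after_last_comma` is hypothetical and is not part of refsplitr.

```r
# Hypothetical helper, not the refsplitr implementation:
# treat the text after the last comma of an address as the country.
country_after_last_comma <- function(address) {
  # The greedy "^.*," consumes everything up to and including the final comma.
  # If an address contains no comma, sub() returns it unchanged.
  tail_part <- sub("^.*,", "", address)
  # Drop a trailing period, then surrounding whitespace
  trimws(sub("\\.\\s*$", "", tail_part))
}

country_after_last_comma(" BR-50372970 Recife, PE, Brazil.")
# [1] "Brazil"
```

This only works when the country really is the last comma-delimited field of the address.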

aurielfournier commented 6 years ago

This should be easy to implement. I think it's best to wait until we finish addressing #35 since, based on that issue, not all of the addresses are comma delimited.

embruna commented 6 years ago

I see this as two distinct issues:

1) Can we associate an address and author (#35): Y/N

2) If so, let's tear apart the address:
   a) extract the country via the back-to-last-comma method
   b) extract the rest of the address using other methods

What do you think?

aurielfournier commented 6 years ago

Here is what we are doing right now.

We split the address on commas into the university, the department, and the 'short_address', which is basically the street address.

library(tidyr)    # separate()
library(dplyr)    # mutate(), %>%
library(stringr)  # str_extract()

# split on commas; everything after the second comma is merged into short_address
origAddress <- separate(data = final,
                        col = address,
                        into = c("university", "department", "short_address"),
                        sep = ",", extra = "merge", remove = FALSE) %>%
  # extracts postal code
  mutate(postal_code = str_extract(string = short_address,
                                   pattern = "[:alpha:]{2}[:punct:]{1}[:digit:]{1,8}|[:space:][:upper:][:digit:][:upper:][:space:][:digit:][:upper:][:digit:]|[:alpha:][:punct:][:digit:]{4}")) %>%
  # extracts postal code a second way if the first doesn't work
  mutate(postal_code = ifelse(is.na(postal_code),
                              str_extract(string = short_address,
                                          pattern = "[:space:][:digit:]{5}"),
                              postal_code)) %>%
  # extracts the postal code a third way if the first two don't work
  mutate(postal_code = ifelse(is.na(postal_code),
                              str_extract(string = short_address,
                                          pattern = "[:upper:]{1,2}[:alnum:]{1,3}[:space:][:digit:][:alnum:]{1,3}"),
                              postal_code))

Which leaves us with something that looks like this

 head(origAddress$short_address)
[1] " Gainesville, FL 32611 USA."     
[2] " BR-50372970 Recife, PE, Brazil."
[3] " BR-50372970 Recife, PE, Brazil."
[4] " BR-50372970 Recife, PE, Brazil."
[5] " BR-50372970 Recife, PE, Brazil."
[6] " BR-50372970 Recife, PE, Brazil."

The issue is that, for some reason, USA is not comma delimited in this dataset, while most of the other countries are. That makes it hard to rely on comma delimitation alone to extract country names.

My memory is not always the best, but I think this is why I designed the country name extraction the way that I did, to get around this issue.
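
The comma problem described above can be reproduced with two of the `short_address` values shown earlier. This is only an illustrative sketch in base R; the variable name `last_field` is an assumption, not something from the package.

```r
# Two short_address values copied from the output above
short_address <- c(" Gainesville, FL 32611 USA.",
                   " BR-50372970 Recife, PE, Brazil.")

# Keep the text after the last comma, then drop the trailing period and spaces
last_field <- trimws(sub("\\.\\s*$", "", sub("^.*,", "", short_address)))
last_field
# [1] "FL 32611 USA" "Brazil"
```

The Brazilian record yields a clean country name, while the US record still carries the state and zip code because no comma separates them from "USA".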

birderboone commented 6 years ago

I rewrote how we parse addresses, which adds some logic we can build on. This country issue seems to be resolved. Basically, I took care of it using the last comma, and then dealt with USA specifically to pull out the zip and state. I'm tentatively calling this resolved, but let's let everyone try to break it first.
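
A rough sketch of the two-step idea (last comma first, then special handling for USA) might look like the following. This is not the rewritten refsplitr code: the function name `parse_country`, the output column names, and the assumption that US records end in a two-letter state, a five-digit zip, and "USA" are all hypothetical.

```r
# Hypothetical sketch, not the actual refsplitr parser
parse_country <- function(short_address) {
  # Step 1: assume the last comma-delimited field is the country
  last_field <- trimws(sub("\\.\\s*$", "", sub("^.*,", "", short_address)))
  out <- data.frame(
    country = last_field,
    state = NA_character_,
    postal_code = NA_character_,
    stringsAsFactors = FALSE
  )
  # Step 2: US records are assumed to look like "FL 32611 USA",
  # so split the state and zip code out of that field
  us_pattern <- "^([A-Z]{2}) ([0-9]{5}) USA$"
  is_us <- grepl(us_pattern, last_field)
  out$state[is_us] <- sub(us_pattern, "\\1", last_field[is_us])
  out$postal_code[is_us] <- sub(us_pattern, "\\2", last_field[is_us])
  out$country[is_us] <- "USA"
  out
}

parse_country(c(" Gainesville, FL 32611 USA.",
                " BR-50372970 Recife, PE, Brazil."))
```

Records that match neither shape keep whatever followed the last comma, so they remain easy to spot and fix by hand.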

embruna commented 6 years ago

In short, yeah - the USA without comma is messy (and who knows why it's set up that way). The easiest thing for me was to hack around it in a two-step format - one for USA and one for everything else. I'll try this new version and report back.