read_authors() is failing to pull in a large number of the author addresses.

embruna commented 6 years ago

read_authors() is failing to pull in a large number of the author addresses. For instance, the papers in the 2018 WOS dataset have 21400 authors, but read_authors() only pulled in addresses for 59.4% of them. This is probably due to the way the addresses are organized in the WOS AD cell (what separates groups of authors: brackets, colons, commas, etc.), with the code failing to recognize these breakpoints.

For example, search the output of read_authors and read_references for Silman, M. There are seven records in read_references, all with an address. In the read_authors output "master" all 7 are there, but only 3 have the address.

aurielfournier commented 6 years ago

hmm.

I suppose this problem isn't surprising, since every journal does things differently. I need to think this over more. There might be a way to have the function figure out, line by line, what character is probably the one it needs to parse the address by, but I need to think through how that would work.

Any ideas, @birderboone?

birderboone commented 6 years ago

So this is actually a known problem I had forgotten about. Here's what's happening in Emilio's "Silman, Miles" example:

In the 4 cases where an address is linked, the name lines up with the affiliation. The address column lists them like [Author1] address, [Author2] address, etc. This way we can link the author name to the address really easily.
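As a rough illustration of why the bracketed form is easy to link, here is a minimal Python sketch. It is not refsplitr's actual R code: the `parse_bracketed` helper and the sample string are hypothetical, and it assumes that names inside a bracket group are separated by semicolons and that each address runs up to the next opening bracket.

```python
import re

def parse_bracketed(field):
    """Map each author name to the address that follows its bracket group.

    Hypothetical helper: assumes '[Name; Name] address' groups, where the
    address extends until the next '[' or the end of the field.
    """
    pairs = {}
    for names, address in re.findall(r"\[([^\]]+)\]\s*([^\[]+)", field):
        # Drop the separator left between one address and the next group.
        address = address.strip().rstrip(";,").strip()
        for name in names.split(";"):
            pairs[name.strip()] = address
    return pairs

# Made-up field content in the bracket/address style described above.
field = "[Author1; Author2] Univ A, Dept Biol, City, USA; [Author3] Univ B, City, Peru"
linked = parse_bracketed(field)
print(linked["Author3"])
```

Because every address carries its own list of names, no guessing about order or counts is needed.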

Not all journals store the addresses in this bracket/address manner; some just list them. In the ones where the code doesn't match them, the record simply lists addresses w/o names. Which would be ok IF there were an equal number of addresses as there are author names. Which there AREN'T.

"Vegetation dynamics of predator-free land-bridge islands" from WOS2018/G.txt has 5 authors and 4 addresses.

"A comparison of tree species diversity in two upper Amazonian forests" from WOS2018/K.txt has 8 authors and 5 addresses.

And as far as I can figure out there is no way to know WHO the addresses belong to except the very first address and author. Unless I'm completely missing a column that rectifies this.
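To make the mismatch concrete, here is a small Python sketch of the only linking rule that seems safe for records without brackets. It is an illustration, not the package's code: `link_plain_addresses` is a hypothetical name, the author and address values are placeholders, and pairing by position when the counts are equal is itself an assumption that the two lists share the same order.

```python
def link_plain_addresses(authors, addresses):
    """Link authors to addresses for records that list addresses without names.

    Hypothetical helper. If the counts match, pair by position (assumes the
    lists share the same order). Otherwise only the first author and first
    address are linked; everyone else stays None.
    """
    if len(authors) == len(addresses):
        return dict(zip(authors, addresses))
    linked = {author: None for author in authors}
    if authors and addresses:
        linked[authors[0]] = addresses[0]
    return linked

# Placeholder data with the same shape as the 5-author / 4-address record.
authors = ["A1", "A2", "A3", "A4", "A5"]
addresses = ["Addr1", "Addr2", "Addr3", "Addr4"]
result = link_plain_addresses(authors, addresses)
unmatched = [a for a, addr in result.items() if addr is None]
print(len(unmatched), "authors left without a linked address")
```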

birderboone commented 6 years ago

Also, while working through issue #28: emails are the same way. We can't match most emails because there's no telling who the email actually belongs to.

aurielfournier commented 6 years ago

That is a bummer, but not much can be done about it.

embruna commented 6 years ago

I completely forgot about this problem. Yes, many of the journal records, even in the same year, differ in the way in which the author addresses are organized.

While we can't assign each one individually, can we at least flag it somehow to distinguish between cases (are there even any?) where an author really has no address and those where it simply can't be pulled out? In other words, instead of a blank or NA, can we put in something like "Unable to extract"? It's worth pointing out to people in the vignette that they can see the possible addresses in the address field, but to assign them to authors they have to go to the original article.

Alternatively, maybe it should be the opposite: put in all addresses, and let users know that it's one of the ones there, but because of the WOS file structure they can't be definitively associated.

birderboone commented 6 years ago

My only concern about pulling them all in and just announcing they couldn't be matched is that I don't know how that would be stored. It's not like we can do analysis on that.

However, highlighting cases where it just can't be matched up might allow people to manually change the source file or input, albeit with a lot more work on their part. I'll definitely add in the 'can't be matched' value because that's a simple solution.

birderboone commented 6 years ago

Added a 'Could not be extracted' value; this pairs well with the 'No affiliation' designation.
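A minimal Python sketch of how the two labels could be told apart is shown here. It is not the package's actual implementation: `address_label` is a hypothetical helper, and it assumes that 'No affiliation' means the record lists no addresses at all, while 'Could not be extracted' means addresses exist but none could be tied to this author.

```python
def address_label(matched_address, record_addresses):
    """Return the address to store for one author.

    Hypothetical helper. Assumed meanings of the two labels:
      'No affiliation'         -> the record contains no addresses at all
      'Could not be extracted' -> addresses exist but none could be linked
    """
    if matched_address:
        return matched_address
    if record_addresses:
        return "Could not be extracted"
    return "No affiliation"

print(address_label(None, ["Addr1", "Addr2"]))
print(address_label(None, []))
```

Storing a label instead of a blank or NA lets users filter for the records they may want to check against the original article.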

ropensci / refsplitr

read_authors() is failing to pull in a large number of the author addresses. #35