ropensci / refsplitr

R package for processing, organizing, and visualizing reference records downloaded from the Web of Science.
https://docs.ropensci.org/refsplitr

simplify output of foo_authors to simplify manual verification #40

Closed embruna closed 6 years ago

embruna commented 6 years ago

I've been thinking about how to simplify the presentation of the information in the file users need to review to decide if disambiguation was done correctly. The following suggestions are based on lots of personal experience reviewing tables like this: you want to simplify as much as possible and keep only high-information cells.

First, I suggest we delete the following columns from foo_authors and foo_authors.csv:

  1. PU
  2. Author order
  3. PT
  4. PY
  5. UT or refID (only need one, refID is shorter so maybe keep it)
  6. RP address (the reprint email address is not always the address of the author being double-checked)
  7. university
  8. department
  9. short address
  10. postal code
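
For illustration only, the pruning step listed above could be sketched as follows. This is a language-neutral sketch written in Python, not code from refsplitr. The column names come from the foo_authors header shown in this issue; the `DROP` set and the `prune_columns` helper are hypothetical, and the sketch assumes refID is the identifier that is kept, so UT is dropped.

```python
# Hypothetical sketch: drop low-information columns before manual review.
# Column names follow the foo_authors header shown in this issue.
HEADER = ("groupID match_name similarity author_order address university "
          "department short_address postal_code country RP_address RI OI EM "
          "UT refID PT PY PU").split()

# Columns proposed for removal (refID is kept, so UT is dropped).
DROP = {"PU", "author_order", "PT", "PY", "UT", "RP_address",
        "university", "department", "short_address", "postal_code"}


def prune_columns(rows):
    """Return copies of the row dicts without the columns named in DROP."""
    return [{k: v for k, v in row.items() if k not in DROP} for row in rows]


# One placeholder record: every field is None except a few identifiers.
row = dict.fromkeys(HEADER)
row.update(groupID=39, refID=7, UT="WOS:000285339300008")

pruned = prune_columns([row])
print(list(pruned[0]))
```

The remaining columns are groupID, match_name, similarity, address, country, RI, OI, EM, and refID.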

Second, I suggest we rearrange the table to simplify comparison.

The comparison currently has names in long format:

| groupID | match_name | similarity | author_order | address | university | department | short_address | postal_code | country | RP_address | RI | OI | EM | UT | refID | PT | PY | PU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | NA | NA | 2 | INIBIOMA CONICET UNCo, Lab Ecotono, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | INIBIOMA CONICET UNCo | Lab Ecotono | RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | RA-8400 | Argentina | NA | NA | NA | NA | WOS:000285339300008 | 7 | J | 2010 | NA |
| 39 | Lozada, Mariana | 0.92 | 1 | INIBIOMA, Lab Ecotono, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | INIBIOMA | Lab Ecotono | RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | RA-8400 | Argentina | INIBIOMA, Lab Ecotono, Quintral 1250, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | NA | NA | mlozada@crub.uncoma.edu.ar | WOS:000266655800033 | 1139 | J | 2009 | NA |
| 40 | NA | NA | 1 | INIBIOMA CONICET UNCo, Lab Fotobiol, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | INIBIOMA CONICET UNCo | Lab Fotobiol | RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | RA-8400 | Argentina | INIBIOMA CONICET UNCo, Lab Fotobiol, Quintral 1250, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | NA | NA | NA | WOS:000285339300008 | 7 | J | 2010 | NA |
| 40 | Milano, Daniela | 0.92 | 4 | NA | NA | NA | No Affiliation | NA | NA | NA | NA | NA | NA | WOS:000187549700010 | 4675 | J | 2004 | NA |

It is much easier to compare author names and evaluate whether they are the same person if the info is side by side: it lets you compare the name variants and their similarity score, then compare the other information about them. Comparing is easier when cells are adjacent to each other (as opposed to above/below), and eliminating the columns with less useful info makes the table easier on the eyes. This wide format also cuts down on the number of rows.


| Name-1 | Name-2 | similarity | authorID-1 | authorID-2 | proposed groupID | country-1 | country-2 | address-1 | address-2 | RI-1 | RI-2 | OI-1 | OI-2 | refID-1 | refID-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lozada, Mariana | Lozada, M | 0.92 | 39 | 5303 | 39 | Argentina | Argentina | INIBIOMA CONICET UNCo, Lab Ecotono, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | INIBIOMA, Lab Ecotono, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | | | | | 7 | 1139 |
| Milano, Daniela | Milano, D | 0.92 | 4 | 19776 | 4 | Argentina | NA | INIBIOMA CONICET UNCo, Lab Fotobiol, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | NA | NA | NA | NA | NA | 7 | 4675 |
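
For illustration only, the long-to-wide rearrangement could be sketched roughly as below for the simple case where a group holds exactly two records. This is a Python sketch under stated assumptions, not refsplitr code: the `to_wide` helper, the `PAIRED` list, and the per-record field names `name` and `authorID` are hypothetical, missing values are represented as `None`, and the addresses in the sample data are abbreviated.

```python
from collections import defaultdict

# Per-record fields that get a -1 / -2 suffix in the wide row (assumed names).
PAIRED = ["name", "authorID", "country", "address", "refID"]


def to_wide(rows):
    """Pair up groups that contain exactly two records; skip all others."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["groupID"]].append(row)

    wide = []
    for group_id, members in groups.items():
        if len(members) != 2:
            continue  # larger groups need one row per pair; not handled here
        # Use the first similarity score recorded in the group, if any.
        scores = [m["similarity"] for m in members if m["similarity"] is not None]
        out = {"groupID": group_id, "similarity": scores[0] if scores else None}
        for i, member in enumerate(members, start=1):
            for field in PAIRED:
                out[f"{field}-{i}"] = member[field]
        wide.append(out)
    return wide


# Sample records based on the first wide row above (addresses abbreviated).
rows = [
    {"groupID": 39, "name": "Lozada, Mariana", "authorID": 39,
     "similarity": None, "country": "Argentina",
     "address": "INIBIOMA CONICET UNCo, Lab Ecotono", "refID": 7},
    {"groupID": 39, "name": "Lozada, M", "authorID": 5303,
     "similarity": 0.92, "country": "Argentina",
     "address": "INIBIOMA, Lab Ecotono", "refID": 1139},
]

wide_rows = to_wide(rows)
print(wide_rows[0]["name-1"], "|", wide_rows[0]["name-2"])
```

Groups with more than two records are deliberately skipped in this sketch, since they would need one wide row per pair of records.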

It will take some moving around and renaming columns, then reorganizing again to merge with refine_references, but I think it should be pretty quick. What do you think?

aurielfournier commented 6 years ago

I think this sounds quite feasible. I'll play around with it and see how simple it would be to implement.

aurielfournier commented 6 years ago

So limiting the number of columns is super easy, and that is taken care of.

But I'm running into some issues with moving the data into a wider form.

It's kind of tricky to do when there are only 2 rows per groupID [so only two papers per possible author].

But once there are more than 2, it gets super messy: you need a new wide row for each possible combination, and that gets very tricky.
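
For illustration only, here is a small Python sketch (not refsplitr code) of why the wide form grows quickly: if every pair of records in a group needs its own wide row, a group of n records produces n(n-1)/2 rows. The `pair_rows` helper is hypothetical.

```python
from itertools import combinations


def pair_rows(members):
    """Return every unordered pair of records within one group."""
    return list(combinations(members, 2))


# Number of wide rows needed for groups of increasing size.
for n in (2, 3, 5, 10):
    print(n, "records ->", len(pair_rows(list(range(n)))), "wide rows")
```

A group of 2 records needs a single wide row, but a group of 10 already needs 45.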

I wonder if, instead of creating this wider form, we could fill in the rows like below: we fill in the match_name and similarity NAs with the possible matches, which would help the user evaluate those matches?

Thoughts, @embruna?

| groupID | match_name | similarity | author_order | address | university | department | short_address | postal_code | country | RP_address | RI | OI | EM | UT | refID | PT | PY | PU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | Lozada, Mariana | 0.92 | 2 | INIBIOMA CONICET UNCo, Lab Ecotono, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | INIBIOMA CONICET UNCo | Lab Ecotono | RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | RA-8400 | Argentina | NA | NA | NA | NA | WOS:000285339300008 | 7 | J | 2010 | NA |
| 39 | Lozada, Mariana | 0.92 | 1 | INIBIOMA, Lab Ecotono, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | INIBIOMA | Lab Ecotono | RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | RA-8400 | Argentina | INIBIOMA, Lab Ecotono, Quintral 1250, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | NA | NA | mlozada@crub.uncoma.edu.ar | WOS:000266655800033 | 1139 | J | 2009 | NA |
| 40 | Milano, Daniela | 0.92 | 1 | INIBIOMA CONICET UNCo, Lab Fotobiol, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | INIBIOMA CONICET UNCo | Lab Fotobiol | RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | RA-8400 | Argentina | INIBIOMA CONICET UNCo, Lab Fotobiol, Quintral 1250, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | NA | NA | NA | WOS:000285339300008 | 7 | J | 2010 | NA |
| 40 | Milano, Daniela | 0.92 | 4 | NA | NA | NA | No Affiliation | NA | NA | NA | NA | NA | NA | WOS:000187549700010 | 4675 | J | 2004 | NA |

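
For illustration only, the fill-in approach shown in the table above could be sketched as follows. This is a Python sketch, not refsplitr code: the `fill_group_matches` helper is hypothetical, missing values are represented as `None`, and it assumes each groupID has a single recorded match (if a group had several, the first one seen would be used).

```python
def fill_group_matches(rows):
    """Copy each group's recorded match_name/similarity into its NA rows."""
    # First pass: remember the first recorded match for each groupID.
    known = {}
    for row in rows:
        if row["match_name"] is not None:
            known.setdefault(row["groupID"],
                             (row["match_name"], row["similarity"]))

    # Second pass: fill the gaps on copies, leaving the input untouched.
    filled = []
    for row in rows:
        new = dict(row)
        if new["match_name"] is None and new["groupID"] in known:
            new["match_name"], new["similarity"] = known[new["groupID"]]
        filled.append(new)
    return filled


rows = [
    {"groupID": 39, "match_name": None, "similarity": None, "refID": 7},
    {"groupID": 39, "match_name": "Lozada, Mariana", "similarity": 0.92,
     "refID": 1139},
    {"groupID": 40, "match_name": None, "similarity": None, "refID": 7},
    {"groupID": 40, "match_name": "Milano, Daniela", "similarity": 0.92,
     "refID": 4675},
]

for row in fill_group_matches(rows):
    print(row["groupID"], row["match_name"], row["similarity"], row["refID"])
```

Unlike the wide form, this keeps one row per record, so the number of rows does not grow with group size.
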
embruna commented 6 years ago

I think that will have to do.

aurielfournier commented 6 years ago

Excellent. I'll get it built into the package more formally this weekend.

embruna commented 6 years ago

Hey, can we talk this over just to make sure? I still have some doubts, but think they will be cleared up quicker over the phone. Let's circulate a common file to look over and discuss.