simplify output of foo_authors to simplify manual verification

embruna commented 6 years ago

I've been thinking about how to simplify the file users need to review to decide whether disambiguation was done correctly. The following suggestions come from a lot of personal experience reviewing tables like this: you want to simplify as much as possible and keep only high-information cells.

1) First, I suggest we delete the following columns from foo_authors and foo_authors.csv:

PU
Author order
PT
PY
UT or refID (only need one, refID is shorter so maybe keep it)
RP address (the reprint email address is not always the address of the author being double-checked)
university
department
short address
postal code
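
A minimal sketch of the column trimming, written in Python purely for illustration (it is not the package's own code). The column names are taken from the long-format header shown in this issue; the `drop_columns` helper and the tiny inline sample are hypothetical stand-ins for foo_authors.csv.

```python
import csv
import io

# Columns proposed for removal, spelled as in the table header in this issue.
DROP = {"PU", "author_order", "PT", "PY", "UT", "RP_address",
        "university", "department", "short_address", "postal_code"}

def drop_columns(rows, drop=DROP):
    """Return copies of the row dicts without the dropped columns."""
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]

# Hypothetical stand-in for foo_authors.csv with only a few of the columns.
sample = io.StringIO(
    "groupID,match_name,similarity,author_order,country,UT,refID,PY\n"
    "39,NA,NA,2,Argentina,WOS:000285339300008,7,2010\n"
)
slim = drop_columns(list(csv.DictReader(sample)))
print(slim)
```

Keeping refID and dropping UT follows the suggestion above that only one of the two identifiers is needed.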

2) Second, I suggest we rearrange the table to simplify comparison.

The comparison currently has names in long format:

groupID	match_name	similarity	author_order	address	university	department	short_address	postal_code	country	RP_address	RI	OI	EM	UT	refID	PT	PY	PU
39	NA	NA	2	INIBIOMA CONICET UNCo, Lab Ecotono, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	INIBIOMA CONICET UNCo	Lab Ecotono	RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	RA-8400	Argentina	NA	NA	NA	NA	WOS:000285339300008	7	J	2010	NA
39	Lozada, Mariana	0.92	1	INIBIOMA, Lab Ecotono, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	INIBIOMA	Lab Ecotono	RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	RA-8400	Argentina	INIBIOMA, Lab Ecotono, Quintral 1250, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	NA	NA	mlozada@crub.uncoma.edu.ar	WOS:000266655800033	1139	J	2009	NA
40	NA	NA	1	INIBIOMA CONICET UNCo, Lab Fotobiol, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	INIBIOMA CONICET UNCo	Lab Fotobiol	RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	RA-8400	Argentina	INIBIOMA CONICET UNCo, Lab Fotobiol, Quintral 1250, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	NA	NA	NA	WOS:000285339300008	7	J	2010	NA
40	Milano, Daniela	0.92	4	NA	NA	NA	No Affiliation	NA	NA	NA	NA	NA	NA	WOS:000187549700010	4675	J	2004	NA

It is much easier to compare author names and evaluate whether they are the same person if the info is side-by-side. That layout lets you compare the name variants and their similarity score, then compare the other information about them. Comparing is easier when cells are adjacent to each other (as opposed to above/below), and eliminating the columns with less useful info makes the table easier on the eyes. This wide format also cuts down on the number of rows.

Name-1	Name-2	similarity	authorID-1	authorID-2	proposed groupID	country-1	country-2	address-1	address-2	RI-1	RI-2	OI-1	OI-2	refID-1	refID-2
Lozada, Mariana	Lozada, M	0.92	39	5303	39	Argentina	Argentina	INIBIOMA CONICET UNCo, Lab Ecotono, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	INIBIOMA, Lab Ecotono, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.					7	1139
Milano, Daniela	Milano, D	0.92	4	19776	4	Argentina	NA	INIBIOMA CONICET UNCo, Lab Fotobiol, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	NA	NA	NA	NA	NA	7	4675
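
One way the side-by-side layout could be produced for a single pair of records, sketched in Python for illustration only. The `widen_pair` helper and its `keep` default are assumptions made for this example, and the addresses are shortened from the rows above.

```python
def widen_pair(first, second, keep=("country", "address", "RI", "OI", "refID")):
    """Merge two long-format rows for one candidate match into one wide row.

    Each kept column becomes an adjacent "<col>-1" / "<col>-2" pair so the
    values being compared sit next to each other.
    """
    wide = {}
    for col in keep:
        wide[f"{col}-1"] = first.get(col, "NA")
        wide[f"{col}-2"] = second.get(col, "NA")
    return wide

# Two records from the same proposed group (addresses shortened).
row_a = {"country": "Argentina", "address": "INIBIOMA CONICET UNCo, Lab Ecotono",
         "RI": "NA", "OI": "NA", "refID": "7"}
row_b = {"country": "Argentina", "address": "INIBIOMA, Lab Ecotono",
         "RI": "NA", "OI": "NA", "refID": "1139"}

wide = widen_pair(row_a, row_b)
print(list(wide))
```

This only covers the two-record case; groups with more records would need one such row per pair.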

It will take some moving around and renaming columns, then reorganizing again to merge with refine_references, but I think it should be pretty quick. What do you think?

aurielfournier commented 6 years ago

I think this sounds quite feasible. I'll play around with it and see how simple it would be to implement.

aurielfournier commented 6 years ago

So limiting the number of columns is super easy, and that is taken care of.

But I'm running into some issues with moving the data into a wider form.

It's kind of tricky to do when there are only 2 rows per groupID [so only two papers per possible author].

But once there are more than 2, it gets super messy: you need a new wide row for each possible combo, and that gets very tricky.
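
To give a feel for how fast this grows, here is a small Python sketch. It assumes "each possible combo" means every unordered pair of records within a groupID; the `pairwise_rows` helper is hypothetical.

```python
from itertools import combinations

def pairwise_rows(records):
    """Every unordered pair of records in one group; each pair would be one wide row."""
    return list(combinations(records, 2))

# Number of wide rows needed as a group gains more records.
counts = {n: len(pairwise_rows(range(n))) for n in (2, 3, 5, 10)}
for n, wide_rows in counts.items():
    print(n, "rows in group ->", wide_rows, "wide rows")
```

Under that assumption a group of n records needs n(n-1)/2 wide rows, so a group of 10 already yields 45 rows where the long format has 10.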

I wonder if, instead of creating this wider form, we could fill in the rows like below: replace the NAs in match_name and similarity with the possible match, which would help the user evaluate those matches.

Thoughts @embruna ?

groupID	match_name	similarity	author_order	address	university	department	short_address	postal_code	country	RP_address	RI	OI	EM	UT	refID	PT	PY	PU
39	Lozada, Mariana	0.92	2	INIBIOMA CONICET UNCo, Lab Ecotono, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	INIBIOMA CONICET UNCo	Lab Ecotono	RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	RA-8400	Argentina	NA	NA	NA	NA	WOS:000285339300008	7	J	2010	NA
39	Lozada, Mariana	0.92	1	INIBIOMA, Lab Ecotono, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	INIBIOMA	Lab Ecotono	RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	RA-8400	Argentina	INIBIOMA, Lab Ecotono, Quintral 1250, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	NA	NA	mlozada@crub.uncoma.edu.ar	WOS:000266655800033	1139	J	2009	NA
40	Milano, Daniela	0.92	1	INIBIOMA CONICET UNCo, Lab Fotobiol, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	INIBIOMA CONICET UNCo	Lab Fotobiol	RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	RA-8400	Argentina	INIBIOMA CONICET UNCo, Lab Fotobiol, Quintral 1250, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina.	NA	NA	NA	WOS:000285339300008	7	J	2010	NA
40	Milano, Daniela	0.92	4	NA	NA	NA	No Affiliation	NA	NA	NA	NA	NA	NA	WOS:000187549700010	4675	J	2004	NA
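
A rough Python sketch of this fill-in step, for illustration only. It assumes each groupID carries at most one distinct match_name and simply copies the first non-NA value onto the NA rows of the same group; `fill_group_matches` is a hypothetical helper, not a function from the package.

```python
def fill_group_matches(rows):
    """Copy match_name/similarity from a matched row onto NA rows in the same groupID."""
    known = {}
    for row in rows:
        if row["match_name"] != "NA":
            # Keep the first non-NA match seen for each group (an assumption).
            known.setdefault(row["groupID"], (row["match_name"], row["similarity"]))
    filled = []
    for row in rows:
        new = dict(row)  # leave the input rows untouched
        if new["match_name"] == "NA" and new["groupID"] in known:
            new["match_name"], new["similarity"] = known[new["groupID"]]
        filled.append(new)
    return filled

rows = [
    {"groupID": "39", "match_name": "NA", "similarity": "NA", "refID": "7"},
    {"groupID": "39", "match_name": "Lozada, Mariana", "similarity": "0.92", "refID": "1139"},
    {"groupID": "40", "match_name": "NA", "similarity": "NA", "refID": "7"},
    {"groupID": "40", "match_name": "Milano, Daniela", "similarity": "0.92", "refID": "4675"},
]
filled = fill_group_matches(rows)
for row in filled:
    print(row["groupID"], row["match_name"], row["similarity"], row["refID"])
```

If a group could contain several different match names, the first-seen rule here would need replacing with something more careful.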

embruna commented 6 years ago

I think that will have to do.

aurielfournier commented 6 years ago

Excellent. I'll get it built into the package more formally this weekend.

embruna commented 6 years ago

Hey, can we talk this over just to make sure? I still have some doubts, but think they will be cleared up quicker over the phone. Let's circulate a common file to look over and discuss.

ropensci / refsplitr

simplify output of foo_authors to simplify manual verification #40