embruna closed this issue 6 years ago
I think this sounds quite feasible. I'll play around with it and see how simple it would be to implement.
So limiting the number of columns is super easy, and that is taken care of.
But I'm running into some issues with moving the data into a wider form.
It's kind of tricky to do when there are only 2 rows per groupID (so only two papers per possible author).
But once there are more than 2, it gets super messy, since you need a new wide row for each possible combo, and that gets very tricky.
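To put a rough number on "super messy": a quick Python sketch (not package code), assuming "each possible combo" means one wide row per unordered pair of records within a groupID. The function name `wide_rows_needed` is made up for illustration.

```python
from math import comb

def wide_rows_needed(group_size):
    """Number of wide rows for one groupID.

    Assumption: one wide row per unordered pair of records in the group.
    """
    return comb(group_size, 2)

# Pair counts for a few group sizes (rows per groupID).
pair_counts = {n: wide_rows_needed(n) for n in (2, 3, 5, 10)}
print(pair_counts)
```

Under that assumption the row count grows quadratically with the number of papers per possible author, so a group with 2 rows needs a single wide row but larger groups expand quickly.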
I wonder if, instead of creating this wider form, we could fill in the rows like below: the NAs in match_name and the similarity index get filled in with the possible matches, which would help the user evaluate those matches?
Thoughts @embruna ?
groupID | match_name | similarity | author_order | address | university | department | short_address | postal_code | country | RP_address | RI | OI | EM | UT | refID | PT | PY | PU |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
39 | Lozada, Mariana | 0.92 | 2 | INIBIOMA CONICET UNCo, Lab Ecotono, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | INIBIOMA CONICET UNCo | Lab Ecotono | RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | RA-8400 | Argentina | NA | NA | NA | NA | WOS:000285339300008 | 7 | J | 2010 | NA |
39 | Lozada, Mariana | 0.92 | 1 | INIBIOMA, Lab Ecotono, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | INIBIOMA | Lab Ecotono | RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | RA-8400 | Argentina | INIBIOMA, Lab Ecotono, Quintral 1250, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | NA | NA | mlozada@crub.uncoma.edu.ar | WOS:000266655800033 | 1139 | J | 2009 | NA |
40 | Milano, Daniela | 0.92 | 1 | INIBIOMA CONICET UNCo, Lab Fotobiol, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | INIBIOMA CONICET UNCo | Lab Fotobiol | RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | RA-8400 | Argentina | INIBIOMA CONICET UNCo, Lab Fotobiol, Quintral 1250, RA-8400 San Carlos De Bariloche, Rio Negro, Argentina. | NA | NA | NA | WOS:000285339300008 | 7 | J | 2010 | NA |
40 | Milano, Daniela | 0.92 | 4 | NA | NA | NA | No Affiliation | NA | NA | NA | NA | NA | NA | WOS:000187549700010 | 4675 | J | 2004 | NA |
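A minimal sketch of the fill-in idea, in plain Python rather than the package's own code. It assumes each record is a dict, that missing values are `None`, and that a missing match_name or similarity should take the first non-missing value found in the same groupID. The function name `fill_group_matches` and the input rows (which cells start out missing) are hypothetical.

```python
from collections import defaultdict

def fill_group_matches(rows, keys=("match_name", "similarity")):
    """Fill missing values of `keys` from another row with the same groupID."""
    # First pass: remember the first non-missing value per group and key.
    known = defaultdict(dict)
    for row in rows:
        for key in keys:
            if row.get(key) is not None:
                known[row["groupID"]].setdefault(key, row[key])
    # Second pass: return copies of the rows with the gaps filled in.
    filled = []
    for row in rows:
        new_row = dict(row)
        for key in keys:
            if new_row.get(key) is None:
                new_row[key] = known[row["groupID"]].get(key)
        filled.append(new_row)
    return filled

# Hypothetical input: one row per group is missing its match information.
rows = [
    {"groupID": 39, "match_name": None, "similarity": None, "author_order": 2},
    {"groupID": 39, "match_name": "Lozada, Mariana", "similarity": 0.92, "author_order": 1},
    {"groupID": 40, "match_name": "Milano, Daniela", "similarity": 0.92, "author_order": 1},
    {"groupID": 40, "match_name": None, "similarity": None, "author_order": 4},
]
filled = fill_group_matches(rows)
```

The data stays in long form, so there is no need to build a row per combination; every row in a group simply carries the match information the reviewer needs.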
I think that will have to do.
Excellent. I'll get it built into the package more formally this weekend.
Hey, can we talk this over just to make sure? I still have some doubts, but think they will be cleared up quicker over the phone. Let's circulate a common file to look over and discuss.
I've been thinking about how to simplify the presentation of the information in the file users need to review to decide if disambiguation was done correctly. The following suggestions are based on lots of personal experience reviewing tables like this: you want to simplify as much as possible and keep only high-information cells.
1) First, I suggest we delete the following columns from foo_authors and foo_authors.csv
2) Second, I suggest we rearrange to simplify comparison.
The comparison currently has names in long format:
It is much easier to compare author names and evaluate whether they are the same person if the info is side-by-side: it allows you to compare the name variants and their similarity score, then compare other information about them. Comparing is easier when cells are adjacent to each other (as opposed to above/below), and eliminating the columns with less useful info makes the table easier on the eyes. This wide format also cuts down on the number of rows.
It will take some moving around and renaming columns, then reorganizing again to merge with refine_references, but I think it should be pretty quick. What do you think?
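For discussion, here is a rough Python sketch of the side-by-side layout. It is not the package's implementation. It assumes one wide row per pair of records in a groupID; the function name `to_wide`, the `author_name` column, the `_1`/`_2` suffixes, and the name variants in the sample rows are all made up for illustration.

```python
from itertools import combinations

def to_wide(rows, compare=("author_name", "university", "postal_code")):
    """Build one wide row per pair of records sharing a groupID."""
    # Group the long-format rows by groupID, keeping input order.
    by_group = {}
    for row in rows:
        by_group.setdefault(row["groupID"], []).append(row)

    wide = []
    for group_id, members in by_group.items():
        for left, right in combinations(members, 2):
            # Use whichever record of the pair carries a similarity score.
            similarity = left.get("similarity")
            if similarity is None:
                similarity = right.get("similarity")
            wide_row = {"groupID": group_id, "similarity": similarity}
            # Put the compared columns next to each other.
            for col in compare:
                wide_row[col + "_1"] = left.get(col)
                wide_row[col + "_2"] = right.get(col)
            wide.append(wide_row)
    return wide

# Toy long-format input; the name variants are invented for this example.
rows = [
    {"groupID": 40, "author_name": "Milano, Daniela", "similarity": None,
     "university": "INIBIOMA CONICET UNCo", "postal_code": "RA-8400"},
    {"groupID": 40, "author_name": "Milano, D.", "similarity": 0.92,
     "university": None, "postal_code": None},
]
wide = to_wide(rows)
```

With two rows per groupID this yields a single wide row per group. Larger groups produce one row per pair, so the merge back with refine_references would need to account for that.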