Submission: refnet

aurielfournier commented 5 years ago

Summary

What does this package do? (explain in 50 words or less): refnet is a package to read, organize, geocode, analyze, and visualize reference data files in Clarivate Web of Knowledge/Web of Science format for scientometric, social network, and Science of Science analyses.
Paste the full DESCRIPTION file inside a code block below:

Package: refnet
Type: Package
Title: Thomson Reuters Web of Knowledge/Science and ISI Reference Data Tools
Version: 0.6
Date: 2018-08-26
Authors@R: c(person("Auriel M.V. Fournier", "Developer", role = c("aut"),
                     email = "aurielfournier@gmail.com"),
              person("Forrest R. Stevens", "Developer", role = "aut"),
              person("Matthew E. Boone", "Developer", role = "aut"),
              person("Emilio Bruna", "Developer", role=c("aut","cre"), 
              email="embruna@ufl.edu"))
Maintainer: Emilio Bruna <embruna@ufl.edu>
Description: This function reads Thomson Reuters Web of Knowledge/Science and ISI format reference data files into an R friendly data format and can optionally write the converted data to a friendly CSV format.
License: GPL-3
Imports: maptools, maps, rworldmap, RecordLinkage, Matrix, igraph, network, sna, Hmisc, ggplot2, stringi, stringr, ggmap, Rdpack, tidyr, dplyr, tibble
RoxygenNote: 6.1.0
RdMacros: Rdpack
Suggests: testthat, utils
VignetteBuilder: utils
Encoding: UTF-8

URL for the package (the development repository, not a stylized html page): https://github.com/embruna/refnet
Please indicate which category or categories from our package fit policies this package falls under and why? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):

Data extraction and munging, since it takes data from one format, and transforms it into something that is useful, and also matches up records among authors.

[Note: the link for package fit no longer leads to that page, and I couldn't find anything about package fit in the linked policies.]

Who is the target audience and what are scientific applications of this package?

Scientists interested in studying the networks of a particular author, subject area or journal.

Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

https://github.com/ropensci/onboarding/issues/247

Requirements

Confirm each of the following by checking the box. This package:

[X] does not violate the Terms of Service of any service it interacts with.
[X] has a CRAN and OSI accepted license.
[X] contains a README with instructions for installing the development version.
[X] includes documentation with examples for all functions.
[X] contains a vignette with examples of its essential functions and uses.
[X] has a test suite.
[X] has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov.
[X] I agree to abide by ROpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

[yes] Do you intend for this package to go on CRAN?
[yes] Do you wish to automatically submit to the Journal of Open Source Software? If so:
- [X] The package has an obvious research application according to JOSS's definition.
- [X] The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
- [not yet] The package is deposited in a long-term repository with the DOI:
- (Do not submit your package separately to JOSS)
[no] Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
- [ ] The package is novel and will be of interest to the broad readership of the journal.
- [ ] The manuscript describing the package is no longer than 3000 words.
- [ ] You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
- (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
- (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
- (Please do not submit your package separately to Methods in Ecology and Evolution)

Detail

[yes] Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:
[yes] Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

Heather Piwowar @hpiwowar

sckott commented 5 years ago

Thanks very much for your submission @aurielfournier - we're discussing now and will get back to you soon

maelle commented 5 years ago

Thanks for your submission @aurielfournier! I see the package doesn't have any tests and doesn't have continuous integration yet. I suggest we put the submission on hold while you sort that out, unless you can and want to add this within a week or so? There is some guidance in this guide, and I am happy to answer any questions here or via Slack!

I was also looking at the dependencies, there are many of them in DESCRIPTION and

at least one doesn't seem used at all in the package (stringi)
it seems you have written the NAMESPACE by hand? We'd prefer you to use @importFrom (or even just pkg::fun when calling the function) and @export tags in R files. I wouldn't recommend importing whole packages unless needed. See http://r-pkgs.had.co.nz/namespace.html#imports
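
For illustration, here is a minimal sketch of the two options. The helper names are made up and are not refnet functions, and stats::median simply stands in for any imported function.

```r
# Option 1: declare the import in a roxygen2 tag. roxygen2 then writes the
# matching importFrom() line into NAMESPACE when the docs are regenerated.

#' Median publication year (hypothetical helper, for illustration only)
#'
#' @param years Numeric vector of publication years.
#' @return A single numeric value.
#' @importFrom stats median
#' @export
median_year <- function(years) {
  median(years, na.rm = TRUE)
}

# Option 2: no import tag at all; qualify the call with pkg::fun instead.
median_year_qualified <- function(years) {
  stats::median(years, na.rm = TRUE)
}

median_year(c(2001, 2005, NA, 2010))
median_year_qualified(c(2001, 2005, NA, 2010))
```

Either way the NAMESPACE file is generated rather than edited by hand, and only the functions actually used are imported.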

maelle commented 5 years ago

@aurielfournier for info I've just added the holding label, please update this thread once you have had time to work on the package, and ask me any question.

aurielfournier commented 5 years ago

Thanks @maelle, my co-authors and I are working on it, but it's taken a bit longer than we expected. Appreciate your patience!

maelle commented 5 years ago

No problem, and happy to help if I/we can!

aurielfournier commented 5 years ago

@maelle package now has tests and continuous integration.

I removed stringi from the DESCRIPTION file.

I have fixed the issues from the NAMESPACE file.

Huge thanks to my coauthor @birderboone for doing the heavy lifting to get this over the finish line!

I think we are ready for review now. If you have any other things that need to be addressed let me know.

Thanks!

maelle commented 5 years ago

:wave: @aurielfournier @birderboone! Awesome, thanks to both of you! A few comments before I do the last editor checks:

Could you please add the Travis badge at the top of the README?
The Travis build is failing. To fix this (both WARNING and NOTEs)
- "Undocumented arguments in documentation object 'references_read' ‘include_all’" Update the docs
- prefix utils functions e.g. utils::flush.console()
- For some variables you might need to create a file like https://github.com/ropensci/opencage/blob/master/R/globalVariables.R to define them as global variables (the ones that appear in NOTEs), unless they come up in dplyr functions, in which case you can write e.g. mutate(df, y = .data$a + .data$x) to make the NOTE disappear; see this vignette.
- Regarding JOSS: in rtimicropem I put the paper in a paper/ folder (https://github.com/ropensci/rtimicropem/tree/master/paper) and build-ignored it with usethis::use_build_ignore("paper/") or something like that.
Once the Travis build passes please add a code coverage report. Run usethis::use_coverage("codecov") which will give you stuff to add to the Travis config file, browse codecov maybe (I can't remember) and give you the code to paste in the README to get a badge.
I see two issues about postal codes (leading zeros, Brazil code), do they need to be fixed before review?
Regarding the other issues I admire the naming scheme. Just in case you ignored that, GitHub also has a milestone feature: https://help.github.com/articles/about-milestones/
Please add a code of conduct and contributing guide (this can be extremely short) https://ropensci.github.io/dev_guide/collaboration.html#friendlyfiles No need to add the PR and issue templates yet.
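
A sketch of what such a globalVariables file could contain. The variable names are placeholders, not actual refnet columns.

```r
# R/globalVariables.R (sketch): declare names that are only used through
# non-standard evaluation, so that R CMD check stops reporting
# "no visible binding for global variable" NOTEs for them.
# "author_id" and "address" are placeholder names for this example.
declared <- utils::globalVariables(c("author_id", "address"))

# Alternative for dplyr verbs, shown as a comment because it needs dplyr
# plus an `@importFrom rlang .data` tag in the package:
# dplyr::mutate(df, y = .data$a + .data$x)
```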

aurielfournier commented 5 years ago

Hi @maelle

moved the badge

Addressed the Travis warnings/notes

I closed the two open issues. Thanks for pointing out the milestones, I had forgotten about that.

I added the CoC and contribution guides.

Thank you so much for all the links and tips on how to address these issues, it is greatly appreciated.

Build is now passing!!

maelle commented 5 years ago

Yay, green badge! Can you also add a coverage badge? Run usethis::use_coverage("codecov") which will give you stuff to add to the Travis config file, browse codecov maybe (I can't remember) and give you the code to paste in the README to get a badge.

aurielfournier commented 5 years ago

Done! Sorry I missed that.

maelle commented 5 years ago

Thank you! A few more things before I search for reviewers (then they and you have less work :wink:).

[x] Fit: The package meets criteria for fit and overlap
[x] Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
[x] License: The package has a CRAN or OSI accepted license
[x] Repository: The repository link resolves correctly
[ ] Archive (JOSS only, may be post-review): The repository DOI resolves correctly
[ ] Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

goodpractice output

✖ write short and simple functions. These
    functions have high cyclomatic complexity:authors_clean
    (68).

Maybe you can split it in several helper functions?

  ✖ omit "Date" in DESCRIPTION. It is not required
    and it gets invalid quite often. A build date will be
    added to the package when you perform `R CMD build` on it.

  ✖ add a "URL" field to DESCRIPTION. It helps users
    find information about your package online. If your
    package does not have a homepage, add an URL to GitHub, or
    the CRAN package package page.
  ✖ add a "BugReports" field to DESCRIPTION, and
    point it to a bug tracker. Many online code hosting
    services provide bug trackers for free,
    https://github.com, https://gitlab.com, etc.

Run usethis::use_github_links().

  ✖ avoid long code lines, it is bad for
    readability. Also, many people prefer editor windows that
    are about 80 characters wide. Try make your lines shorter
    than 80 characters

    R\authors_clean.R:24:1
    R\authors_clean.R:26:1
    R\authors_clean.R:29:1
    R\authors_clean.R:35:1
    R\authors_clean.R:36:1
    ... and 162 more lines

It's the complicated function, one more reason to try and simplify it?


  ✖ omit trailing semicolons from code lines. They
    are not needed and most R coding standards forbid them

    R\authors_refine.R:20:198


  ✖ avoid sapply(), it is not type safe. It might
    return a vector, or a list, depending on the input data.
    Consider using vapply() instead.

    R\authors_clean.R:97:17
    R\authors_clean.R:132:19
    R\authors_clean.R:152:19
    R\authors_clean.R:186:17
    R\authors_clean.R:223:22
    ... and 14 more lines
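
To make the sapply() point concrete, a small base R sketch. The author list is made-up data, not taken from the package.

```r
# A hypothetical list of author-name vectors; one record has no authors.
authors <- list(c("Fournier", "Stevens"), character(0), "Bruna")

# sapply() picks its return type from the input it happens to receive:
first_sapply <- sapply(authors, function(x) x[1])  # character vector here
empty_sapply <- sapply(list(), function(x) x[1])   # but a list for empty input

# vapply() always returns the declared type, or fails with an error:
first_vapply <- vapply(authors, function(x) x[1], character(1))
empty_vapply <- vapply(list(), function(x) x[1], character(1))

class(empty_sapply)  # "list"
class(empty_vapply)  # "character"
```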

  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...),
    1:NROW(...) and 1:NCOL(...) expressions. They are error
    prone and result 1:0 if the expression on the right hand
    side is zero. Use seq_len() or seq_along() instead.

    R\authors_clean.R:56:15
    R\authors_clean.R:71:79
    R\authors_clean.R:188:15
    R\authors_clean.R:209:21
    R\authors_clean.R:422:12
    ... and 9 more lines
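
The 1:length() problem only shows up with empty input, which is easy to miss. A minimal base R sketch with made-up data:

```r
records <- character(0)  # e.g. a search that returned no records

# 1:length(x) is c(1, 0) when x is empty, so the loop body runs twice
# with indices that do not exist.
bad_iterations <- 0
for (i in 1:length(records)) bad_iterations <- bad_iterations + 1

# seq_along(x) is integer(0) for empty input, so the loop is skipped.
good_iterations <- 0
for (i in seq_along(records)) good_iterations <- good_iterations + 1

bad_iterations   # 2
good_iterations  # 0
```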

  ✖ avoid 'T' and 'F', as they are just variables
    which are set to the logicals 'TRUE' and 'FALSE' by
    default, but are not reserved words and hence can be
    overwritten by the user.  Hence, one should always use
    'TRUE' and 'FALSE' for the logicals.

    R/authors_clean.R:NA:NA
    R/authors_clean.R:NA:NA
    R/authors_clean.R:NA:NA
    R/authors_clean.R:NA:NA
    R/authors_clean.R:NA:NA
    ... and 38 more lines
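
A short sketch of why 'T' and 'F' are risky. The two functions are toy examples, not package code.

```r
flag_with_T    <- function() if (T) "yes" else "no"
flag_with_TRUE <- function() if (TRUE) "yes" else "no"

T <- FALSE                      # allowed: T is just an ordinary variable
res_T    <- flag_with_T()       # now picks up the overwritten value
res_TRUE <- flag_with_TRUE()    # unaffected, TRUE is a reserved word
rm(T)                           # remove the override again

res_T     # "no"
res_TRUE  # "yes"
```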

Can you please re-trigger a Travis build so that the coverage badge indicates the coverage? We're aiming at a minimal coverage of 75%, see https://github.com/ropensci/dev_guide/pull/94/files (brand-new official guidance)
I'd recommend putting all badges on a single line.
Could you also use Appveyor CI for Windows? usethis::use_appveyor(). This will add another badge.
You can add the in-review badge

[![](https://badges.ropensci.org/256_status.svg)](https://github.com/ropensci/onboarding/issues/256)

It'll turn green when your package is approved.

Could you please add examples in the documentation of the functions?
Running devtools::spell_check() shows a few typos among the false positives: querrying, nunmbers etc.

Reviewers: @njahn82 @bmkramer Due date: 2018-12-12

aurielfournier commented 5 years ago

Hi @maelle

We are going to pause, and redo authors_clean() to be simpler/broken down into several functions. This will probably take ~ 1 week.

Thanks!

maelle commented 5 years ago

Ok, thank you!

aurielfournier commented 5 years ago

Alright! After some fighting with Travis the past 24 hours, we are good to go.

@birderboone split up authors_clean into three smaller internal functions, that should make review of the code easier. We've also addressed the other comments from @maelle

If I missed something, let me know.

Thanks!

maelle commented 5 years ago

Thanks @aurielfournier and @birderboone!

A few more things from goodpractice to tackle before I look for reviewers

It is good practice to

  ✖ add a "BugReports"
    field to DESCRIPTION, and point
    it to a bug tracker. Many
    online code hosting services
    provide bug trackers for free,
    https://github.com,
    https://gitlab.com, etc.

Simply run usethis::use_github_links()

  ✖ use '<-' for
    assignment instead of '='. '<-'
    is the standard, and R users
    and developers are used it and
    it is easier to read your code
    for them if you use '<-'.

    tests\testthat\test_authors_match.R:4:5

The styler package might help.


  ✖ avoid long code
    lines, it is bad for
    readability. Also, many people
    prefer editor windows that are
    about 80 characters wide. Try
    make your lines shorter than 80
    characters

    R\authors_address.R:12:1
    R\authors_address.R:14:1
    R\authors_address.R:17:1
    R\authors_address.R:20:1
    R\authors_address.R:41:1
    ... and 174 more lines


  ✖ avoid sapply(), it is
    not type safe. It might return
    a vector, or a list, depending
    on the input data. Consider
    using vapply() instead.

    R\plot_net_address.R:32:26
    R\plot_net_address.R:33:26
    tests\testthat\test_references_read.R:10:17

  ✖ avoid 1:length(...),
    1:nrow(...), 1:ncol(...),
    1:NROW(...) and 1:NCOL(...)
    expressions. They are error
    prone and result 1:0 if the
    expression on the right hand
    side is zero. Use seq_len() or
    seq_along() instead.

    R\authors_georef.R:55:25
    R\authors_georef.R:71:15
    R\authors_georef.R:113:17
    R\plot_net_address.R:34:35
    R\plot_net_address.R:123:22
    ... and 1 more lines

  ✖ fix this R CMD check
    NOTE: Namespaces in Imports
    field not imported from:
    'Rdpack' 'maps' 'stringr' All
    declared Imports should be
    used.

  ✖ avoid 'T' and 'F', as
    they are just variables which
    are set to the logicals 'TRUE'
    and 'FALSE' by default, but are
    not reserved words and hence
    can be overwritten by the user.
    Hence, one should always use
    'TRUE' and 'FALSE' for the
    logicals.

    R/authors_address.R:NA:NA
    R/authors_address.R:NA:NA
    R/authors_address.R:NA:NA
    R/authors_georef.R:NA:NA
    R/authors_georef.R:NA:NA
    ... and 15 more lines

And from me: could you please add a coverage badge? usethis::use_coverage() should help you with that.

Thanks in advance and thanks for all your work until now! 😸

aurielfournier commented 5 years ago

Hi @maelle

Thanks as always for your kind patience. It is greatly appreciated.

I've addressed all of the above, and I finally downloaded goodpractice for myself to check things.

The one issue that I don't totally understand, but that isn't throwing an issue in goodpractice, is this one:

fix this R CMD check NOTE: Namespaces in Imports field not imported from: 'Rdpack' 'maps' 'stringr' All declared Imports should be used.

I thought that meant that I needed to remove Rdpack, maps and stringr from the DESCRIPTION file. So I did, but then the build failed, and it did not pass unless I put Rdpack back in.

But otherwise I think we're ok. :D

maelle commented 5 years ago

Thanks a lot @njahn82 @bmkramer for agreeing to review this package! 😺 Your reviews are due on 2018-12-12.

As a reminder, our reviewer guide can be found here and the review template here.

maelle commented 5 years ago

:wave: @njahn82 @bmkramer! Friendly reminder that your reviews are due in two days, on 2018-12-12. 😺

njahn82 commented 5 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[X] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[X] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[X] Installation instructions: for the development version of package and any non-standard dependencies in README
[ ] Vignette(s) demonstrating major functionality that runs successfully locally
[X] Function Documentation: for all exported functions in R help
[ ] Examples for all exported functions in R Help that run successfully locally
[ ] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[ ] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[ ] A short summary describing the high-level functionality of the software

[ ] Authors: A list of authors with their affiliations

[ ] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

[X] Installation: Installation succeeds as documented.
[ ] Functionality: Any functional claims of the software been confirmed.
[X] Performance: Any performance claims of the software been confirmed.
[ ] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
[ ] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 6 hours

Review Comments

This is a specific package used for manipulating and analyzing authorship data from the Web of Science (WoS), a large toll-access literature and citation database indexing articles from around 12,000 academic journals. The package imports local files that need to be manually downloaded from the database. This is quite a common workflow when re-using WoS data, because API access is very costly and limited.

I was very excited to see that refnet addresses the problem of author disambiguation and affiliation extraction using WoS data. As a data analyst for scholarly communication at a research library, I sometimes create co-authorship networks. For this task, I often use Web of Science data. It is very laborious to parse the different text strings representing authors and institutions and to disambiguate them. I especially like that refnet supports a workflow where automatic and manual cleaning steps are supported.

Unfortunately, I had a hard time getting started with the package, because it took me a while to find information about which WoS data export format was needed, and how to load the data into R using the package.

After downloading the data, my first attempts loading the file into R failed:

library(refnet)
my_data <- references_read(data = "wos_ropensci.txt")

## Error in references_read(data = "wos_ropensci.txt"): ERROR:  The specified file or directory does not contain any 
##          Web of Knowledge or ISI Export Format records!

It took me a while (and many manual downloads from the WoS) to realize that the dir parameter needs to be set to FALSE when I want to load just one file.

my_data <- references_read(data = "wos_ropensci.txt", dir = FALSE)
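
One way to sidestep this is to derive the flag from the path itself. The sketch below is only an illustration: guess_dir_flag() is a hypothetical helper that does not exist in refnet, and the import is skipped unless the package and the file are both available.

```r
# Hypothetical helper (not part of refnet): TRUE only for existing directories.
guess_dir_flag <- function(path) {
  dir.exists(path)
}

path <- "wos_ropensci.txt"

# Only attempt the import when refnet is installed and the file is present.
if (requireNamespace("refnet", quietly = TRUE) && file.exists(path)) {
  my_data <- refnet::references_read(
    data = path,
    dir = guess_dir_flag(path)  # FALSE for a single file
  )
}
```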

I feel that the average R user is not as patient when appropriate starting instructions are missing. My main request as reviewer would be therefore to improve high-level documentation, as well as to provide a sample dataset to play with.

I suggest expanding the README and presenting an overview and some details in a refnet-package.Rd file, which is currently missing, so that users can type ?refnet-package for help.

Here are some other observations and suggestions that might be helpful for improving the package.

Runnable documentation

Although the long-form documentation nicely explains the motivation and the workflow, it seems that the vignette does not process code chunks with functions from the package. I would suggest adding executable examples to demonstrate to users what can be done with the package. It would also be helpful to include an Rmarkdown file used to generate README.md with at least one runnable example.

Installation and Building

Installed easily, but it does not pass a full R CMD check with --as-cran. There were two Errors and two Notes:

Two Errors

Conflicting package names (submitted: refnet, existing: RefNet [https://bioconductor.org/packages/3.7/bioc])

https://www.bioconductor.org/packages/release/bioc/html/RefNet.html

Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
    |                                                                      |   0%
    |                                                                            
    |======================================================================| 100%[1] "Now processing all references files"
  [1] "Now processing all references files"

    |                                                                            
    |                                                                      |   0%
    |                                                                            
    |======================================================================| 100%══ testthat results  ═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
  OK: 66 SKIPPED: 0 FAILED: 2
  1. Failure: Net plots work (@test_plots.R#95) 
  2. Failure: Net plots work (@test_plots.R#96)

Two Notes

* checking DESCRIPTION meta-information ... NOTE
Maintainer field differs from that derived from Authors@R
  Maintainer: ‘Emilio Bruna <embruna@ufl.edu>’
  Authors@R:  ‘Emilio Bruna Developer <embruna@ufl.edu>’

* checking top-level files ... NOTE
Non-standard file/directory found at top level:
  ‘missing_addresses.csv’

As described in the check logs, there is a conflicting Bioconductor package with the same name: RefNet. To comply with rOpenSci and CRAN, a new package name is needed. A name that also includes the database name could help users who want to work with Web of Science data files to discover the package.

The package has a test suite for the main functions, which succeeds in RStudio, but not when checking the package bundle (see the R CMD check output).

It seems that the Authors@R documentation in Writing R Extensions is misleading, because in this package "Developer" is used as the family name in the author names field as well, which probably explains why the author and maintainer fields differ. Family names need to be used instead of "Developer" in the Authors@R vector.
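
As a sketch of the difference, using the maintainer entry from the DESCRIPTION above (the split into given and family name is my assumption about how the name should be entered):

```r
# As submitted: the second positional argument is `family`, so "Developer"
# ends up in the family-name slot.
as_submitted <- utils::person("Emilio Bruna", "Developer",
                              role = c("aut", "cre"),
                              email = "embruna@ufl.edu")

# Suggested form: explicit given and family names.
corrected <- utils::person(given = "Emilio", family = "Bruna",
                           role = c("aut", "cre"),
                           email = "embruna@ufl.edu")

as_submitted$family  # "Developer"
corrected$family     # "Bruna"

# Compare how each entry is rendered as "Name <email>":
format(as_submitted, include = c("given", "family", "email"))
format(corrected, include = c("given", "family", "email"))
```

With explicit family names, the name derived from Authors@R should match the Maintainer field, so the NOTE should go away.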

missing_addresses.csv needs to be added to .Rbuildignore, or removed if it is not needed.

In the Documentation, the brief "About" refers to Thomson Reuters as company behind the Web of Science. Ownership changed recently to Clarivate Analytics.

Tests

The package uses automatic testing, which is great. Tests could be expanded to cover more functionalities. For instance, authors_georef() does not check geo-coding using Google Maps.

While testing data export functionalities, files are written into the testthat folder. I would suggest avoiding this behaviour by using unlink() after the tests. Here's an example of how to use unlink() from the rio package.
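
A minimal base R sketch of the clean-up pattern. write_authors() is a made-up stand-in for an export function, not refnet code.

```r
# Hypothetical export helper standing in for a function that writes a CSV.
write_authors <- function(df, path) {
  utils::write.csv(df, path, row.names = FALSE)
  invisible(path)
}

check_export <- function() {
  out <- tempfile(fileext = ".csv")   # write outside tests/testthat
  on.exit(unlink(out), add = TRUE)    # clean up even if a check fails
  write_authors(data.frame(author = "Bruna, E."), out)
  stopifnot(file.exists(out))
  out
}

leftover <- check_export()
file.exists(leftover)  # FALSE: the file is removed once the check returns
```

Writing to tempfile() keeps the test folder clean, and on.exit() guarantees the unlink() call runs whether or not the checks pass.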

Functions

Main functions have many lines, which makes it very hard to follow what is going on. It would be great, if these functions could be split into smaller units.

references_read() seems to contain a lot of repeated code to import data as data.frame. I wonder if the WoS csv export file format could be used instead of the Plain Text format? When the data is rectangular, the readr package has great functionalities to strip out whitespace, which takes much room in the function, and to define colClasses while loading files into R.

When importing data with references_read(), values in many columns end with a line break \n.
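
Until this is fixed inside the function, the trailing line breaks can be stripped after import. A sketch on a toy data frame (the values are made up; AU and PY are WoS field tags):

```r
# Toy data frame mimicking imported fields that end in "\n".
refs <- data.frame(
  AU = c("Fournier, AMV\n", "Bruna, EM\n"),
  PY = c(2018, 2017),
  stringsAsFactors = FALSE
)

# Trim leading/trailing whitespace (including "\n") in character columns only.
is_chr <- vapply(refs, is.character, logical(1))
refs[is_chr] <- lapply(refs[is_chr], trimws)

refs$AU
```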

Some console messages are invoked by using the print() method (see https://github.com/embruna/refnet/search?l=R&q=print+%2A.R). To enable user-friendly suppression, message() and warning() can be used instead.
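
A small sketch of the difference, reusing the progress text from the check output above. The function itself is a toy example.

```r
read_demo <- function() {
  # message() signals a condition, so callers can silence or capture it;
  # output from print() cannot be suppressed this way.
  message("Now processing all references files")
  invisible(TRUE)
}

read_demo()                    # shows the message
suppressMessages(read_demo())  # silent
```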

There are various issues when checking the code syntax with lintr::lint() that need to be addressed.

Documentation of functions can be improved by making more use of roxygen2 tags. Not all functions have examples. Internal functions should be tagged with @noRd to avoid that they are added to the manual.

Use of functions from other packages

The use of functions from other packages could be made more explicit to the users. In many cases, it is not possible to interact with them.

authors_georef(), for example, uses ggmap::geocode to retrieve geo-coding information. Keyless access to Google Maps Platform was deprecated a couple of weeks ago, however. Information about how to pass API keys to the function to make geocode work would be very helpful.
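
A sketch of what such instructions could look like, assuming a recent ggmap version that provides register_google(). The environment variable name is my own choice, and how authors_georef() would pick up the key is not something I have checked.

```r
# The environment variable name is an assumption for this sketch.
google_key <- Sys.getenv("GOOGLE_MAPS_API_KEY", unset = "")

if (nzchar(google_key) && requireNamespace("ggmap", quietly = TRUE)) {
  ggmap::register_google(key = google_key)  # register once per session
  # ggmap::geocode("Gainesville, FL, USA") would then use the registered key
} else {
  message("No Google Maps API key set (or ggmap not installed); skipping geocoding.")
}
```

Reading the key from an environment variable keeps it out of scripts and version control.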

Functions used to visualize the networks make use of ggplot2. It would be great to interact with its functionalities when calling the refnet functions.

To improve documentation of external functions, helpful roxygen2 tags like @importFrom and @inheritParams should be considered.

Maintainability

Overall, it seems that the package has quite a history, and I welcome updating it. However, because of the ambiguity of author names and addresses in general, and the complicated WoS data format in particular, I wonder if more focus would improve the maintainability of the package.

One strategy could be the usage of tidyverse packages and functions. At least, they would help to dry out code for loading the data and string manipulation in a tidy way. Of course, the package would have to start with importing rectangular data and not the field tags format, which is now used.

Another would be the focus on parsing and transforming the authorship data including affiliations stored in the C1 field. Developing functions used to visualize the networks, however, could be discontinued in favor of long-form documentations, and in favor of data formats supported by Social Network Analysis packages and software.

I think that's it from me! Happy to help further with the process!

maelle commented 5 years ago

Thanks a lot for your review @njahn82! 😺

bmkramer commented 5 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

[x] As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

[x] A statement of need clearly stating problems the software is designed to solve and its target audience in README
[x] Installation instructions: for the development version of package and any non-standard dependencies in README
[x] Vignette(s) demonstrating major functionality that runs successfully locally
[x] Function Documentation: for all exported functions in R help
[x] Examples for all exported functions in R Help that run successfully locally
[x] Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

For packages co-submitting to JOSS

[x] The package has an obvious research application according to JOSS's definition

The package contains a paper.md matching JOSS's requirements with:

[x] A short summary describing the high-level functionality of the software

[x] Authors: A list of authors with their affiliations

[x] A statement of need clearly stating problems the software is designed to solve and its target audience.

[ ] References: with DOIs for all those that have one (e.g. papers, datasets, software).

Comments: DOIs missing from a number of references; 'Larivière' spelled as 'Lariviare' on two occasions; figure not displaying correctly

Functionality

[x] Installation: Installation succeeds as documented.
[ ] Functionality: Any functional claims of the software been confirmed. See comments listed below
[ ] Performance: Any performance claims of the software been confirmed. See comments listed below
[ ] Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine. Comments: did not run automated tests
[ ] Packaging guidelines: The package conforms to the rOpenSci packaging guidelines Comments: do not feel qualified to assess

Final approval (post-review)

[ ] The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 6 hours

Review Comments

General:

I really like the ability of this package to extract author and address information from WoS records. The main functionalities of the package (importing results, author name parsing and disambiguation, and georeferencing) worked reasonably well for me. The visualization functions I could not get to work.

Information on functionality provided in vignette is detailed and complete; it would be good though to duplicate some of this (e.g. the examples) in Readme.md to get people started on how to use the package.

I have focused my review on functionality of the package. Comments below are based on performing all tasks as described in the vignette, with a custom test dataset downloaded from Web of Science (first 500 articles from PeerJ in 2018)

1 Introduction

In describing the package, it is mentioned that the processed data-sets can be exported in tidy formats for more in-depth analyses with other packages. It is not mentioned which export format is useful for other packages. Perhaps this is self-explanatory (the csv-outputs provided), but it would be good to specify.

2.0 Using Refnet

Could not find test data 'example_data.txt'
refnet_fig1.jpg in Appendix 1 is very low res
For WoS export: in the text it is mentioned that both .txt and .ciw formats can be processed, but the worked example in Appendix 1 only shows how to export as .txt. It would be good to harmonize these, and perhaps explain in the text that .ciw is the format for Endnote export
In my tests, export does not need to be via Marked list, but also works via download menu in search results, either as Endnote export (.ciw) or as 'Other file formats', with 'full record' and 'plain text' selected. This works much faster than via Marked lists.

2.1 Importing Search Results

In references_read(), what is the default for dir=T/F? args(references_read) says it's TRUE
Comments on vignette text:
-- typo in example a): .txr
-- c) is not a separate example
-- the remark in Appendix 2 on fields only included when all.fields=T is passed to references_read() should be included in the main text describing references_read()
Testing all.fields=T results in an error: unused argument (all.fields=T). args(references_read) reveals the argument should be include_all=FALSE
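The mismatch between documented and actual argument names can be checked programmatically. The following is a minimal sketch in base R; `read_refs_sketch` is a hypothetical stand-in that only mirrors the two defaults reported in this review, not the package's own `references_read()`.

```r
# Hypothetical stand-in mirroring the defaults reported in this review;
# it is NOT the package's references_read().
read_refs_sketch <- function(data, dir = TRUE, include_all = FALSE) {
  invisible(NULL)
}

# formals() returns the argument list together with the default values
arg_defaults <- formals(read_refs_sketch)
arg_names <- names(arg_defaults)

# Check whether the argument name used in the documentation is accepted
documented_arg <- "all.fields"
arg_is_valid <- documented_arg %in% arg_names

print(arg_names)
print(arg_is_valid)
```

Running the same check against the real function would show quickly whether a vignette example still matches the current signature.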

2.2 Author address parsing and name disambiguation

Function authors_clean(): no csv file saved. Function contains argument write_out_data = FALSE. Tried TRUE => 2 files saved (authors_review, authors_prelim)
Function also contains argument sim_score (value 0.88) - this is not explained in the documentation (it is mentioned for authors_refine where it has a NULL value)
In documentation under 2.2.2:
-- reference is made to Appendix 2, this should be Appendix 3.
-- it is stated: 'Users that prefer to manually review the results of the disambiguation can do so with the “authors” object and .csv files' -> unclear which of the 2 csv files (prelim or review) should be used (I assume review, from the documentation of the next step). Also: it is unclear what the 'authors' object refers to
"Corrections made to the “review” file are merged into the “preview” file" -> should be "prelim"
In Appendix 3:
-- the explanation of author name disambiguation is informative and useful. It does have some spelling and style issues, not critical, but it could do with a careful edit
-- kudos for encouraging authors to sign up for (and use!) ORCID!
-- not covered: sim_score
-- layout of table 2 (2 columns) is mangled
-- in table 2, similarity is listed as NA, but it has a value in my test data

2.3 Georeferencing author institutions

Example for authors_georef is incomplete example_georef <-authors_georef(------,------,-----)
Explanation of arguments is incomplete: function (data, address_column = "address", filename_root = "", write_out_missing = TRUE, retry_limit = 10)

-> address_column has value -> retry limit not discussed

Executing authors_georef() resulted in lots of error messages, with a loop at the end and a message that a number of geocoding queries are remaining -> if this is expected behaviour, it would be good to address this in the documentation
In documentation: it is not clear which geocoding application is used when (sequentially?): http://www.datasciencetoolkit.org/ and/or https://developers.google.com/maps/documentation/.
In documentation, it is stated: 'an output/file of references that refnet was unable to georeference, which the user can review, manually correct, and import back into the file of georeferenced author locations' -> the file seems to contain all lines (with and without lat/long resolved) -> unclear how 'import back into file' should be performed

2.4. Data Visualization: Productivity and Collaboration

I could not get these to work, see error messages (and some analysis on them) below.

Error in plot_addresses_country(PeerJ_2018_2_georef, filename_root = "./output/PeerJ_2018_2") : unused argument (filename_root = "./output/PeerJ_2018_2")
args(plot_addresses_country) function (data, mapRegion = "world") -> so no argument for filename_root as in documentation
Error in rworldmap::joinCountryData2Map(country_name_table, joinCode = "NAME", : your chosen nameJoinColumn :'country_name' seems not to exist in your data, columns = Freq
plot_addresses_points: also no argument for filename_root as in documentation
Example from plot_net_coauthor() is incorrect: plot_addresses_points <- plot_addresses_points(data, filename_root="./output/example")

plot_net_coauthor_2 <- plot_net_coauthor(PeerJ_2018_2_georef) Error in data[!is.na(data$country), ] : incorrect number of dimensions

plot_net_country_2 <- plot_net_country(PeerJ_2018_2_georef) Error in data[!is.na(data$country), ] : incorrect number of dimensions

args(plot_net_country) function (data, line_resolution = 10, mapRegion = "world") -> so no argument for filename_root as in documentation
Error in plot_net_addresses(PeerJ_2018_2_georef) : could not find function "plot_net_addresses"

bmkramer commented 5 years ago

With apologies for the late review!

maelle commented 5 years ago

Thanks a lot for your review @bmkramer! :smile_cat:

Regarding "The package conforms to the rOpenSci packaging guidelines", the question is whether you see any discrepancy between https://ropensci.github.io/dev_guide/building.html and the package. If you have time, you are qualified to assess this, and you can ask me any questions.

Was time or another problem the reason for not running tests? Happy to help if needed (well I can't help with time :smile: ).

Thanks a lot for your feedback in any case!

maelle commented 5 years ago

@aurielfournier @birderboone now both reviews are in! :tada:

aurielfournier commented 5 years ago

Just a note to all involved that @birderboone and I are working on the edits (huge thanks to the reviewers!), we're just a bit slowed down by other things at the moment, but we should have everything addressed by February 5th. Thanks for your patience!

aurielfournier commented 5 years ago

Thanks to the reviewers (@bmkramer, @njahn82) for providing several useful links to resources that made addressing their comments much easier! Your comments were very helpful and constructive and the package is much better off for them, we really appreciate your time!

First, if you look at the repo, you will see that the build is failing. This is because all of this happened, and basically if we were able to get travis to use the github version of ggmap, everything would be fine, but until those changes are on CRAN, the travis build will fail. Since our response was due today, and other than this issue we're ready for you all to look at it again, I'm tossing the ball over to your side of the court. If you would like to wait until ggmap on CRAN is updated and the build passes, that is fine by us.

Below is our response broadly to the reviewer comments, if you would prefer a comment by comment response, let me know and I'm happy to do that. Thanks!

Auriel, Matt and Emilio

~~

We are choosing at this time not to split up our functions any more than they already are. We have split up the original functions into smaller pieces two times already.

While we appreciate the suggestion to use tidyverse functions, and we do use them in many other contexts, changes in the tidyverse packages are not always backwards compatible, so we have chosen to avoid them in many cases to keep this package from breaking in the future.

We have changed the name of the package to refsplitr to avoid the conflict on CRAN

We have changed references_read to have dir default to FALSE, to help alleviate the issues the reviewer had

We removed all the csv writing outputs for functions where that was not needed as part of the author refining process.

All typos and other small changes have been made, thank you to both reviewers for catching them

We have removed any need for the google API from the package, since between when we submitted and now it can no longer be used for free.

We have fixed the plotting functions to the best of our ability; some errors we were unable to replicate. If the reviewers encounter them again, can they share their input file so we can better diagnose the issue?

The reviewer is correct that ciw formats can also be processed, and we have revised the text of the vignette to reflect that. (We were initially reluctant to mention this because we wanted to avoid having users download files in proprietary formats, but ciw files can be opened with a text editor.) We have also edited the Appendix showing how to download search results to include direct download of ciw files from the search results without going to the Marked List.

Reviewer comment : In my tests, export does not need to be via Marked list, but also works via download menu in search results, either as Endnote export (.ciw) or as 'Other file formats', with 'full record' and 'plain text' selected. This works much faster than via Marked lists.

Response: This is indeed faster because it eliminates several steps. However, this will download all records resulting from a search, including any that were incorrectly returned (e.g., those by an author with an identical name). If users wish to filter results prior to download to avoid including unwanted publications, then the best approach is to save only the desired records to the Marked List and download from there as either a .ciw or .txt. Appendix 1 has been amended to include this option.

maelle commented 5 years ago

:wave: @aurielfournier @birderboone! Thanks for your answer.

if we were able to get travis to use the github version of ggmap

You can do that! See https://docs.travis-ci.com/user/languages/r/#remote-package :-)

@njahn82 @bmkramer thanks again for your reviews. Are you happy with the authors' response above?

aurielfournier commented 5 years ago

Thanks for the link @maelle

I've added in the needed argument to the travis yml file, and the build still isn't working, though all the tests pass on my machine when I use the github version of ggmap, though it did take restarting everything to make that happen

So I'm not sure what is going on. :/

aurielfournier commented 5 years ago

Ok! I worked with some of the awesome ladies over in R-Ladies today, and we figured out the issue.

It's actually an issue with ggmap. Jenny Bryan opened up an issue in their repo about it.

The solution that Jenny suggested was adding options(ggmap = list(display_api_key = FALSE)) at the top of authors_georef.R and now the build is passing. 🎉

maelle commented 5 years ago

Awesome, well done you and Jenny!

jennybc commented 5 years ago

If you're going to set the option in this way, which seems reasonable for a semi-temporary workaround, you should technically be a bit more careful to put things back the way you found them. You only want your value of FALSE to hold for the duration of this function's execution.

At the place where you set the option, you could capture the existing value and immediately use on.exit() to schedule its restoration. Or you could use withr::local_options() to accomplish both at once, with the downside that you'd need to Import withr.
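The first approach can be sketched in base R as follows. `georef_sketch` is a hypothetical stand-in for the function that sets the option, not the package's `authors_georef()`; only the `options()` call itself is taken from the comment above.

```r
# Hypothetical stand-in for a function that needs the option set temporarily.
georef_sketch <- function() {
  # options() returns the previous values of the options it sets
  old_opts <- options(ggmap = list(display_api_key = FALSE))
  # Schedule restoration for when the function exits, even if it errors
  on.exit(options(old_opts), add = TRUE)

  # ... geocoding work would go here ...
  getOption("ggmap")$display_api_key
}

before <- getOption("ggmap")
inside <- georef_sketch()
after <- getOption("ggmap")
```

Here `inside` is `FALSE` while the function runs, and `after` matches `before`, so the user's session is left as it was found. With `withr::local_options()` the capture and the restoration would collapse into a single call, at the cost of the extra import.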

dkahle commented 5 years ago

That problem should be fixed from ggmap's side now (with https://github.com/dkahle/ggmap/commit/0c68d5c); let me know if that doesn't do it. Sorry for the problem!

maelle commented 5 years ago

@njahn82 @bmkramer thanks again for your reviews. Are you happy with the authors' response above?

njahn82 commented 5 years ago

Sorry for my late reply. First of all thank you for your kind words and the changes you made. I am particularly impressed about your engagement with the R community to improve your work.

Before addressing the changes made, I wonder if I missed that runnable R code chunks were added to the README or vignette. As far as I can see, the vignette does not execute functions from the package, and there is no README.Rmd file. I am afraid it is a formal requirement from rOpenSci that the vignette demonstrates that major functionality from the package runs successfully. As a user, I often look for such runnable examples before getting started with a package.

Can you point me to such document?

aurielfournier commented 5 years ago

Hi @njahn82 .

Perhaps I am misunderstanding the question, but refsplitr/vignettes/refsplitr-vignette.Rmd contains chunks of code that can be run by the user which execute each function from the package. Which is a change we made in this last revision.

for example: line 66

example_refs <- references_read(data = system.file("extdata", "example_data.txt", package = "refsplitr"),
                                    dir = FALSE)

Is this not what you meant?

I also just updated the ReadMe file to have the same example shown in the vignette.

njahn82 commented 5 years ago

Sorry for the confusion. I thought of R code chunks indicated by curly brackets (```{r}) that are evaluated when an R Markdown document is rendered. The resulting output file shows the R output. In the vignette, it seems that package functions are only syntax-highlighted (```r). When rendered, no R output is presented, only screenshots from spreadsheet software. Example: https://github.com/embruna/refsplitr/blob/16e7308fe75044e53848ab3bbecb80abb3cb7264/vignettes/refsplitr-vignette.Rmd#L99-L110

It would be great to have some reproducible examples for the package's main functionalities.

aurielfournier commented 5 years ago

Agreed! We'll get right on it. Ggmap is giving us some issues again, but once we get those resolved we'll make those edits to the vignette and report back.

Thanks for clarifying!

aurielfournier commented 5 years ago

Alright. We resolved the issue with ggmap. A vignette with rendered R output is now in the repo! Let me know if you have any other comments!

maelle commented 5 years ago

:wave: @bmkramer @njahn82, are you both happy with the authors' response?

njahn82 commented 5 years ago

Unfortunately, I feel that some improvement is still needed.

It's great to have reproducible examples now in the vignettes. Sadly, I did not succeed building the vignette while installing the package.

So I used the rendered refsplitr-vignette.html file instead: When describing plot_net_country() and plot_net_address(), it would be better to call the $plot element directly to avoid that the other list elements are printed out. It would be great to have an example how users can generate and customize their own plots using the other outputs provided by these functions.
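As a rough illustration of this suggestion, the sketch below uses a hypothetical function, `plot_net_sketch`, that returns a list containing a ggplot object. The element name `data` and the made-up coordinates are assumptions for illustration; the actual list elements returned by the package's plotting functions may differ.

```r
library(ggplot2)

# Hypothetical stand-in for a plotting function that returns a list;
# element names other than `plot` are assumptions for illustration.
plot_net_sketch <- function(df) {
  p <- ggplot(df, aes(x = lon, y = lat)) +
    geom_point()
  list(plot = p, data = df)
}

# Made-up coordinates for illustration only
pts <- data.frame(lon = c(-82, 5, 116), lat = c(29, 52, 40))
result <- plot_net_sketch(pts)

# Extract only the plot; printing `result` itself would also print the data
net_plot <- result$plot

# The extracted object is an ordinary ggplot, so layers and themes can be added
custom_plot <- net_plot + theme_minimal() + labs(title = "Author locations")
```

Printing `net_plot` or `custom_plot` draws only the figure, which keeps the rendered vignette free of the other list elements.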

README.Rmd needs to be added to the .Rbuildignore file to make the package more CRAN compatible. If it is intended to submit the package to CRAN, dependencies listed in Remotes must be available via CRAN. Otherwise, there will be a warning when running R CMD check.

As noted, ownership change of the Web of Science needs to be addressed; since 2016 the Web of Science has been provided by Clarivate Analytics, and not Thomson Reuters.

I also noted that functions could be more thoroughly documented. All functions lack @examples tags followed by example R code on how to use the function. See also https://ropensci.github.io/dev_guide/building.html#examples

Source code should adhere to a code style, especially regarding spacing, to improve its readability: https://ropensci.github.io/dev_guide/building.html#code-style . Tools such as goodpractice::gp() and lintr::lint() help check for good coding style.

Regarding the use of tidyverse, I am fine with not using it. However, as this package already makes heavy use of external packages including those from the tidyverse, I thought that it would make the programming of the package more coherent.

Lastly, while playing around with the plotting function plot_net_address(), I wondered if you want to support transparent edges by default. Then, overlapping edges would become more visible. Here's an example:

[Two images: the default plot, and the same plot with alpha transparency set to 0.1 (transparent_edges)]

I also realized that ggplot2::aes_string is used, which is soft-deprecated. It is recommended to use tidy evaluation idioms instead. Would it be possible to update the ggplot2 functions accordingly?
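Both points, string column names without `aes_string()` and semi-transparent edges, can be sketched together. This is a minimal sketch, not the package's implementation: `plot_edges_sketch`, its arguments, and the edge list are made up for illustration. It relies on the `.data` pronoun, which ggplot2 supports inside `aes()`.

```r
library(ggplot2)

# Made-up edge list for illustration only
edges <- data.frame(
  lon_from = c(-82, 5, 116), lat_from = c(29, 52, 40),
  lon_to = c(5, 116, -82), lat_to = c(52, 40, 29)
)

# Hypothetical helper: column names arrive as strings, as with aes_string(),
# but are looked up through the .data pronoun instead.
plot_edges_sketch <- function(df, x, y, xend, yend, alpha = 0.1) {
  ggplot(df) +
    geom_segment(
      aes(
        x = .data[[x]], y = .data[[y]],
        xend = .data[[xend]], yend = .data[[yend]]
      ),
      alpha = alpha  # fixed transparency so overlapping edges stay visible
    )
}

p <- plot_edges_sketch(edges, "lon_from", "lat_from", "lon_to", "lat_to")
```

Exposing `alpha` as an argument lets users choose between opaque and transparent edges without editing the function.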

maelle commented 5 years ago

Thanks @njahn82 for these useful review comments! @aurielfournier @birderboone could you please address those?

aurielfournier commented 5 years ago

Hi, Thanks @njahn82 for the comments. We'll get them addressed, it may be a bit delayed though as I'm on day 1 of two straight weeks of all day courses, but hopefully by the end of the month.

aurielfournier commented 5 years ago

Hi All, Matt and I are working on this, but it's likely going to be mid May before we have everything pulled together. We apologize for the delay, and thank you for your patience, we're both doing this outside of our day jobs.

maelle commented 5 years ago

:wave: @aurielfournier & @birderboone! Thanks for the update, I understand.

maelle commented 5 years ago

:wave: @aurielfournier & @birderboone! Mid-May is now, any update? :wink:

aurielfournier commented 5 years ago

Hi @maelle :D we (and by we I mean mostly @birderboone ) are working on it! It's close to being done, we should have stuff for you all by the end of the month. Thanks for your patience!

birderboone commented 5 years ago

Hello, So the package should be ready.

I have gone over it with lintr some more. I have not taken out upper case letters in variable names, which lintr flags frequently. I looked and I wasn't quite sure if this is mentioned in the ropensci/hadley style guide? I might not have looked hard enough. I didn't want to go down this path unless absolutely required, since it would likely take a bit of time to make sure I found all instances of variable renames. So let me know if it's essential.
I added @examples to every exported function, and included 2 data sets that are available to the users and are used in the examples.
I did add an alpha argument to the plot_net_country function because it did make it look much better. In the future I'll probably have to think of ways to make the wrapped ggplot functions available for extra tweaking, as I'm sure people will have further style suggestions as the package gets used more.
Additionally, the majority of the time required to finish these edits was due to changing the internal structure of the package based on the other co-authors' suggestions. So I apologize for the time delay. The outputs may have changed slightly due to these changes; however, the documentation should have been updated to reflect them.

Thank you for your patience

Matt

maelle commented 5 years ago

Thanks @birderboone!

Regarding naming, the most important thing is to be consistent within the package, to make it easier for e.g. new contributors to pick things up.

@njahn82 does the response by the authors above address your concerns? Thanks in advance!

njahn82 commented 5 years ago

Thank you! I lack time to look into the changes to the internal structure of the package. However, there are still some issues with the vignette. To speed things up, I sent a pull request, which addresses the following:

YAML header uses standard vignette header
refsplitr is now imported from source, not from a local directory ("/home/matt/r_programs/refsplitr")
Fixed R Markdown syntax for headings
Avoid loading pre-compiled images, using the output from the functions described in the vignette instead.
fix encoding of a French author name in bibliography
fix inline citation
fix some typos

In the vignette, there is a non-runnable code chunk, which fails when executed: https://github.com/embruna/refsplitr/blob/master/vignettes/refsplitr-vignette.Rmd#L192

Is it possible to support transparency in plot_net_address() as well?

I saw that aes_string() was changed to aes_. Unfortunately, this function will be soft-deprecated in the near future as well. It is recommended to use tidy evaluation idioms instead.

There are three warnings and two notes when the package is built using travis.

I do not want to be picky, but feel that fixing these issues will help to build trust in the functionalities of this useful package.

maelle commented 5 years ago

thanks a lot @njahn82! :smile_cat:

ropensci / software-review

Submission: refnet #256
