ropensci / software-review

rOpenSci Software Peer Review.

JSTORr package #86

Closed benmarwick closed 6 years ago

benmarwick commented 7 years ago

Summary

The aim of this package is to provide some simple functions in R to explore changes in word frequencies over time in a specific journal archive. It is designed to solve the problem of finding patterns and trends in the unstructured text content of a large number of scholarly journal articles from the JSTOR archive.

Package: JSTORr
Type: Package
Title: Simple text mining and document clustering of JSTOR journal articles
Version: 1.0.20161214
Date: 2016-01-03
Author: Ben Marwick
Maintainer: Ben Marwick <bmarwick@gmail.com>
Description: Simple exploratory text mining and document clustering of journal
    articles from JSTOR's Data for Research service. Go to
    http://dfr.jstor.org/, make a request for data (specify CSV as output
    format and Word Counts as data type), then once you get a zip file, unzip
    it and start with one of the unpack functions and then you're ready to go
    with any of the other functions. For more details on installation and
    usage, see https://github.com/benmarwick/JSTORr
License: MIT + file LICENSE
Imports:
    ggplot2,
    reshape2,
    plyr,
    stringr,
    tm,
    openNLP,
    NLP,
    lda,
    XML,
    cluster,
    apcluster,
    ggdendro,
    FactoMineR,
    gridExtra,
    grid,
    data.table,
    snowfall,
    slam,
    digest,
    igraph,
    snow,
    devtools
Suggests: testthat
LazyData: true

Requirements

Confirm each of the following by checking the box. This package:

Publication options

Detail

@brooksambrose has used the pkg and submitted a very nice PR recently. @juliasilge has recently transformed text mining in R with her pkg with @dgrtwo. @noamross saw this pkg submitted to JOSS and recommended I submit it here first.

sckott commented 7 years ago

Editor checks:


Editor comments

Currently seeking reviewers. It's a good fit and not overlapping.

It is good practice to

  ✖ write unit tests for all functions, and all package code
    in general. At least some lines are not covered in this package.

    R/JSTOR_1bigram.R:16:NA
    R/JSTOR_1bigram.R:17:NA
    R/JSTOR_1bigram.R:22:NA
    R/JSTOR_1bigram.R:23:NA
    R/JSTOR_1bigram.R:25:NA
    ... and 1154 more lines

  ✖ omit "Date" in DESCRIPTION. It is not required and it
    gets invalid quite often. A build date will be added to the package
    when you perform `R CMD build` on it.
  ✖ add a "URL" field to DESCRIPTION. It helps users find
    information about your package online. If your package does not
    have a homepage, add a URL to GitHub, or the CRAN package
    page.
  ✖ add a "BugReports" field to DESCRIPTION, and point it to
    a bug tracker. Many online code hosting services provide bug
    trackers for free, https://github.com, https://gitlab.com, etc.
  ✖ use '<-' for assignment instead of '='. '<-' is the
    standard, R users and developers are used to it, and it is
    easier for them to read your code if you use '<-'.

    R/JSTOR_clusterbywords.R:200:12
    R/JSTOR_unpack_multiple_archives.R:77:5
    R/JSTOR_unpack_multiple_archives.R:78:5
    R/JSTOR_unpack_multiple_archives.R:79:5
    R/JSTOR_unpack1grams.R:76:7
    ... and 5 more lines
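A minimal illustration of the two forms (generic values, not code from the package):

```r
x = 10                 # works, but '=' at the top level is non-standard style
x <- 10                # preferred: '<-' for assignment
mean(x = c(1, 2, 3))   # keep '=' for naming function arguments
```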

  ✖ avoid long code lines; they are bad for readability. Also,
    many people prefer editor windows that are about 80 characters
    wide. Try to make your lines shorter than 80 characters.

    R/JSTOR_1bigram.R:21:1
    R/JSTOR_1bigram.R:31:1
    R/JSTOR_1bigram.R:34:1
    R/JSTOR_1bigram.R:41:1
    R/JSTOR_1bigram.R:50:1
    ... and 271 more lines

  ✖ avoid calling setwd(), it changes the global environment.
    If you need it, consider using on.exit() to restore the working
    directory.

    R/JSTOR_MALLET.R:21:3
    R/JSTOR_MALLET.R:27:20
    R/JSTOR_MALLET.R:49:1
    R/JSTOR_unpack1grams.R:30:3
    R/JSTOR_unpack1grams.R:118:3
    ... and 2 more lines
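The on.exit() idiom the check recommends can be sketched as follows (a hypothetical helper, not code from JSTORr):

```r
list_files_in <- function(path) {
  old_wd <- getwd()        # remember the caller's working directory
  on.exit(setwd(old_wd))   # restore it even if an error occurs below
  setwd(path)
  list.files()             # do the work relative to 'path'
}
```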

  ✖ avoid sapply(), it is not type safe. It might return a
    vector, or a list, depending on the input data. Consider using
    vapply() instead.

    R/JSTOR_MALLET_hotncoldtopics.R:107:65
    R/JSTOR_MALLET_topicsovertime.R:161:5
    R/JSTOR_MALLET_topicsovertime.R:191:19
    R/JSTOR_unpack_multiple_archives.R:67:9
    R/JSTOR_unpack1grams.R:63:9
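The difference is easy to see with a small generic example: vapply() declares the expected type and length of each result, so it fails fast instead of silently changing its return type:

```r
words <- c("jstor", "text", "mining")
# sapply() happens to return an integer vector here, but that is not guaranteed
sapply(words, nchar)
# vapply() guarantees one integer per element, or an error
vapply(words, nchar, FUN.VALUE = integer(1))
```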

  ✖ avoid the library() and require() functions, they change
    the global search path. If you need to use other packages, import
    them. If you need to load them explicitly, then consider
    loadNamespace() instead, or as a last resort, declare them as
    'Depends' dependencies.
  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...),
    1:NROW(...) and 1:NCOL(...) expressions. They are error prone and
    result in 1:0 if the expression on the right hand side is zero. Use
    seq_len() or seq_along() instead.

    R/JSTOR_clusterbywords.R:44:32
    R/JSTOR_clusterbywords.R:100:14
    R/JSTOR_dtmofnouns.R:43:44
    R/JSTOR_findassocs.R:160:39
    R/JSTOR_findassocs.R:166:39
    ... and 21 more lines
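The 1:0 trap is easy to demonstrate on an empty input (generic example, not from the package):

```r
x <- character(0)
1:length(x)         # yields c(1, 0), so a loop over it runs twice -- a bug
seq_len(length(x))  # yields integer(0), so the loop body never runs
seq_along(x)        # equivalent, and clearer
```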

  -- more lines omitted for brevity's sake

There's a bunch of non-ASCII text in some data stored in the pkg; check out tools::showNonASCIIfile()


Reviewers: @leeper @kbenoit Due date: 2017-02-03

leeper commented 7 years ago

I just took an initial look at this and am seeing lots of R CMD check warnings and notes (which are also shown on Travis). Most of them relate to dependencies because of how some of the packages are being used (namely using library() and unqualified package function calls). Most of these appear to be due to:

  1. unnecessary library() calls given that dependencies are already being imported (either switch these to Suggests and requireNamespace() or leave as imports without library()),
  2. missing imports of base packages (e.g., utils, stats), and
  3. ggplot2 variables (see this post for how to handle that).
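The Suggests + requireNamespace() pattern mentioned in point 1 might look like this (the function name is hypothetical; the ggdendro call is just an example of a qualified call):

```r
plot_dendro <- function(hc) {
  # fail informatively if the suggested package is not installed
  if (!requireNamespace("ggdendro", quietly = TRUE)) {
    stop("Package 'ggdendro' is needed for this function; please install it.")
  }
  ggdendro::ggdendrogram(hc)  # qualified call, no library() required
}
```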

Do you want to clean these up first before I review?

sckott commented 7 years ago

missing imports of base packages (e.g., utils, stats)

@benmarwick you could also just namespace these so you don't have to list them in imports, like stats::setNames

benmarwick commented 7 years ago

Thanks, I'll clean these up and get back to you with an update here

sckott commented 7 years ago

both reviewers assigned.

@benmarwick ping back here when you get that stuff cleaned up

kbenoit commented 7 years ago

Maybe this is in a holding stage but in the spirit of better late than never, I've refreshed my clone of this repo and refreshed the comments I started last month. Here they are.

Package Review

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 2.5


Review Comments

Basic motivation and value added of the package

Below, I've included some practical issues that could be addressed to improve the package and which have fairly straightforward solutions (such as: use knitr for the README; write a vignette; etc.). Before turning to those, however, I will give some tough but, I hope, fair comments about what I see as the basic need for such a package.

In essence, this package is a wrapper around a set of text-based utilities to analyze texts with a specific data structure, namely the article texts and some document-level variables from JSTOR. Its stated purpose is to perform word analysis, clustering, and topic modelling on the texts, with the option of using the document-level variables to show topic prevalence across time. However, these really have nothing to do with JSTOR, but rather with the list structure created by the JSTOR_unpack* functions that transform the JSTOR .csv files into a list of document-term matrices and the document-level variables. The parts specific to JSTOR or article data, in other words, are only in having the time variable, and this could be part of any textual data structure.

When I first cloned the package, I thought it would have something to do with querying data from a JSTOR API, or special processing for JSTOR data, but in fact it has nothing to do with that. Rather it reads in a .csv or .tsv file with only the expectation that specific fields exist in the data. The interaction with JSTOR is entirely up to the user, and the package has to assume that the data provided by JSTOR keeps the same structure as it is expecting.

Now here I am going to be that annoying reviewer who suggests citing his or her own work, in a slightly different way: there are packages to do this, including one I've worked on, which most users tell me is more straightforward than tm. The package quanteda, along with its companion package readtext, can read in files from .zip, .csv, and .tsv data and make that into a corpus, including automatic processing of the document variables, and then create document-feature matrices that (at least in the current GitHub version) retain the document variables. This makes it easy to keep them for things like the stm package, which can use additional variables for fitting correlated LDA models, or machine learning, etc. quanteda can do almost all that you are doing in this package, including:

It does not directly filter nouns but we are hard at work on a package spacyr that will tag a quanteda corpus and add new token selection tools to select on parts of speech.

These are found at http://github.com/kbenoit/quanteda, http://github.com/kbenoit/readtext, and http://github.com/kbenoit/spacyr.

Ok, so now </annoyingselfpromotion>, but please convince me that this set of wrappers, which lacks the flexibility of a more basic approach, is better for most users, especially when some of the functions reimplement completely generic tools while purportedly doing so in a way that has something to do with JSTOR, such as JSTOR_findassocs. To my mind, this would be like creating a package called worldbank_lm which fitted linear models to the World Bank development indicators dataset, after first getting the instructions to download this dataset and load it through a wrapper to read.table called worldbank_readtable. Maybe there is a case to be made that this will help R-challenged bibliometricians, but if so then this package should make that case. Otherwise, while not quite reinventing the wheel, it's creating an additional wheel wrapped around the original wheel, which might be better and more effective without the wrapper.

Another way to make this case, of course, is to convince me what is special about the wrapper set that extracts more value from the JSTOR data specifically. This goes beyond an ease of use motivation, because it argues that putting together components in a specially selected way leads to specific insights about journal articles.

Don't get me wrong, I have total respect for anyone taking their precious time and effort to write packages and to tidy them up sufficiently to submit them to rOpenSci, and I don't want to discourage that. But part of that process involves applying the rOpenSci standards, and I think this package needs much stronger motivation to meet those.

Note also that I am being slightly hypocritical in that two of the three packages I list above are not yet on CRAN (one because of installation issues on Windows involving the Python/Cython links), and quanteda I have not yet submitted to rOpenSci. But we have been working toward that with the standards firmly in mind. If you wanted to use more of quanteda for the scaffolding of this package I would be happy to advise or assist.

Now on to more mundane issues.

Installation problems

Overall, I could not test the package because I could not attach it. I followed the installation instructions in the README.md and was able to build the package, but it then fails to load:

> library(JSTORr)
Error : .onLoad failed in loadNamespace() for 'rJava', details:
  call: dyn.load(file, DLLpath = DLLpath, ...)
  error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.3/Resources/library/rJava/libs/rJava.so':
  dlopen(/Library/Frameworks/R.framework/Versions/3.3/Resources/library/rJava/libs/rJava.so, 6): Library not loaded: @rpath/libjvm.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/3.3/Resources/library/rJava/libs/rJava.so
  Reason: image not found
In addition: Warning messages:
1: replacing previous import ‘NLP::annotate’ by ‘ggplot2::annotate’ when loading ‘JSTORr’
2: replacing previous import ‘apcluster::similarity’ by ‘igraph::similarity’ when loading ‘JSTORr’
Error: package or namespace load failed for ‘JSTORr’

I'm running:

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.3

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] igraph_1.0.1         Rcpp_0.12.9          magrittr_1.5         knitr_1.15.1         cluster_2.0.5        MASS_7.3-45          leaps_3.0           
 [8] munsell_0.4.3        scatterplot3d_0.3-38 colorspace_1.3-2     lattice_0.20-34      ggdendro_0.1-20      plyr_1.8.4           tools_3.3.2         
[15] grid_3.3.2           data.table_1.10.4    gtable_0.2.0         lazyeval_0.2.0       digest_0.6.12        assertthat_0.1       tibble_1.2          
[22] Matrix_1.2-8         NLP_0.1-10           lda_1.4.2            gridExtra_2.2.1      FactoMineR_1.35      apcluster_1.4.3      ggplot2_2.2.1       
[29] scales_0.4.1         flashClust_1.01-2    XML_3.98-1.5   

I admit to having the same problems when trying to get other Java-based R packages to work. (This is why I have given up on Java-R integration and shifted to the Python-based http://spaCy.io).

Namespace conflicts, imports

When building or running check, I got numerous warnings due to the very large number of packages whose namespace this package imports in their entirety. These can be eliminated by importing just the functions you need from the packages, rather than their entire namespaces:

* checking whether package ‘JSTORr’ can be installed ... WARNING
Found the following significant warnings:
  Warning: replacing previous import ‘NLP::annotate’ by ‘ggplot2::annotate’ when loading ‘JSTORr’
  Warning: replacing previous import ‘apcluster::similarity’ by ‘igraph::similarity’ when loading ‘JSTORr’
  Warning: replacing previous import ‘data.table::melt’ by ‘reshape2::melt’ when loading ‘JSTORr’
  Warning: replacing previous import ‘data.table::dcast’ by ‘reshape2::dcast’ when loading ‘JSTORr’
  Warning: replacing previous import ‘igraph::%>%’ by ‘stringr::%>%’ when loading ‘JSTORr’

It's overkill to import the whole package, e.g. in JSTOR_dtmofnouns.R:

#' @import tm NLP data.table openNLP
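Importing only the specific functions needed, e.g. with roxygen2's @importFrom, avoids the namespace clashes (the exact function names below are an assumption about what JSTOR_dtmofnouns.R actually uses, as a sketch):

```r
#' @importFrom tm stopwords
#' @importFrom NLP as.String
#' @importFrom data.table data.table
#' @importFrom openNLP Maxent_POS_Tag_Annotator
```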

No demonstration of how it works, or vignette

Use README.Rmd instead of README.md, and link it to commit hooks to ensure it is always knitted. The README.md contains executable code demonstrating the package, but no output. It would be far better to demonstrate what the package does through the knitted output. Better still, this could also form part of a vignette with more extended examples.

Other issues

sckott commented 7 years ago

thanks for your review @kbenoit !

@benmarwick any updates on your last comment https://github.com/ropensci/onboarding/issues/86#issuecomment-269055379 ?

sckott commented 7 years ago

@benmarwick any updates on your last comment https://github.com/ropensci/onboarding/issues/86#issuecomment-269055379 ?

benmarwick commented 7 years ago

Yes, thanks for the reminder, and @kbenoit for the comprehensive and thoughtful review. I'll post a more substantial response within this week.

sckott commented 7 years ago

@benmarwick should we continue to hold on this?

benmarwick commented 7 years ago

ok to close this for now.

sckott commented 7 years ago

@benmarwick what do you mean? like you need a while before you can get back to this? if so we can keep the holding tag on it, and just revisit every once in a while

OR, are you saying you don't want to go through with the submission anymore?

benmarwick commented 7 years ago

Holding sounds good, thanks!

sckott commented 7 years ago

okay, then I'll reopen