ropensci / software-review

rOpenSci Software Peer Review.
291 stars 104 forks source link

wordVectors #92

Closed bmschmidt closed 5 years ago

bmschmidt commented 7 years ago

Summary

Package: wordVectors
Type: Package
Title: Tools for creating and analyzing vector-space models of texts
Version: 2.0
Date: 2016-01-23
Author: Ben Schmidt, Jian Li
Maintainer: Ben Schmidt <bmschmidt@gmail.com>
Description:
    wordVectors wraps Google's implementation in C for training word2vec models,
    and provides several R functions for exploratory data analysis of word2vec
    and other related models. These include import-export from the binary format,
    some useful linear algebra operations missing from R, and a streamlined
    syntax for working with models and performing vector arithmetic that make it
    easier to perform useful operations in a word-vector space.
License: Apache License (== 2.0)
Depends:
    R (>= 2.14.0)
LazyData: TRUE
Imports:
    graphics,
    methods,
    utils
Suggests:
    stringi,
    tsne,
    maggritr,
    testthat,
    ggplot2
RoxygenNote: 5.0.1

https://github.com/bmschmidt/wordVectors/tree/dev

(Note that this is a dev branch of the repo I'm submitting: I've made a couple breaking changes to the API for review, but don't want to inconvenience current users if more are in the pipeline.)

  1. People working with large textual collections who want to explore them with word2vec models they train themselves.

  2. People doing NLP tasks who want to incorporate existing models (like Google's news vectors) into their pipeline.

  3. Workshops, classrooms, and other situations where people want to promote exploration of word2vec or other models.

I built off of an existing package here on which development seems to be abandoned. My package has fewer bugs, and offers access to more of the underlying bindings. There's another package offering access to all the word2vec bindings here; it's maintainer has contributed patches to this project, and I think their functionality is roughly comparable on the core functions.

The big difference is that unlike either of these, my package offers import/export in the binary format, and wrappers for cosine similarity and the like, with syntactic sugar for word2vec models.

Requirements

Confirm each of the following by checking the box. This package:

Publication options

(how do I get a long-term repository for JOSS?)

Detail

Two warnings:

Number one has only recently appeared: the instructions (R CMD build --resave-data) seem not to work.

checking data for ASCII and uncompressed saves ... WARNING

  Note: significantly better compression could be obtained
        by using R CMD build --resave-data
                   old_size new_size compress
  demo_vectors.rda    927Kb    643Kb    bzip2checking compiled code ... WARNING

More seriously (?), some of the word2vec C code throws warnings because they use read/write functions. I don't know R's C bindings well enough to fix these; many people are currently using it in production.

File ‘wordVectors/libs/wordVectors.so’:
  Found ‘exit’, possibly from ‘exit’ (C)
    Object: ‘tmcn_word2vec.o’
  Found ‘puts’, possibly from ‘printf’ (C), ‘puts’ (C)
    Objects: ‘tmcn_distance.o’, ‘tmcn_word2vec.o’
  Found ‘stdout’, possibly from ‘stdout’ (C)
    Object: ‘tmcn_word2vec.o’

Build is tested on Ubuntu and OS X. I've also heard reports of builds failing on certain Windows versions. There's been much less of this since a few recent patches and I believe it to be fixed, but I don't have access to a test suite across old Windows versions to be sure.

  1. Variable naming is sometimes camel case; I have not changed this to avoid breaking users' existing code. I think that's it.

@lmullen has used the package and has contributed to you all a lot. He could perhaps suggest other uses.

noamross commented 7 years ago

Editor checks:


Thank you for the submission @bmschmidt! Traditionally something like wordVectors is a bit out-of-scope for us, but we are piloting accepting more methods-focused packages within some select sub-fields, starting with text mining. As such @lmullen will help out as co-editor here to find appropriate reviewers and ensure that implementation is well-reviewed and consistent with published methods.

@lmullen, can you give your thoughts on the additional utility of this package compared to the others above, as well as its relationship to other ones such as text2vec or h20?

Below is output from goodpractice::gp(). I note among other things that testing coverage is relatively low. covr::shine(covr::package_coverage() is a good way to look at what areas of code may be better tested. Reviewers may suggest additional testing. I suggest you integrate coverage testing into your CI (more info here) and add a coverage badge - we're moving to requiring this of all packages soon.

Using camelCase is fine to ensure back-compatibility. For Windows, if you're not already, consider using AppVeyor for CI or rhub for intermittent testing. The C warning issues we'll aim to address in review.

Nothing above is critical but I suggest you address these issues as you can while we finish editor's review and then seek reviewers.

── GP wordVectors ──────────────────────
It is good practice to

  ✖ write unit tests for all functions, and all package code in general. 14% of code lines are
    covered by test cases.

    R/matrixFunctions.R:41:NA
    R/matrixFunctions.R:85:NA
    R/matrixFunctions.R:86:NA
    R/matrixFunctions.R:87:NA
    R/matrixFunctions.R:88:NA
    ... and 915 more lines

  ✖ omit "Date" in DESCRIPTION. It is not required and it gets invalid quite often. A build date
    will be added to the package when you perform `R CMD build` on it.
  ✖ add a "URL" field to DESCRIPTION. It helps users find information about your package online. If
    your package does not have a homepage, add an URL to GitHub, or the CRAN package package page.
  ✖ add a "BugReports" field to DESCRIPTION, and point it to a bug tracker. Many online code hosting
    services provide bug trackers for free, https://github.com, https://gitlab.com, etc.
  ✖ use '<-' for assignment instead of '='. '<-' is the standard, and R users and developers are
    used it and it is easier to read your code for them if you use '<-'.

    R/matrixFunctions.R:9:14
    R/matrixFunctions.R:35:17
    R/matrixFunctions.R:77:19
    R/matrixFunctions.R:81:34
    R/matrixFunctions.R:108:10
    ... and 102 more lines

  ✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows that
    are about 80 characters wide. Try make your lines shorter than 80 characters

    R/matrixFunctions.R:85:1
    R/matrixFunctions.R:137:1
    R/matrixFunctions.R:190:1
    R/matrixFunctions.R:220:1
    R/matrixFunctions.R:275:1
    ... and 23 more lines

  ✖ avoid the library() and require() functions, they change the global search path. If you need to
    use other packages, import them. If you need to load them explicitly, then consider
    loadNamespace() instead, or as a last resort, declare them as 'Depends' dependencies.

    R/word2vec.R:143:7

  ✖ fix this R CMD check NOTE: 'library' or 'require' call to ‘stringi’ in package code. Please use
    :: or requireNamespace() instead. See section 'Suggested packages' in the 'Writing R Extensions'
    manual. Namespace in Imports field not imported from: ‘graphics’ All declared Imports should be
    used.
  ✖ fix this R CMD check NOTE: project: no visible global function definition for ‘new’
    square_magnitudes: no visible global function definition for ‘.hasSlot’
    -,VectorSpaceModel-VectorSpaceModel: no visible global function definition for ‘callNextMethod’
    -,VectorSpaceModel-VectorSpaceModel: no visible global function definition for ‘new’
    [,VectorSpaceModel: no visible global function definition for ‘callNextMethod’
    [,VectorSpaceModel: no visible global function definition for ‘new’ initialize,VectorSpaceModel:
    no visible global function definition for ‘callNextMethod’ plot,VectorSpaceModel: no visible
    global function definition for ‘predict’ plot,VectorSpaceModel: no visible global function
    definition for ‘prcomp’ plot,VectorSpaceModel: no visible global function definition for ‘text’
    Undefined global functions or variables: .hasSlot callNextMethod new prcomp predict text
    Consider adding importFrom("graphics", "text") importFrom("methods", ".hasSlot",
    "callNextMethod", "new") importFrom("stats", "prcomp", "predict") to your NAMESPACE file (and
    ensure that your DESCRIPTION Imports field contains 'methods').
  ✖ fix this R CMD check WARNING: Note: significantly better compression could be obtained by using
    R CMD build --resave-data old_size new_size compress demo_vectors.rda 927Kb 643Kb bzip2
  ✖ fix this R CMD check NOTE: Found the following hidden files and directories: .travis.yml These
    were most likely included in error. See section ‘Package structure’ in the ‘Writing R
    Extensions’ manual. (use .Rbuildignore for this)
  ✖ avoid 'T' and 'F', as they are just variables which are set to the logicals 'TRUE' and 'FALSE'
    by default, but are not reserved words and hence can be overwritten by the user.  Hence, one
    should always use 'TRUE' and 'FALSE' for the logicals.

    R/matrixFunctions.R:NA:NA
    R/matrixFunctions.R:NA:NA
    R/word2vec.R:NA:NA
    R/word2vec.R:NA:NA
    R/word2vec.R:NA:NA
    ... and 7 more lines

───────────────────

(Also devtools::spell_check() found "interepreted" at nearest_to.Rd:38)

Reviewers: @juliasilge @jeroenooms @tedunderwood Due date: 2017-03-07

noamross commented 7 years ago

Welcome reviewers @juliasilge, @jeroenooms, and @tedunderwood! As we've update the review process a bit, so please take a look at the latest reviewing and packaging guides. We now also have a review template. @tedunderwood, Thanks for coming aboard - @lmullen recommended you specifically in regard to reviewing the method implementation. Of course you may comment on all topics as you see fit.

jeroen commented 7 years ago

Review 1

Hi Benjamin. Thank you for this submission, which seems very useful.

Unfortunately there are serious problems with the C code that need to be addressed before this can be published, either on ropensci or CRAN.

CMD check

The package should pass without warnings or errors on CMD check. Currently you have the following problems:

* checking dependencies in R code ... NOTE
'library' or 'require' call to ‘stringi’ in package code.
  Please use :: or requireNamespace() instead.
  See section 'Suggested packages' in the 'Writing R Extensions' manual.

It literally tells you what to do. Do not use library but instead call the function from the other package via pkg::fun().

distend: no visible global function definition for ‘new’
improve_vectorspace: no visible global function definition for ‘new’
improve_vectorspace: no visible global function definition for ‘prcomp’
project: no visible global function definition for ‘new’
square_magnitudes: no visible global function definition for ‘.hasSlot’
-,VectorSpaceModel-VectorSpaceModel: no visible global function
  definition for ‘new’
[,VectorSpaceModel: no visible global function definition for
  ‘callNextMethod’
[,VectorSpaceModel: no visible global function definition for ‘new’
initialize,VectorSpaceModel: no visible global function definition for
  ‘callNextMethod’
plot,VectorSpaceModel: no visible global function definition for
  ‘predict’
plot,VectorSpaceModel: no visible global function definition for
  ‘prcomp’
plot,VectorSpaceModel: no visible global function definition for ‘text’
Undefined global functions or variables:
  .hasSlot callNextMethod new prcomp predict text
Consider adding
  importFrom("graphics", "text")
  importFrom("methods", ".hasSlot", "callNextMethod", "new")
  importFrom("stats", "prcomp", "predict")
to your NAMESPACE file (and ensure that your DESCRIPTION Imports field
contains 'methods').

You are missing a bunch of imports. Please make sure you @import all the functions that you use from other packages. All these notes above should disappear.

Compiler warnings

Packages should build without warnings when compiling with -Wall -pedantic. Compiler warnings are an indication that there are serious bugs in your code that can crash R. We don't want that. Compiling the package on OSX gives me the following warnings:

In file included from tmcn_distance.c:2:
./distance.h:25:12: warning: variable length array folded to constant array as an extension [-Wgnu-folding-constant]
  char st1[max_size];
           ^~~~~~~~
./distance.h:26:17: warning: variable length array folded to constant array as an extension [-Wgnu-folding-constant]
  char bestw[N][max_size];
                ^~~~~~~~
./distance.h:26:14: warning: variable length array folded to constant array as an extension [-Wgnu-folding-constant]
  char bestw[N][max_size];
             ^
./distance.h:27:18: warning: variable length array folded to constant array as an extension [-Wgnu-folding-constant]
  char file_name[max_size], st[100][max_size];
                 ^~~~~~~~
./distance.h:27:37: warning: variable length array folded to constant array as an extension [-Wgnu-folding-constant]
  char file_name[max_size], st[100][max_size];
                                    ^~~~~~~~
./distance.h:28:26: warning: variable length array folded to constant array as an extension [-Wgnu-folding-constant]
  float dist, len, bestd[N], vec[max_size];
                         ^
./distance.h:28:34: warning: variable length array folded to constant array as an extension [-Wgnu-folding-constant]
  float dist, len, bestd[N], vec[max_size];
                                 ^~~~~~~~
./distance.h:40:35: warning: if statement has empty body [-Wempty-body]
  if(fscanf(f, "%lld", &words)==1);
                                  ^
./distance.h:40:35: note: put the semicolon on a separate line to silence this warning
./distance.h:41:34: warning: if statement has empty body [-Wempty-body]
  if(fscanf(f, "%lld", &size)==1);
                                 ^
./distance.h:41:34: note: put the semicolon on a separate line to silence this warning
./distance.h:42:19: warning: implicitly declaring library function 'malloc' with type 'void *(unsigned long)' [-Wimplicit-function-declaration]
  vocab = (char *)malloc((long long)words * max_w * sizeof(char));
                  ^
./distance.h:42:19: note: include the header <stdlib.h> or explicitly provide a declaration for 'malloc'
./distance.h:49:53: warning: if statement has empty body [-Wempty-body]
    if(fscanf(f, "%s%c", &vocab[b * max_w], &ch)==1);
                                                    ^
./distance.h:49:53: note: put the semicolon on a separate line to silence this warning
./distance.h:50:83: warning: if statement has empty body [-Wempty-body]
    for (a = 0; a < size; a++) if(fread(&M[a + b * size], sizeof(float), 1, f)==1);
                                                                                  ^
./distance.h:50:83: note: put the semicolon on a separate line to silence this warning
clang -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include   -Wall -pedantic -fPIC  -Wall -mtune=core2 -g -O2  -c tmcn_word2vec.c -o tmcn_word2vec.o
12 warnings generated.
In file included from tmcn_word2vec.c:3:
./word2vec.h:321:51: warning: if statement has empty body [-Wempty-body]
    if(fscanf(fin, "%lld%c", &vocab[a].cn, &c)==1);
                                                  ^

Illegal calls

CMD check also warns about the following:

File ‘wordVectors/libs/wordVectors.so’:
  Found ‘___stdoutp’, possibly from ‘stdout’ (C)
    Object: ‘tmcn_word2vec.o’
  Found ‘_exit’, possibly from ‘exit’ (C)
    Object: ‘tmcn_word2vec.o’
  Found ‘_printf’, possibly from ‘printf’ (C)
    Objects: ‘tmcn_distance.o’, ‘tmcn_word2vec.o’
  Found ‘_puts’, possibly from ‘printf’ (C), ‘puts’ (C)
    Objects: ‘tmcn_distance.o’, ‘tmcn_word2vec.o’

It means your C code is calling stdout(), printf(), exit(), puts() which are not allowed because again, they can crash R.

Instead of printf you might be able to use Rprintf. R code should never call exit() because that will quit R. Instead your code should raise an error Rf_error when something is wrong.

.C vs .Call

You are using the .C interface to call your C code from R. This interface is very old and should not be used anymore for new packages. It is highly recommended to use the .Call system instead, which allows you to do type checking and manipulate R objects on the C level.

You borrow examples from almost any other popular R package with C code (digest, jsonlite, etc)

Windows

You are overriding compiler settings for windows in your Makevars.win file. You should never do this unless you know what you are doing. Please delete the PKG_CFLAGS line from your Makevars.win file.

noamross commented 7 years ago

Thanks, @jeroenooms! @bmschmidt, it will be easier to get through the rest of the review with these issues fixed. Please let us know if we can help with addressing them.

I'd also suggest switching to warnings_are_errors: true in your .travis.yml for more thorough checks.

bmschmidt commented 7 years ago

Thanks everyone for signing on here. It's a great set of reviewers, excited to hear from you all.

minor fixes

To address @noamross's questions I've removed warnings_are_errors, and changed a little bit in the testing to up testing coverage to 48%. It now fails under R 3.4 (or whatever devel is on travis) with a new error about vignette documentation, but builds under the current version.


Jeroen Ooms response

Thank you for these notes, @jeroenooms, they are very helpful. In reading them I realized I added some code after submitting here that adding a lot of cruft to the checks after submitting; my apologies for that.

I've fixed a lot of this in the latest commit, including compiler warnings, library calls, and improper usage of stdout and fprint.

There are two issues that I haven't fixed and would appreciate any advice on.

.C vs .Call.

I haven't yet fixed this because it looks like a fair amount of work for me on the C side, since I don't really know what I'm doing there, without a payoff. I've read some about the advantages of .Call over .C, and they don't really seem to apply here. Everything being passed in the .C calls are strings or single numbers substituting for command-line flags in the original code; it's never more than a couple dozen bytes, so there's no performance effect.

If I were more comfortable interfacing C and R code I would switch this right away. But to rewrite functions to use .Call seems to requiring learning a bit about internals that I'm not interested in right now, and debugging is difficult because my whole installation crashes on a mistake. (Do I need to run inside a C debugger, or what?)

So I guess I'm asking: how serious an issue this is?

Windows

The Makevars.win override was a pull request from someone that may have fixed a lot of Windows errors I was hearing about. So while I don't know what I'm doing, it's possible that the contributor did. I will try removing it once I get some integration tests on Windows running using @noamross's suggestions above and see what happens.

bmschmidt commented 7 years ago

Ugh. Testing out the vignettes I realized that this latest batch of changes to the C code seems to have created a memory leak that crashes R on anything other than toy data. So maybe I have to learn more C debugging than I wanted after all. Just an FYI before anyone tries to run this.

bmschmidt commented 7 years ago

OK, that issue is now fixed, but updates on builds are no longer printed to the console. I've added an issue within the repo here; if anyone knows why Rprintf would crash R through a memory leak where printf didn't (the opposite of what's supposed to happen?) I'd welcome any comments there or here.

juliasilge commented 7 years ago

Package Review

I am so happy to be reviewing this package! I have enjoyed playing around with its functionality.

> model %>% nearest_to(model[["darcy"]] - model[["pemberley"]] + model[["hartfield"]])
  knightley   hartfield       elton      martin      weston     harriet        emma knightley's   woodhouse      emma's 
  0.4152193   0.4363070   0.4769714   0.4921517   0.5070818   0.5139842   0.5250399   0.5654097   0.5708246   0.5861962 

> model %>% nearest_to(model[["darcy"]] - model[["pemberley"]] + model[["norland"]])
    norland      elinor  devonshire    marianne    dashwood      divide      sussex unavoidable      mother     ferrars 
  0.3098889   0.5843901   0.5943843   0.6005002   0.6231055   0.6261682   0.6280522   0.6293118   0.6393082   0.6394483 

> cosineSimilarity(model[[c("darcy","knightley","tilney"), average=FALSE]], 
                   model[[c("elizabeth","emma","catherine"), average=FALSE]])
                emma   elizabeth  catherine
darcy     0.02762390  0.67302184  0.1003745
knightley 0.69530175 -0.07843539 -0.1304497
tilney    0.07108335  0.12679012  0.637369

Documentation

The package includes all the following forms of documentation:

Functionality

Final approval (post-review)

Estimated hours spent reviewing: 2


Review Comments

tedunderwood commented 7 years ago

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Functionality

Estimated hours spent reviewing: 2


Review Comments

I don't have much experience with R package development, so I haven't focused on unit tests or packaging guidelines in these comments. I think the other reviewers in this group will address those questions more effectively.

Instead I've looked at the package from the perspective of a user interested in text mining.

The need for the package is clear; the audience of scholars interested in word2vec is definitely larger than the audience comfortable with existing implementations. And I do think this implementation successfully streamlines the task of training, and applying, a word2vec model. In particular, the inclusion of a function that prepares a corpus for training is an important bridge over an otherwise tricky phase of this process.

Installation was straightforward, and I had no difficulty following instructions to prepare a corpus and train a model. I encountered no warnings or errors.

I did find that I had some unanswered questions. It’s not difficult to extract a vocabulary from the VectorSpaceModel object, but I found myself wondering whether this is an absolutely complete vocabulary for the corpus, or whether there is some minimum frequency threshold. If it’s the former, nothing needs to be added beyond a single brief sentence somewhere in the documentation. Users may also wonder about the frequency threshold for fusing common bigrams.

The vignette and README files were very informative, and guided me successfully through early stages of EDA on the VectorSpaceModel (e.g. the kmeans example is very useful). But I did find myself wishing that I also had a document like the .pdf package descriptions I’m accustomed to from CRAN, which simply list all the functions available in a given package. Of course, the github repo itself does this. I was able to read the vignette on github, but not actually able to invoke it from within RStudio.

In preparing my corpus, prep_word2vec() turned all linebreaks into NA. NA then also appeared as the second most common “word” in the vocabulary, and as a separate row in the matrix. This may be a standard aspect of corpus prep rather than a bug, but I think it’s possible that some users will be puzzled by it — so, even if it is expected behavior, some explanation should probably be provided.

(If it's not expected behavior, I could share the files to help reproduce it. I think these were just utf-8 txt files; I was running R 3.3.2 under Mac OS 10.12.3.)

Generally, though, this strikes me as a trouble-free, well-documented implementation of an important emerging method, likely to get wide usage.

bmschmidt commented 7 years ago

Thanks so much for your comments, @tedunderwood and @juliasilge. Your time is muc appreciated.

I really hate to say this, but I believe, in reading @juliasilge's comments, that she/you must have tested the master branch which is several commits behind dev. Most of the problems are now fixed, including some greater coverage on the unit tests. I feel very bad about this, because it's obviously my fault for burying such an awkward request in the submission issue here. Sorry!

So, @juliasilge: I don't want to make you spend any more time, but given how much I appreciate your super-useful work with tidytext, as long as I've got you here I just want to flag for you the dev-branch vignette specifically about exploratory analysis (http://htmlpreview.github.io/?https://github.com/bmschmidt/wordVectors/blob/dev/inst/doc/exploration.html) which uses Hadleyverse-like formula interfaces on text instead of double brackets for access: eg, vectorset %>% nearest_to(~"good"*3 + "bad). I don't know that it's super important, but it is a neat syntactic trick that seems to highlight some R-language advantages specifically over python. And you might be happy that the output from nearest_to is now a data.frame, which means it could be easily integrated into a tidytext chain (eg, instead of using a pre-built sentiment dictionary, tag sentiment by polarity along the vector "good" - "bad"). There's also a bigger introductory vignette is here which isn't very different from the old README.md.

I'm not sure what the next steps are: there are several issues above to address. Would it be helpful for me to pull what I see as the unresolved ones into a comment, @noamross?

bmschmidt commented 7 years ago

I'm just going to post a partial list here and update it as time goes on. Hopefully that method won't spam anyone. I guess I could file these as issues on my repo, too.

From @jeroenooms

From @juliasilge

From @tedunderwood

noamross commented 7 years ago

Thanks for your reviews, @tedunderwood and @juliasilge. Sorry that this process has been a bit more convoluted than usual. @jeroenooms was waiting on getting the C issues fixed to the point of passing R CMD Check before completing review and doing a checklist (Jeroen, perhaps you can comment on https://github.com/bmschmidt/wordVectors/issues/29). So @bmschmidt, I suggest you proceed fixing the issues listed above, and when you've addressed them merge to master for second round comments.

noamross commented 7 years ago

@bmschmidt Any updates on the above?

noamross commented 7 years ago

@bmschmidt Just checking in on your progress here. Are you planning on revising, and do you have an expected timeline?

sckott commented 7 years ago

apply holding label?

noamross commented 5 years ago

Closing this submission as it has been dormant for > 1 yr. Authors may re-submit if they would like to continue the process.