Closed bmschmidt closed 5 years ago
Thank you for the submission @bmschmidt! Traditionally something like wordVectors is a bit out-of-scope for us, but we are piloting accepting more methods-focused packages within some select sub-fields, starting with text mining. As such @lmullen will help out as co-editor here to find appropriate reviewers and ensure that implementation is well-reviewed and consistent with published methods.
@lmullen, can you give your thoughts on the additional utility of this package compared to the others above, as well as its relationship to other ones such as text2vec or h20?
Below is output from goodpractice::gp()
. I note among other things that testing coverage is relatively low. covr::shine(covr::package_coverage()
is a good way to look at what areas of code may be better tested. Reviewers may suggest additional testing. I suggest you integrate coverage testing into your CI (more info here) and add a coverage badge - we're moving to requiring this of all packages soon.
Using camelCase is fine to ensure back-compatibility. For Windows, if you're not already, consider using AppVeyor for CI or rhub for intermittent testing. The C warning issues we'll aim to address in review.
Nothing above is critical but I suggest you address these issues as you can while we finish editor's review and then seek reviewers.
── GP wordVectors ──────────────────────
It is good practice to
✖ write unit tests for all functions, and all package code in general. 14% of code lines are
covered by test cases.
R/matrixFunctions.R:41:NA
R/matrixFunctions.R:85:NA
R/matrixFunctions.R:86:NA
R/matrixFunctions.R:87:NA
R/matrixFunctions.R:88:NA
... and 915 more lines
✖ omit "Date" in DESCRIPTION. It is not required and it gets invalid quite often. A build date
will be added to the package when you perform `R CMD build` on it.
✖ add a "URL" field to DESCRIPTION. It helps users find information about your package online. If
your package does not have a homepage, add an URL to GitHub, or the CRAN package package page.
✖ add a "BugReports" field to DESCRIPTION, and point it to a bug tracker. Many online code hosting
services provide bug trackers for free, https://github.com, https://gitlab.com, etc.
✖ use '<-' for assignment instead of '='. '<-' is the standard, and R users and developers are
used it and it is easier to read your code for them if you use '<-'.
R/matrixFunctions.R:9:14
R/matrixFunctions.R:35:17
R/matrixFunctions.R:77:19
R/matrixFunctions.R:81:34
R/matrixFunctions.R:108:10
... and 102 more lines
✖ avoid long code lines, it is bad for readability. Also, many people prefer editor windows that
are about 80 characters wide. Try make your lines shorter than 80 characters
R/matrixFunctions.R:85:1
R/matrixFunctions.R:137:1
R/matrixFunctions.R:190:1
R/matrixFunctions.R:220:1
R/matrixFunctions.R:275:1
... and 23 more lines
✖ avoid the library() and require() functions, they change the global search path. If you need to
use other packages, import them. If you need to load them explicitly, then consider
loadNamespace() instead, or as a last resort, declare them as 'Depends' dependencies.
R/word2vec.R:143:7
✖ fix this R CMD check NOTE: 'library' or 'require' call to ‘stringi’ in package code. Please use
:: or requireNamespace() instead. See section 'Suggested packages' in the 'Writing R Extensions'
manual. Namespace in Imports field not imported from: ‘graphics’ All declared Imports should be
used.
✖ fix this R CMD check NOTE: project: no visible global function definition for ‘new’
square_magnitudes: no visible global function definition for ‘.hasSlot’
-,VectorSpaceModel-VectorSpaceModel: no visible global function definition for ‘callNextMethod’
-,VectorSpaceModel-VectorSpaceModel: no visible global function definition for ‘new’
[,VectorSpaceModel: no visible global function definition for ‘callNextMethod’
[,VectorSpaceModel: no visible global function definition for ‘new’ initialize,VectorSpaceModel:
no visible global function definition for ‘callNextMethod’ plot,VectorSpaceModel: no visible
global function definition for ‘predict’ plot,VectorSpaceModel: no visible global function
definition for ‘prcomp’ plot,VectorSpaceModel: no visible global function definition for ‘text’
Undefined global functions or variables: .hasSlot callNextMethod new prcomp predict text
Consider adding importFrom("graphics", "text") importFrom("methods", ".hasSlot",
"callNextMethod", "new") importFrom("stats", "prcomp", "predict") to your NAMESPACE file (and
ensure that your DESCRIPTION Imports field contains 'methods').
✖ fix this R CMD check WARNING: Note: significantly better compression could be obtained by using
R CMD build --resave-data old_size new_size compress demo_vectors.rda 927Kb 643Kb bzip2
✖ fix this R CMD check NOTE: Found the following hidden files and directories: .travis.yml These
were most likely included in error. See section ‘Package structure’ in the ‘Writing R
Extensions’ manual. (use .Rbuildignore for this)
✖ avoid 'T' and 'F', as they are just variables which are set to the logicals 'TRUE' and 'FALSE'
by default, but are not reserved words and hence can be overwritten by the user. Hence, one
should always use 'TRUE' and 'FALSE' for the logicals.
R/matrixFunctions.R:NA:NA
R/matrixFunctions.R:NA:NA
R/word2vec.R:NA:NA
R/word2vec.R:NA:NA
R/word2vec.R:NA:NA
... and 7 more lines
───────────────────
(Also devtools::spell_check()
found "interepreted" at nearest_to.Rd:38)
Reviewers: @juliasilge @jeroenooms @tedunderwood Due date: 2017-03-07
Welcome reviewers @juliasilge, @jeroenooms, and @tedunderwood! As we've update the review process a bit, so please take a look at the latest reviewing and packaging guides. We now also have a review template. @tedunderwood, Thanks for coming aboard - @lmullen recommended you specifically in regard to reviewing the method implementation. Of course you may comment on all topics as you see fit.
Hi Benjamin. Thank you for this submission, which seems very useful.
Unfortunately there are serious problems with the C code that need to be addressed before this can be published, either on ropensci or CRAN.
The package should pass without warnings or errors on CMD check. Currently you have the following problems:
* checking dependencies in R code ... NOTE
'library' or 'require' call to ‘stringi’ in package code.
Please use :: or requireNamespace() instead.
See section 'Suggested packages' in the 'Writing R Extensions' manual.
It literally tells you what to do. Do not use library
but instead call the function from the other package via pkg::fun()
.
distend: no visible global function definition for ‘new’
improve_vectorspace: no visible global function definition for ‘new’
improve_vectorspace: no visible global function definition for ‘prcomp’
project: no visible global function definition for ‘new’
square_magnitudes: no visible global function definition for ‘.hasSlot’
-,VectorSpaceModel-VectorSpaceModel: no visible global function
definition for ‘new’
[,VectorSpaceModel: no visible global function definition for
‘callNextMethod’
[,VectorSpaceModel: no visible global function definition for ‘new’
initialize,VectorSpaceModel: no visible global function definition for
‘callNextMethod’
plot,VectorSpaceModel: no visible global function definition for
‘predict’
plot,VectorSpaceModel: no visible global function definition for
‘prcomp’
plot,VectorSpaceModel: no visible global function definition for ‘text’
Undefined global functions or variables:
.hasSlot callNextMethod new prcomp predict text
Consider adding
importFrom("graphics", "text")
importFrom("methods", ".hasSlot", "callNextMethod", "new")
importFrom("stats", "prcomp", "predict")
to your NAMESPACE file (and ensure that your DESCRIPTION Imports field
contains 'methods').
You are missing a bunch of imports. Please make sure you @import
all the functions that you use from other packages. All these notes above should disappear.
Packages should build without warnings when compiling with -Wall -pedantic
. Compiler warnings are an indication that there are serious bugs in your code that can crash R. We don't want that. Compiling the package on OSX gives me the following warnings:
In file included from tmcn_distance.c:2:
./distance.h:25:12: warning: variable length array folded to constant array as an extension [-Wgnu-folding-constant]
char st1[max_size];
^~~~~~~~
./distance.h:26:17: warning: variable length array folded to constant array as an extension [-Wgnu-folding-constant]
char bestw[N][max_size];
^~~~~~~~
./distance.h:26:14: warning: variable length array folded to constant array as an extension [-Wgnu-folding-constant]
char bestw[N][max_size];
^
./distance.h:27:18: warning: variable length array folded to constant array as an extension [-Wgnu-folding-constant]
char file_name[max_size], st[100][max_size];
^~~~~~~~
./distance.h:27:37: warning: variable length array folded to constant array as an extension [-Wgnu-folding-constant]
char file_name[max_size], st[100][max_size];
^~~~~~~~
./distance.h:28:26: warning: variable length array folded to constant array as an extension [-Wgnu-folding-constant]
float dist, len, bestd[N], vec[max_size];
^
./distance.h:28:34: warning: variable length array folded to constant array as an extension [-Wgnu-folding-constant]
float dist, len, bestd[N], vec[max_size];
^~~~~~~~
./distance.h:40:35: warning: if statement has empty body [-Wempty-body]
if(fscanf(f, "%lld", &words)==1);
^
./distance.h:40:35: note: put the semicolon on a separate line to silence this warning
./distance.h:41:34: warning: if statement has empty body [-Wempty-body]
if(fscanf(f, "%lld", &size)==1);
^
./distance.h:41:34: note: put the semicolon on a separate line to silence this warning
./distance.h:42:19: warning: implicitly declaring library function 'malloc' with type 'void *(unsigned long)' [-Wimplicit-function-declaration]
vocab = (char *)malloc((long long)words * max_w * sizeof(char));
^
./distance.h:42:19: note: include the header <stdlib.h> or explicitly provide a declaration for 'malloc'
./distance.h:49:53: warning: if statement has empty body [-Wempty-body]
if(fscanf(f, "%s%c", &vocab[b * max_w], &ch)==1);
^
./distance.h:49:53: note: put the semicolon on a separate line to silence this warning
./distance.h:50:83: warning: if statement has empty body [-Wempty-body]
for (a = 0; a < size; a++) if(fread(&M[a + b * size], sizeof(float), 1, f)==1);
^
./distance.h:50:83: note: put the semicolon on a separate line to silence this warning
clang -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -Wall -pedantic -fPIC -Wall -mtune=core2 -g -O2 -c tmcn_word2vec.c -o tmcn_word2vec.o
12 warnings generated.
In file included from tmcn_word2vec.c:3:
./word2vec.h:321:51: warning: if statement has empty body [-Wempty-body]
if(fscanf(fin, "%lld%c", &vocab[a].cn, &c)==1);
^
CMD check also warns about the following:
File ‘wordVectors/libs/wordVectors.so’:
Found ‘___stdoutp’, possibly from ‘stdout’ (C)
Object: ‘tmcn_word2vec.o’
Found ‘_exit’, possibly from ‘exit’ (C)
Object: ‘tmcn_word2vec.o’
Found ‘_printf’, possibly from ‘printf’ (C)
Objects: ‘tmcn_distance.o’, ‘tmcn_word2vec.o’
Found ‘_puts’, possibly from ‘printf’ (C), ‘puts’ (C)
Objects: ‘tmcn_distance.o’, ‘tmcn_word2vec.o’
It means your C code is calling stdout()
, printf()
, exit()
, puts()
which are not allowed because again, they can crash R.
Instead of printf
you might be able to use Rprintf
. R code should never call exit()
because that will quit R. Instead your code should raise an error Rf_error
when something is wrong.
You are using the .C
interface to call your C code from R. This interface is very old and should not be used anymore for new packages. It is highly recommended to use the .Call
system instead, which allows you to do type checking and manipulate R objects on the C level.
You borrow examples from almost any other popular R package with C code (digest, jsonlite, etc)
You are overriding compiler settings for windows in your Makevars.win file. You should never do this unless you know what you are doing. Please delete the PKG_CFLAGS
line from your Makevars.win
file.
Thanks, @jeroenooms! @bmschmidt, it will be easier to get through the rest of the review with these issues fixed. Please let us know if we can help with addressing them.
I'd also suggest switching to warnings_are_errors: true
in your .travis.yml
for more thorough checks.
Thanks everyone for signing on here. It's a great set of reviewers, excited to hear from you all.
To address @noamross's questions I've removed warnings_are_errors, and changed a little bit in the testing to up testing coverage to 48%. It now fails under R 3.4 (or whatever devel
is on travis) with a new error about vignette documentation, but builds under the current version.
Thank you for these notes, @jeroenooms, they are very helpful. In reading them I realized I added some code after submitting here that adding a lot of cruft to the checks after submitting; my apologies for that.
I've fixed a lot of this in the latest commit, including compiler warnings, library calls, and improper usage of stdout and fprint.
There are two issues that I haven't fixed and would appreciate any advice on.
I haven't yet fixed this because it looks like a fair amount of work for me on the C side, since I don't really know what I'm doing there, without a payoff. I've read some about the advantages of .Call
over .C
, and they don't really seem to apply here. Everything being passed in the .C
calls are strings or single numbers substituting for command-line flags in the original code; it's never more than a couple dozen bytes, so there's no performance effect.
If I were more comfortable interfacing C and R code I would switch this right away. But to rewrite functions to use .Call
seems to requiring learning a bit about internals that I'm not interested in right now, and debugging is difficult because my whole installation crashes on a mistake. (Do I need to run inside a C debugger, or what?)
So I guess I'm asking: how serious an issue this is?
The Makevars.win override was a pull request from someone that may have fixed a lot of Windows errors I was hearing about. So while I don't know what I'm doing, it's possible that the contributor did. I will try removing it once I get some integration tests on Windows running using @noamross's suggestions above and see what happens.
Ugh. Testing out the vignettes I realized that this latest batch of changes to the C code seems to have created a memory leak that crashes R on anything other than toy data. So maybe I have to learn more C debugging than I wanted after all. Just an FYI before anyone tries to run this.
OK, that issue is now fixed, but updates on builds are no longer printed to the console. I've added an issue within the repo here; if anyone knows why Rprintf
would crash R through a memory leak where printf
didn't (the opposite of what's supposed to happen?) I'd welcome any comments there or here.
I am so happy to be reviewing this package! I have enjoyed playing around with its functionality.
> model %>% nearest_to(model[["darcy"]] - model[["pemberley"]] + model[["hartfield"]])
knightley hartfield elton martin weston harriet emma knightley's woodhouse emma's
0.4152193 0.4363070 0.4769714 0.4921517 0.5070818 0.5139842 0.5250399 0.5654097 0.5708246 0.5861962
> model %>% nearest_to(model[["darcy"]] - model[["pemberley"]] + model[["norland"]])
norland elinor devonshire marianne dashwood divide sussex unavoidable mother ferrars
0.3098889 0.5843901 0.5943843 0.6005002 0.6231055 0.6261682 0.6280522 0.6293118 0.6393082 0.6394483
> cosineSimilarity(model[[c("darcy","knightley","tilney"), average=FALSE]],
model[[c("elizabeth","emma","catherine"), average=FALSE]])
emma elizabeth catherine
darcy 0.02762390 0.67302184 0.1003745
knightley 0.69530175 -0.07843539 -0.1304497
tilney 0.07108335 0.12679012 0.637369
The package includes all the following forms of documentation:
URL
, Maintainer
and BugReports
fields in DESCRIPTION Estimated hours spent reviewing: 2
warnings_are_errors
set to FALSE
on Travis; that needs to be taken out.nearest_to
, it seems backwards to me that the vectors start at the lowest of the n = 10
closest and go up to the highest. I would have expected to see the actual nearest words first.URL
and BugReports
, and there needs to be contribution guidelines in the README or a separate file (you can use devtools::use_code_of_conduct()
if you'd like).\dontrun
. This is an important component of making a package robust. I am not familiar with effective strategies for testingcompiled code so someone else will need to chime in with help on that, but using testthat for the functions written in R is a good way to go. You could make small word vector objects to use in tests and examples, much smaller than the files you use for the README.Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
URL
, Maintainer
and BugReports
fields in DESCRIPTION Estimated hours spent reviewing: 2
I don't have much experience with R package development, so I haven't focused on unit tests or packaging guidelines in these comments. I think the other reviewers in this group will address those questions more effectively.
Instead I've looked at the package from the perspective of a user interested in text mining.
The need for the package is clear; the audience of scholars interested in word2vec is definitely larger than the audience comfortable with existing implementations. And I do think this implementation successfully streamlines the task of training, and applying, a word2vec model. In particular, the inclusion of a function that prepares a corpus for training is an important bridge over an otherwise tricky phase of this process.
Installation was straightforward, and I had no difficulty following instructions to prepare a corpus and train a model. I encountered no warnings or errors.
I did find that I had some unanswered questions. It’s not difficult to extract a vocabulary from the VectorSpaceModel object, but I found myself wondering whether this is an absolutely complete vocabulary for the corpus, or whether there is some minimum frequency threshold. If it’s the former, nothing needs to be added beyond a single brief sentence somewhere in the documentation. Users may also wonder about the frequency threshold for fusing common bigrams.
The vignette and README files were very informative, and guided me successfully through early stages of EDA on the VectorSpaceModel (e.g. the kmeans example is very useful). But I did find myself wishing that I also had a document like the .pdf package descriptions I’m accustomed to from CRAN, which simply list all the functions available in a given package. Of course, the github repo itself does this. I was able to read the vignette on github, but not actually able to invoke it from within RStudio.
In preparing my corpus, prep_word2vec() turned all linebreaks into NA. NA then also appeared as the second most common “word” in the vocabulary, and as a separate row in the matrix. This may be a standard aspect of corpus prep rather than a bug, but I think it’s possible that some users will be puzzled by it — so, even if it is expected behavior, some explanation should probably be provided.
(If it's not expected behavior, I could share the files to help reproduce it. I think these were just utf-8 txt files; I was running R 3.3.2 under Mac OS 10.12.3.)
Generally, though, this strikes me as a trouble-free, well-documented implementation of an important emerging method, likely to get wide usage.
Thanks so much for your comments, @tedunderwood and @juliasilge. Your time is muc appreciated.
I really hate to say this, but I believe, in reading @juliasilge's comments, that she/you must have tested the master
branch which is several commits behind dev
. Most of the problems are now fixed, including some greater coverage on the unit tests. I feel very bad about this, because it's obviously my fault for burying such an awkward request in the submission issue here. Sorry!
So, @juliasilge: I don't want to make you spend any more time, but given how much I appreciate your super-useful work with tidytext, as long as I've got you here I just want to flag for you the dev
-branch vignette specifically about exploratory analysis (http://htmlpreview.github.io/?https://github.com/bmschmidt/wordVectors/blob/dev/inst/doc/exploration.html) which uses Hadleyverse-like formula interfaces on text instead of double brackets for access: eg, vectorset %>% nearest_to(~"good"*3 + "bad)
. I don't know that it's super important, but it is a neat syntactic trick that seems to highlight some R-language advantages specifically over python. And you might be happy that the output from nearest_to
is now a data.frame, which means it could be easily integrated into a tidytext chain (eg, instead of using a pre-built sentiment dictionary, tag sentiment by polarity along the vector "good" - "bad"
). There's also a bigger introductory vignette is here which isn't very different from the old README.md.
I'm not sure what the next steps are: there are several issues above to address. Would it be helpful for me to pull what I see as the unresolved ones into a comment, @noamross?
I'm just going to post a partial list here and update it as time goes on. Hopefully that method won't spam anyone. I guess I could file these as issues on my repo, too.
From @jeroenooms
.C
vs .Call
.From @juliasilge
From @tedunderwood
min_count
documentation in train_word2vec
to explain Question 1.Thanks for your reviews, @tedunderwood and @juliasilge. Sorry that this process has been a bit more convoluted than usual. @jeroenooms was waiting on getting the C issues fixed to the point of passing R CMD Check before completing review and doing a checklist (Jeroen, perhaps you can comment on https://github.com/bmschmidt/wordVectors/issues/29). So @bmschmidt, I suggest you proceed fixing the issues listed above, and when you've addressed them merge to master
for second round comments.
@bmschmidt Any updates on the above?
@bmschmidt Just checking in on your progress here. Are you planning on revising, and do you have an expected timeline?
apply holding
label?
Closing this submission as it has been dormant for > 1 yr. Authors may re-submit if they would like to continue the process.
Summary
What does this package do? (explain in 50 words or less):
This package wraps Google's implementation in C for training word2vec models, and provides several R functions for exploratory data analysis of word2vec and other related models. These include import-export from binary, basic linear algebra missing from R, and a streamlined syntax for working with models and performing vector arithmetic.
Paste the full DESCRIPTION file inside a code block below:
https://github.com/bmschmidt/wordVectors/tree/dev
(Note that this is a dev branch of the repo I'm submitting: I've made a couple breaking changes to the API for review, but don't want to inconvenience current users if more are in the pipeline.)
People working with large textual collections who want to explore them with word2vec models they train themselves.
People doing NLP tasks who want to incorporate existing models (like Google's news vectors) into their pipeline.
Workshops, classrooms, and other situations where people want to promote exploration of word2vec or other models.
I built off of an existing package here on which development seems to be abandoned. My package has fewer bugs, and offers access to more of the underlying bindings. There's another package offering access to all the word2vec bindings here; it's maintainer has contributed patches to this project, and I think their functionality is roughly comparable on the core functions.
The big difference is that unlike either of these, my package offers import/export in the binary format, and wrappers for cosine similarity and the like, with syntactic sugar for word2vec models.
Requirements
Confirm each of the following by checking the box. This package:
Publication options
paper.md
with a high-level description in the package root or ininst/
.(how do I get a long-term repository for JOSS?)
Detail
R CMD check
(ordevtools::check()
) succeed? Paste and describe any errors or warnings:Two warnings:
Number one has only recently appeared: the instructions (
R CMD build --resave-data
) seem not to work.More seriously (?), some of the word2vec C code throws warnings because they use read/write functions. I don't know R's C bindings well enough to fix these; many people are currently using it in production.
Build is tested on Ubuntu and OS X. I've also heard reports of builds failing on certain Windows versions. There's been much less of this since a few recent patches and I believe it to be fixed, but I don't have access to a test suite across old Windows versions to be sure.
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
@lmullen has used the package and has contributed to you all a lot. He could perhaps suggest other uses.