monkeylearn: natural language processing with the monkeylearn API

maelle commented 8 years ago

1. What does this package do? (explain in 50 words or less)
  This package provides access to the existing modules of the monkeylearn API, allowing for instance Named Entity Recognition, language detection, attribution of topics to texts.
1. Paste the full DESCRIPTION file inside a code block below.

Package: monkeylearn
Type: Package
Title: Access to the Monkeylearn API for text classifiers and extractors
Version: 0.1.0
Authors@R: person("Maëlle", "Salmon", email = "maelle.salmon@yahoo.se", role = c("aut", "cre"))
Description: The package allows using some services of Monkeylearn which is
    a Machine Learning platform on the cloud that allows software companies and
    developers to easily extract actionable data from text.
License: GPL (>= 2)
LazyData: TRUE
URL: http://github.com/masalmon/monkeylearn
BugReports: http://github.com/masalmon/monkeylearn/issues
Encoding: UTF-8
RoxygenNote: 5.0.1
Suggests: testthat,
    knitr,
    rmarkdown
Imports: jsonlite, dplyr, httr
VignetteBuilder: knitr

1. URL for the package (the development repository, not a stylized html page)
  https://github.com/masalmon/monkeylearn
1. What data source(s) does it work with (if applicable)?
  It works with monkeylearn API http://monkeylearn.com/ Note that one can register using one's Github account.
1. Who is the target audience?
  Anyone that wants to use Natural Language Processing methods without e.g. first installing Java, or someone that already uses monkeylearn and wants to integrate it in a R workflow (but only for text processing, not for module development)
1. Are there other R packages that accomplish the same thing? If so, what is different about yours?
  Yes the openNLP package provides several Natural Language Processing methods such as Named Entity Recognition but has a Java dependency that makes its installation difficult for some users.
1. Check the box next to each policy below, confirming that you agree. These are mandatory.
  - [ x] This package does not violate the Terms of Service of any service it interacts with.
  - [ x] The repository has continuous integration with Travis CI and/or another service
  - [ x] The package contains a vignette
  - [x ] The package contains a reasonably complete README with devtools install instructions
  - [ x] The package contains unit tests
  - [ x] The package only exports functions to the NAMESPACE that are intended for end users
1. Do you agree to follow the rOpenSci packaging guidelines? These aren't mandatory, but we strongly suggest you follow them. If you disagree with anything, please explain.
  - [ ] Are there any package dependencies not on CRAN?
  - [ x] Do you intend for this package to go on CRAN?
  - [ x] Does the package have a CRAN accepted license?
  - [ ] Did devtools::check() produce any errors or warnings? If so paste them below.
1. Please add explanations below for any exceptions to the above:
  It's not an explanation to any exception but I wanted to underline I'm not sure I've chosen the best solution for dealing with the throttle limit (I retry 5 times if there is a 429 API error which is throttle limit), so I'd very much appreciate feedback on this (and on anything else, obviously!).
1. If this is a resubmission following rejection, please explain the change in circumstances.

sckott commented 8 years ago

Thanks for your submission @masalmon - We'll get back to you soon

noamross commented 8 years ago

Hi @masalmon, I'll be your handling editor and am currently seeking reviewers. FYI, this submission prompted discussion among the editors on two topics:

Fit: Previous API wrappers we've accepted had fit under accessing and depositing in databases, geospatial processing, and plotting services. This package fit none of those, but clearly fit in with a growing theme - extracting data from unstructured sources (pdftools and tabulizer being examples). We've added a new category to our policies document to explicitly include packages, API wrappers or not, in this category, without expanding to "any API wrapper".
API Stability: With so many processing-as-a-service startups in the market, its inevitable that APIs will go through periods of instability and maturation, and many may be short-lived, and thus package maintenance requirements may be high or packages may become obsolete. We came to no conclusion from this, but will be keeping an eye on API wrapper packages to see how our system copes with this churn.

maelle commented 8 years ago

Cool, thanks. Actually geoparser also extracts data from unstructured text & depends on a processing-as-a-service startup although it's a geospatial thing too.

I know I'm creating work for myself later with API packages. ropenaq also breaks when the API changes... I hope I'll be able to cope & that I'll be able to find replacements for the APIs which will disappear :smile_cat:

noamross commented 8 years ago

Reviewers: @hilaryparker @leeper Due date: 2016-06-15

sckott commented 8 years ago

@hilaryparker Due date: 2016-06-15 - hey there, it's been 21 days, please get your review in soon, thanks :smiley_cat: (ropensci-bot)

ropenscibot commented 8 years ago

@hilaryparker Due date: 2016-06-15 - hey there, it's been 29 days, please get your review in soon, thanks :smiley_cat: (ropensci-bot)

ropenscibot commented 8 years ago

@hilaryparker Due date: 2016-06-15 - hey there, it's been 45 days, please get your review in soon, thanks :smiley_cat: (ropensci-bot)

noamross commented 8 years ago

Additional reviewer: @leeper

leeper commented 8 years ago

This is a slim and easy-to-use package that works exactly as described. The documentation is generally complete and all examples in docs and README run as expected. I think it's really close to being ready for acceptance, but there are a couple of issues and I have some suggestions on usability.

I get an error when building vignettes from GitHub (build_vignettes = TRUE):
```
Quitting from lines 35-41 (monkeylearn_intro.Rmd) 
Error: processing vignette 'monkeylearn_intro.Rmd' failed with diagnostics:
HTTP failure: 401
Invalid token header. No credentials provided.
Execution halted
```
This seems to be due to requiring an API key for examples to run. For CRAN release, the code may need to be pre-executed so the vignettes can build. This issue also emerges on R CMD check in a few places:
1. examples, which should probably be wrapped in \dontrun{}
2. test suite, which probably needs to be run conditionally using whichever CRAN-identifying method is appropriate (version number, testthat's check, a simple if() based on API key availability, etc.).

I also get a NAMESPACE note in R CMD check:

* checking R code for possible problems ... NOTE
monkeylearn_check: no visible binding for global variable 'text'
monkeylearn_parse: no visible global function definition for 'is'
Undefined global functions or variables:
  is text
Consider adding
  importFrom("graphics", "text")
  importFrom("methods", "is")
to your NAMESPACE file (and ensure that your DESCRIPTION Imports field
contains 'methods').

That is all pretty minor.

The code itself also looks good and is very straightforward, seeing as there are only two functions: monkeylearn_extract() and monkeylearn_classify(). For both functions, the output is a list of two tbl_dfs. Only the first one "results" is really useful. The second, "headers", is basically request metadata. I think it would be more useful to return a single tbl_df and leave the "headers" information as attributes to that, in the form of a list. This will improve efficiency because the headers won't be run through both as.data.frame() and tbl_df() and will make the results easier to access.
The only purpose for the headers response seems to be to track the x.query.limit.remaining HTTP header. I would handle this in two ways, both different from current behavior:
1. Report this value when verbose = TRUE.
2. Store the value in a non-exported package environment and have internal functions monkeylearn_get_extractor() and monkeylearn_get_classify() update the value after each request and then issue a warning when these values get low.
The functions should use message() rather than print() for diagnostics when verbose = TRUE. This occurs in several places:
The only major substantive suggestion I have for the package is to perhaps create some functions or variables to represent the classifier ID and extractor ID values for some common methods. For example, the README and examples require the user to input something like "ex_isnnZRbS" as the value of extractor_id ala monkeylearn_extract(request = text, extractor_id = "ex_y7BPYzNG"). If this could instead be something like monkeylearn_extract(request = text, extractor_id = .entity), monkeylearn_classify(request = text, classifier_id = .profanity), etc. it would greatly simplify the use of the most commonly needed methods.
One really small code style comment: on if (), while (), etc. statements, I think Hadley, Google, and most style guides recommend spacing between brackets, so that:
```
while(class(output) == "try-error" && try_number < 6){
...
}
```
becomes
```
while (class(output) == "try-error" && try_number < 6) {
...
}
```
It might be useful to note in the README that MonkeyLearn supports registration through GitHub, which makes the registration process really easy.
It might make the README a little flashier if it had some graphs. That's totally not necessary at all, but it seems like that captures peoples' attention better than pure text.

Overall, really easy to use and really nicely done.

noamross commented 8 years ago

Thank your for the review, @leeper!

@masalmon, you may yet receive another review, but go ahead and proceed with changes in the meantime.

noamross commented 8 years ago

One suggestion regarding outputting tbl_df()'s from your functions. Per the update to tibble and this comment from Hadley, it may make sense to replace tbl_df() (as well as as.data.frame()) with tibble::as_tibble() to anticipate future changes in the tidyverse. This won't increase the package weight as dplyr already depends on tibble. If you do so, specify tibble (>= 1.1) in Imports:.

maelle commented 8 years ago

Many thanks!! 👍 I am on holidays this week so will implement the suggested changes next week. ☺

maelle commented 8 years ago

Hi @leeper, thanks again, this will be very useful!

I have only questions about 2 points:

I'm not sure I understand what you mean by "I think it would be more useful to return a single tbl_df and leave the "headers" information as attributes to that, in the form of a list"? Do you mean adding the information from request as columns to the results data frame instead of having them separately? So that there would be a column "x.query.limit.request.queries" with the same value for the whole data frame? But you say "list", so I'm pretty sure I'm misunderstanding your suggestion.
And does the following point mean you think that the remaining number of queries should not be part of the output, or that there should be an additional warning?

I had chosen to output a list of data frames because I did it this way for ropenaq but obviously I'm open to changes. :grinning:

noamross commented 8 years ago

@masalmon R objects have an "attributes" component that stores metadata such as dimensions and names, but also can hold arbitrary R objects as metadata. See this section in Advanced R, the R Manual section, ?attributes and ?attr. So you can attach the headers to your output and access it with attributes(output)$headers or attr(output, "headers"). I believe @leeper is suggesting that since this header data isn't inherently tabular and consists of a few values, just leaving it as a list makes more sense than converting it to a data frame/tibble.

maelle commented 8 years ago

@noamross cool, thank you!

maelle commented 8 years ago

Okay so I now see why it (maybe) made sense to have headers as a tibble: when the input is a vector of texts (and not a single text), the headers is a table with one line per original text. It would be harder to have this if headers were a list. I had the same strategy in geoparser.

I'd tend to keep this because I don't know any better solution for this case, and if I do, I'd replace the current "text index" in the headers and results tables by a md5 checksum as I currently do in geoparser.

Any thought?

maelle commented 8 years ago

Reg. "The only major substantive suggestion I have for the package is to perhaps create some functions or variables to represent the classifier ID and extractor ID values for some common methods.", I'm trying to get a table of all public modules with their ID and name, if Monkeylearn can send me one (I already know it cannot be obtained automatically). If they can't, I'll do a small table myself and use it.

maelle commented 8 years ago

Monkeylearn has actually started working on an API endpoint for finding public modules so I'll wait for its being ready for updating the package 😊

leeper commented 8 years ago

Sorry for the being incommunicado on this. What I meant in terms of headers is - as @noamross describes - that I would have the functions return something like:

structure(data.frame(...), headers = list(...))

rather than:

structure(list(results = data.frame(...), headers = data.frame(...)))

It seems like the headers are secondary information in most cases, so this simplifies accessing the results. Whether results are a df or tibble...either way seems fine.

Sounds good on waiting to implement a list of modules!

maelle commented 8 years ago

No problem & thanks for your answer!

But in the case where the output data.frame is actually obtained from binding results from several calls to the API, would you expect the headers to be a list of list?

noamross commented 8 years ago

How about having the headers as a df/tibble, but still as an attribute of the results?

On Tue, Jul 26, 2016 at 9:21 AM Maëlle Salmon notifications@github.com wrote:

No problem & thanks for your answer!

But in the case where the output data.frame is actually obtained from binding results from several calls to the API, would you expect the headers to be a list of list?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ropensci/onboarding/issues/45#issuecomment-235265014, or mute the thread https://github.com/notifications/unsubscribe-auth/AAi5aKX0c6dzab3sdJrqWK2YHQZkU-gCks5qZgnLgaJpZM4IiCUj .

maelle commented 8 years ago

That sounds like a good compromise!

leeper commented 8 years ago

Yup, that works. Then there's a consistent structure regardless of whether it's one call or many.

maelle commented 8 years ago

The function now outputs a data.frame with headers as a df attribute. :smile:

I am still not sure about the following part of my code in both functions:

    output <- tryCatch(monkeylearn_get_extractor(request_part, key, extractor_id))
    # for the case when the server returns nothing
    # try 5 times, not more
    try_number <- 1
    while(class(output) == "try-error" && try_number < 6) {
      message(paste0("Server returned nothing, trying again, try number", i))
      Sys.sleep(2^try_number)
      output <- tryCatch(monkeylearn_get_extractor(request_part, key, extractor_id))
      try_number <- try_number + 1
    }

    # check the output -- if it is 429 try again (throttle limit)
    while(!monkeylearn_check(output)) {
      output <- monkeylearn_get_extractor(request_part, key, extractor_id)
    }

So if the server returns nothing I re-try up to 5 times with each time more waiting time. And if the response from the server has a 429 code which indicates throttle limit, I wait 60 seconds.

I feel this is quite clumsy, maybe I could make this better? Any suggestion would be welcome.

Note that I read https://cran.r-project.org/web/packages/httr/vignettes/api-packages.html "For many APIs, the common approach is to retry API calls that return something in the 500 range. However, when doing this, it’s extremely important to make sure to do this with some form of exponential backoff: if something’s wrong on the server-side, hammering the server with retries may make things worse, and may lead to you exhausting quota (or hitting other sorts of rate limits). A common policy is to retry up to 5 times, starting at 1s, and each time doubling and adding a small amount of jitter (plus or minus up to, say, 5% of the current wait time)."

Now another question. @leeper now that headers is a part of the output as an attribute of the data.frame, do you still think reporting x.query.limit.remaining HTTP when verbose=TRUE makes sense? Now I feel that the users themselves could implement something in their code based on x.query.limit.limitand x.query.limit.remaining, which I will make more explicit in the documentation.

leeper commented 8 years ago

I'd say write an example of how to check rate limiting and if it's really complicated, perhaps write it into the package. If it turns out to be simple, then just leave it as an example and/or note it in the README.

maelle commented 8 years ago

I feel I can now answer the review comments. Thanks a lot, @leeper! And thanks @noamross for your contributions to the discussion!

Something I have done independently of this feedback was to add a small vignette where I do a small analysis of the book Little Women.

Great point! I have added a chunk at the beginning of both vignettes now.
Thanks, now there is no NOTE.
Now the output is a data.frame with a data.frame as attributes for representing the headers.
In the documentation of monkeylearn_classify and monkeylearn_extract and in the README+vignette I explain how to get the remaining number of calls, so the user could use them as they wish.
Thanks, I replaced print with message.
Thanks, this is a great suggestion. This is what I have done:
- For classifiers, when I asked again about a list of all modules, Monkeylearn was kind enough to add an endpoint, which they have not done for extractors because there are less of them. I added the function monkeylearn_classifiers which returns a data.frame of all classifiers, or only of private ones, with their IDs, names and other information such as the language. So the user could use this table to extract the IDs of the classifiers they want to use.
- For extractors, in the README and vignette I only mention that one can go to the website to browse existing extractors and then save their ID as an environment variable. The reason why I did not add a data.frame of some extractors myself as a .RData is that I would have to choose which characteristics of the extractors to write in this table and for some applications there are different extractors (e.g. sentiment analysis trained on Tweets vs on hotel reviews) and my list would soon be out of date (which by the way is bad for the README and vignette too)... and also I not-so-secretly hope there'll be an endpoint for getting all of them, which might happen when one can customize extractors. Is it ok if I simply wait for a future version of the API before making the extractors IDs easier to find?
I added the missing spaces.
I added the sentence "Note that MonkeyLearn supports registration through GitHub, which makes the registration process really easy.".
As soon as I find a really good idea, I'll add that (probably as a advertisement for a supplementary vignette about one use case)! My current idea is to maybe use the R interface to the Urban Dictionary's API, get random definitions, and look for profanity and abuse detection, before plotting how many of the definitions have a high score. I am open to any cooler suggestion! In the meantime, is it ok I don't have any graph in the README?

Thanks again for the feedback, I'm really happy to be able to improve the package!

leeper commented 8 years ago

Nice. Looks good to me!

maelle commented 8 years ago

@leeper I had added you to the DESCRIPTION file without asking first, I'm sorry. Are you ok with being in the DESCRIPTION?

leeper commented 8 years ago

Sure thing.

noamross commented 8 years ago

I was running final checks and I'm getting an error sometimes when running line 94 of the guternbergr vignette. It looks like the vignettes don't get tested on Travis?:

> keywords <- monkeylearn_extract(request = little_women_chapters$text,
                                 extractor_id = "ex_y7BPYzNG",
                                params = list(max_keywords = 3))
Error in match.names(clabs, names(xi)) : 
  names do not match previous names

This appears to arise in extractor.R#97: headers <- rbind(headers, output$headers). When I dive in using options(error=recover) I find that these two objects have different names:

Browse[1]> str(headers)
'data.frame':   1 obs. of  11 variables:
 $ allow                        : Factor w/ 1 level "POST, OPTIONS": 1
 $ content.type                 : Factor w/ 1 level "application/json": 1
 $ date                         : Factor w/ 1 level "Tue, 02 Aug 2016 22:38:54 GMT": 1
 $ server                       : Factor w/ 1 level "nginx/1.8.0": 1
 $ vary                         : Factor w/ 1 level "Accept, Cookie": 1
 $ x.query.limit.limit          : Factor w/ 1 level "50000": 1
 $ x.query.limit.remaining      : Factor w/ 1 level "49101": 1
 $ x.query.limit.request.queries: Factor w/ 1 level "20": 1
 $ content.length               : Factor w/ 1 level "11274": 1
 $ connection                   : Factor w/ 1 level "keep-alive": 1
 $ text_md5                     :List of 1
  ..$ : chr  "ff08da4717ca6fadcc5496aa51aad2ea" "0c368463a9e7135209a3ac1eaf3e6120" "1edee1e8747131a45212df1bc007001d" "b765601afb831ea7adb5a89b2d3f5803" ...
Browse[1]> str(output$headers)
'data.frame':   1 obs. of  11 variables:
 $ allow                        : Factor w/ 1 level "POST, OPTIONS": 1
 $ content.type                 : Factor w/ 1 level "application/json": 1
 $ date                         : Factor w/ 1 level "Tue, 02 Aug 2016 22:48:50 GMT": 1
 $ server                       : Factor w/ 1 level "nginx/1.8.0": 1
 $ vary                         : Factor w/ 1 level "Accept, Cookie": 1
 $ x.query.limit.limit          : Factor w/ 1 level "50000": 1
 $ x.query.limit.remaining      : Factor w/ 1 level "49081": 1
 $ x.query.limit.request.queries: Factor w/ 1 level "20": 1
 $ transfer.encoding            : Factor w/ 1 level "chunked": 1
 $ connection                   : Factor w/ 1 level "keep-alive": 1
 $ text_md5                     :List of 1
  ..$ : chr  "144b8166ed539c3f712583460143d534" "c562796c5666bb759bbf97b9bc2963c9" "55b5bd0e6b4cd459e574db7e86b88407" "69c7ebf70662c78722ba1a7fccea1de2" ...

One has transfer.encoding and one has content.length, which have different information. I don't know why these are different in different calls. If they should be, dplyr::bind_rows() can handle this and just put NAs for missing values.

maelle commented 8 years ago

Oh thanks a lot @noamross , fixing this. Actually vignettes seem to get tested on Travis: https://travis-ci.org/masalmon/monkeylearn/builds/148549225

@feconroses is it normal for headers to sometimes contain transfer.encoding and sometimes content.length?

feconroses commented 8 years ago

@masalmon in all honesty, these headers aren't really relevant from a MonkeyLearn point of view. These headers (transfer.encoding and content.length) are generated automatically by nginx, they aren't really relevant for the text analysis made by MonkeyLearn or its process.

maelle commented 8 years ago

@feconroses ok thanks a lot!

@noamross has my commit solved the bug on your PC?

noamross commented 8 years ago

Yes, that has solved it, all checks now passing!

I hope you don't mind one more tweak at the end of a long review:. We're starting to use Gabor Csardi's goodpractice package to check for antipatterns in code. Running goodpractice::gp() yielded two small suggestions. After these you should be good to go.

  ✖ avoid sapply(), it is not type safe. It might return a vector, or a list, depending on the input data.
    Consider using vapply() instead.

    R/utils.R:97:44
    R/utils.R:105:25
    R/utils.R:112:28

  ✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) expressions. They are error
    prone and result 1:0 if the expression on the right hand side is zero. Use seq_len() or seq_along()
    instead.

    R/classify.R:45:12
    R/extractor.R:70:12

maelle commented 8 years ago

Oh nice I didn't know about this package, I'll soon make the corresponding corrections. Thank you!

maelle commented 8 years ago

@noamross now done! :-)

maelle commented 8 years ago

Thanks for your patience and help, @noamross, now I really got rid of sapply.

noamross commented 8 years ago

Accepted! Thank you @masalmon and @leeper for your review and follow-up. @masalmon, I believe you know the drill from here.

maelle commented 8 years ago

Awesome! Yes I do. Thanks again @leeper and @noamross !

maelle commented 8 years ago

@sckott could you please create the monkeylearn project in the ropenscilabs Appveyor account and add me as an user? thank you!

feconroses commented 8 years ago

Thanks @masalmon, you are awesome and thanks for making this happen! @leeper, @noamross and @sckott thanks for your time and help :)

sckott commented 8 years ago

@masalmon the appveyor is set up

maelle commented 8 years ago

Wow this was quick, thanks a lot @sckott !

maelle commented 8 years ago

For info the package is on CRAN now. Thanks again for the review!

mdsumner commented 8 years ago

Well done!

On Sun, 11 Sep 2016, 17:35 Maëlle Salmon notifications@github.com wrote:

For info the package is on CRAN now. Thanks again for the review!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ropensci/onboarding/issues/45#issuecomment-246166788, or mute the thread https://github.com/notifications/unsubscribe-auth/AD6tb4yotFZ8lV3DRWPM7KZQpsHF4H5Pks5qo69CgaJpZM4IiCUj .

Dr. Michael Sumner Software and Database Engineer Australian Antarctic Division 203 Channel Highway Kingston Tasmania 7050 Australia

ropensci / software-review

monkeylearn: natural language processing with the monkeylearn API #45