Thanks for your submission @masalmon - We'll get back to you soon
Hi @masalmon, I'll be your handling editor and am currently seeking reviewers. FYI, this submission prompted discussion among the editors on two topics: whether packages that extract data from unstructured text via an external service fit our scope of data extraction (pdftools and tabulizer being examples), and the reliability of packages wrapping external APIs. We've added a new category to our policies document to explicitly include packages, API wrappers or not, in this category, without expanding to "any API wrapper".

Cool, thanks. Actually geoparser also extracts data from unstructured text & depends on a processing-as-a-service startup, although it's a geospatial thing too. I know I'm creating work for myself later with API packages. ropenaq also breaks when the API changes... I hope I'll be able to cope & that I'll be able to find replacements for the APIs which will disappear :smile_cat:
Reviewers: @hilaryparker @leeper Due date: 2016-06-15
@hilaryparker Due date: 2016-06-15 - hey there, it's been 21 days, please get your review in soon, thanks :smiley_cat: (ropensci-bot)
@hilaryparker Due date: 2016-06-15 - hey there, it's been 29 days, please get your review in soon, thanks :smiley_cat: (ropensci-bot)
@hilaryparker Due date: 2016-06-15 - hey there, it's been 45 days, please get your review in soon, thanks :smiley_cat: (ropensci-bot)
Additional reviewer: @leeper
This is a slim and easy-to-use package that works exactly as described. The documentation is generally complete and all examples in docs and README run as expected. I think it's really close to being ready for acceptance, but there are a couple of issues and I have some suggestions on usability.
I get an error when building vignettes from GitHub (`build_vignettes = TRUE`):
```
Quitting from lines 35-41 (monkeylearn_intro.Rmd)
Error: processing vignette 'monkeylearn_intro.Rmd' failed with diagnostics:
HTTP failure: 401
Invalid token header. No credentials provided.
Execution halted
```
This seems to be due to requiring an API key for examples to run. For CRAN release, the code may need to be pre-executed so the vignettes can build. This issue also emerges on `R CMD check` in a few places (examples may need to be wrapped in `\dontrun{}` or run conditionally with `if()` based on API key availability, etc.), for instance as sketched below.
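A minimal sketch of the conditional approach (the `MONKEYLEARN_KEY` environment variable name is my assumption about where the key lives):

```r
# run the example only when an API key is available in an env var
if (nzchar(Sys.getenv("MONKEYLEARN_KEY"))) {
  output <- monkeylearn_extract(request = "In 2015 Barack Obama visited Paris.",
                                extractor_id = "ex_isnnZRbS")
}
```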
I also get a NAMESPACE note in `R CMD check`:
```
* checking R code for possible problems ... NOTE
monkeylearn_check: no visible binding for global variable 'text'
monkeylearn_parse: no visible global function definition for 'is'
Undefined global functions or variables:
  is text
Consider adding
  importFrom("graphics", "text")
  importFrom("methods", "is")
to your NAMESPACE file (and ensure that your DESCRIPTION Imports field
contains 'methods').
```
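For the `is` part, you can either import it or namespace the call; for `text`, if it is actually a data variable used via non-standard evaluation rather than `graphics::text()`, `utils::globalVariables()` is the usual fix. A sketch of both (not the package's code; `is_response()` is an illustrative name):

```r
# declare `text` as a known NSE variable instead of importing graphics::text()
if (getRversion() >= "2.15.1") utils::globalVariables("text")

# call is() through its namespace so no importFrom() is needed
# (methods must still be listed in the DESCRIPTION Imports field)
is_response <- function(x) methods::is(x, "response")
```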
That is all pretty minor. My main comments concern the two workhorse functions, `monkeylearn_extract()` and `monkeylearn_classify()`. For both functions, the output is a list of two tbl_dfs. Only the first one, `"results"`, is really useful. The second, `"headers"`, is basically request metadata. I think it would be more useful to return a single tbl_df and leave the `"headers"` information as attributes to that, in the form of a list. This will improve efficiency, because the headers won't be run through both `as.data.frame()` and `tbl_df()`, and it will make the results easier to access.

The main use of the `headers` response seems to be to track the `x.query.limit.remaining` HTTP header. I would handle this in two ways, both different from current behavior:
- Report the remaining number of queries as a diagnostic only when `verbose = TRUE`.
- Have `monkeylearn_get_extractor()` and `monkeylearn_get_classify()` update the value after each request and then issue a warning when these values get low.

Relatedly, use `message()` rather than `print()` for diagnostics when `verbose = TRUE`; this occurs in several places.

The only major substantive suggestion I have for the package is to perhaps create some functions or variables to represent the classifier ID and extractor ID values for some common methods. Right now the user has to pass an opaque string such as `"ex_isnnZRbS"` as the value of `extractor_id`, ala `monkeylearn_extract(request = text, extractor_id = "ex_y7BPYzNG")`. If this could instead be something like `monkeylearn_extract(request = text, extractor_id = .entity)`, `monkeylearn_classify(request = text, classifier_id = .profanity)`, etc., it would greatly simplify the use of the most commonly needed methods.

One really small code style comment: on `if ()`, `while ()`, etc. statements, I think Hadley, Google, and most style guides recommend spaces around parentheses and braces, so that:
```r
while(class(output) == "try-error" && try_number < 6){
...
}
```
becomes
```r
while (class(output) == "try-error" && try_number < 6) {
...
}
```
Overall, really easy to use and really nicely done.
Thank you for the review, @leeper!
@masalmon, you may yet receive another review, but go ahead and proceed with changes in the meantime.
One suggestion regarding outputting `tbl_df()`s from your functions: per the update to tibble and this comment from Hadley, it may make sense to replace `tbl_df()` (as well as `as.data.frame()`) with `tibble::as_tibble()` to anticipate future changes in the tidyverse. This won't increase the package weight, as dplyr already depends on tibble. If you do so, specify `tibble (>= 1.1)` in `Imports:`.
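i.e., roughly (with toy data standing in for the package's results):

```r
results <- list(tag = c("PERSON", "LOCATION"), count = c(2L, 1L))
# instead of the current two-step conversion:
dplyr::tbl_df(as.data.frame(results, stringsAsFactors = FALSE))
# a single call that follows tibble's evolving API:
tibble::as_tibble(results)
```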
Many thanks!! 👍 I am on holidays this week so will implement the suggested changes next week. ☺
Hi @leeper, thanks again, this will be very useful!
I have questions about only 2 points:

I had chosen to output a list of data frames because I did it this way for ropenaq, but obviously I'm open to changes. :grinning:
@masalmon R objects have an "attributes" component that stores metadata such as dimensions and names, but can also hold arbitrary R objects as metadata. See this section in Advanced R, the R Manual section, `?attributes`, and `?attr`. So you can attach the headers to your output and access them with `attributes(output)$headers` or `attr(output, "headers")`. I believe @leeper is suggesting that since this header data isn't inherently tabular and consists of a few values, just leaving it as a list makes more sense than converting it to a data frame/tibble.
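For instance (a minimal sketch, not the package's code):

```r
x <- data.frame(keyword = c("flower", "ship"), count = c(3L, 1L))
# attach arbitrary metadata to the data.frame...
attr(x, "headers") <- list(server = "nginx/1.8.0",
                           x.query.limit.remaining = "49101")
# ...and read it back later
attr(x, "headers")$x.query.limit.remaining
#> [1] "49101"
```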
@noamross cool, thank you!
Okay, so I now see why it (maybe) made sense to have `headers` as a tibble: when the input is a vector of texts (and not a single text), `headers` is a table with one line per original text. It would be harder to keep this if `headers` were a list. I had the same strategy in geoparser.

I'd tend to keep this because I don't know any better solution for this case, and if I find one, I'd replace the current "text index" in the `headers` and `results` tables by an md5 checksum, as I currently do in geoparser.
Any thoughts?
Regarding "The only major substantive suggestion I have for the package is to perhaps create some functions or variables to represent the classifier ID and extractor ID values for some common methods.", I'm trying to get a table of all public modules with their ID and name, if Monkeylearn can send me one (I already know it cannot be obtained automatically). If they can't, I'll make a small table myself and use it.
Monkeylearn has actually started working on an API endpoint for finding public modules, so I'll wait for it to be ready before updating the package 😊
Sorry for being incommunicado on this. What I meant in terms of headers is, as @noamross describes, that I would have the functions return something like:
```r
structure(data.frame(...), headers = list(...))
```
rather than:
```r
structure(list(results = data.frame(...), headers = data.frame(...)))
```
It seems like the headers are secondary information in most cases, so this simplifies accessing the results. Whether results are a df or tibble...either way seems fine.
Sounds good on waiting to implement a list of modules!
No problem & thanks for your answer!
But in the case where the output data.frame is actually obtained from binding results from several calls to the API, would you expect the headers to be a list of lists?
How about having the headers as a df/tibble, but still as an attribute of the results?
That sounds like a good compromise!
Yup, that works. Then there's a consistent structure regardless of whether it's one call or many.
The function now outputs a data.frame with headers as a df attribute. :smile:
I am still not sure about the following part of my code in both functions:
```r
output <- try(monkeylearn_get_extractor(request_part, key, extractor_id),
              silent = TRUE)
# for the case when the server returns nothing
# try 5 times, not more
try_number <- 1
while (class(output) == "try-error" && try_number < 6) {
  message(paste0("Server returned nothing, trying again, try number ",
                 try_number))
  Sys.sleep(2^try_number)
  output <- try(monkeylearn_get_extractor(request_part, key, extractor_id),
                silent = TRUE)
  try_number <- try_number + 1
}

# check the output -- if it is 429 try again (throttle limit)
while (!monkeylearn_check(output)) {
  output <- monkeylearn_get_extractor(request_part, key, extractor_id)
}
```
So if the server returns nothing I retry up to 5 times, waiting longer each time. And if the response from the server has a 429 code, which indicates the throttle limit, I wait 60 seconds.
I feel this is quite clumsy, maybe I could make this better? Any suggestion would be welcome.
Note that I read https://cran.r-project.org/web/packages/httr/vignettes/api-packages.html "For many APIs, the common approach is to retry API calls that return something in the 500 range. However, when doing this, it’s extremely important to make sure to do this with some form of exponential backoff: if something’s wrong on the server-side, hammering the server with retries may make things worse, and may lead to you exhausting quota (or hitting other sorts of rate limits). A common policy is to retry up to 5 times, starting at 1s, and each time doubling and adding a small amount of jitter (plus or minus up to, say, 5% of the current wait time)."
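For reference, that policy might look something like this (a sketch only; `retry_with_backoff()` and its internals are illustrative, not my package's code):

```r
retry_with_backoff <- function(request_fun, max_tries = 5) {
  for (try_number in seq_len(max_tries)) {
    output <- try(request_fun(), silent = TRUE)
    if (!inherits(output, "try-error")) {
      return(output)
    }
    # exponential backoff starting at 1s, with +/- 5% jitter
    wait <- 2^(try_number - 1) * runif(1, 0.95, 1.05)
    message("Server returned nothing, waiting ", round(wait, 2), "s")
    Sys.sleep(wait)
  }
  stop("No response from the server after ", max_tries, " tries")
}

# usage: output <- retry_with_backoff(function()
#   monkeylearn_get_extractor(request_part, key, extractor_id))
```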
Now another question. @leeper, now that `headers` is part of the output as an attribute of the data.frame, do you still think reporting the `x.query.limit.remaining` HTTP header when `verbose = TRUE` makes sense? Now I feel that users themselves could implement something in their code based on `x.query.limit.limit` and `x.query.limit.remaining`, which I will make more explicit in the documentation.
I'd say write an example of how to check rate limiting and if it's really complicated, perhaps write it into the package. If it turns out to be simple, then just leave it as an example and/or note it in the README.
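e.g. something along these lines (assuming the `headers` attribute described above; the threshold and the `as.character()` step for the factor columns are mine):

```r
output <- monkeylearn_extract(request = "some text",
                              extractor_id = "ex_isnnZRbS")
headers <- attr(output, "headers")
remaining <- as.numeric(as.character(headers$x.query.limit.remaining[1]))
if (remaining < 1000) {
  warning("Only ", remaining, " MonkeyLearn queries remaining")
}
```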
I feel I can now answer the review comments. Thanks a lot, @leeper! And thanks @noamross for your contributions to the discussion!
Something I have done independently of this feedback was to add a small vignette where I do a short analysis of the book Little Women.
- The headers are now an attribute of the output of `monkeylearn_classify` and `monkeylearn_extract`, and in the README+vignette I explain how to get the remaining number of calls, so users can use them as they wish.
- I replaced `print` with `message`.
- I added `monkeylearn_classifiers`, which returns a data.frame of all classifiers, or only of private ones, with their IDs, names and other information such as the language, so the user can use this table to extract the IDs of the classifiers they want to use (see the sketch below).

Thanks again for the feedback, I'm really happy to be able to improve the package!
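A hypothetical use of `monkeylearn_classifiers` (the `private` argument and the column names are guesses based on the description above):

```r
classifiers <- monkeylearn_classifiers(private = FALSE)
# look up a classifier's ID by name, then pass it to monkeylearn_classify()
classifiers$classifier_id[classifiers$name == "Language Detection"]
```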
Nice. Looks good to me!
@leeper I had added you to the DESCRIPTION file without asking first, I'm sorry. Are you ok with being in the DESCRIPTION?
Sure thing.
I was running final checks and I'm sometimes getting an error when running line 94 of the gutenbergr vignette (it looks like the vignettes don't get tested on Travis?):
```r
> keywords <- monkeylearn_extract(request = little_women_chapters$text,
                                  extractor_id = "ex_y7BPYzNG",
                                  params = list(max_keywords = 3))
Error in match.names(clabs, names(xi)) :
  names do not match previous names
```
This appears to arise in `extractor.R#97`: `headers <- rbind(headers, output$headers)`. When I dive in using `options(error = recover)`, I find that these two objects have different names:
```
Browse[1]> str(headers)
'data.frame':	1 obs. of  11 variables:
 $ allow                        : Factor w/ 1 level "POST, OPTIONS": 1
 $ content.type                 : Factor w/ 1 level "application/json": 1
 $ date                         : Factor w/ 1 level "Tue, 02 Aug 2016 22:38:54 GMT": 1
 $ server                       : Factor w/ 1 level "nginx/1.8.0": 1
 $ vary                         : Factor w/ 1 level "Accept, Cookie": 1
 $ x.query.limit.limit          : Factor w/ 1 level "50000": 1
 $ x.query.limit.remaining      : Factor w/ 1 level "49101": 1
 $ x.query.limit.request.queries: Factor w/ 1 level "20": 1
 $ content.length               : Factor w/ 1 level "11274": 1
 $ connection                   : Factor w/ 1 level "keep-alive": 1
 $ text_md5                     :List of 1
  ..$ : chr "ff08da4717ca6fadcc5496aa51aad2ea" "0c368463a9e7135209a3ac1eaf3e6120" "1edee1e8747131a45212df1bc007001d" "b765601afb831ea7adb5a89b2d3f5803" ...
Browse[1]> str(output$headers)
'data.frame':	1 obs. of  11 variables:
 $ allow                        : Factor w/ 1 level "POST, OPTIONS": 1
 $ content.type                 : Factor w/ 1 level "application/json": 1
 $ date                         : Factor w/ 1 level "Tue, 02 Aug 2016 22:48:50 GMT": 1
 $ server                       : Factor w/ 1 level "nginx/1.8.0": 1
 $ vary                         : Factor w/ 1 level "Accept, Cookie": 1
 $ x.query.limit.limit          : Factor w/ 1 level "50000": 1
 $ x.query.limit.remaining      : Factor w/ 1 level "49081": 1
 $ x.query.limit.request.queries: Factor w/ 1 level "20": 1
 $ transfer.encoding            : Factor w/ 1 level "chunked": 1
 $ connection                   : Factor w/ 1 level "keep-alive": 1
 $ text_md5                     :List of 1
  ..$ : chr "144b8166ed539c3f712583460143d534" "c562796c5666bb759bbf97b9bc2963c9" "55b5bd0e6b4cd459e574db7e86b88407" "69c7ebf70662c78722ba1a7fccea1de2" ...
```
One has `transfer.encoding` and one has `content.length`, which carry different information. I don't know why these differ between calls. If they are supposed to, `dplyr::bind_rows()` can handle this and just put `NA`s for the missing values.
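For instance (a toy illustration, not the package's data):

```r
a <- data.frame(allow = "POST, OPTIONS", content.length = "11274",
                stringsAsFactors = FALSE)
b <- data.frame(allow = "POST, OPTIONS", transfer.encoding = "chunked",
                stringsAsFactors = FALSE)
# bind_rows() fills the non-matching columns with NA, where rbind() errors
dplyr::bind_rows(a, b)
#>           allow content.length transfer.encoding
#> 1 POST, OPTIONS          11274              <NA>
#> 2 POST, OPTIONS           <NA>           chunked
```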
Oh thanks a lot @noamross, fixing this. Actually vignettes do seem to get tested on Travis: https://travis-ci.org/masalmon/monkeylearn/builds/148549225
@feconroses is it normal for headers to sometimes contain `transfer.encoding` and sometimes `content.length`?

@masalmon in all honesty, these headers aren't really relevant from a MonkeyLearn point of view. `transfer.encoding` and `content.length` are generated automatically by nginx; they play no role in the text analysis made by MonkeyLearn or its process.
@feconroses ok thanks a lot!
@noamross has my commit solved the bug on your PC?
Yes, that has solved it, all checks now passing!
I hope you don't mind one more tweak at the end of a long review. We're starting to use Gabor Csardi's goodpractice package to check for antipatterns in code. Running `goodpractice::gp()` yielded two small suggestions; after these you should be good to go.
```
✖ avoid sapply(), it is not type safe. It might return a vector, or a list, depending on the input data.
  Consider using vapply() instead.

    R/utils.R:97:44
    R/utils.R:105:25
    R/utils.R:112:28

✖ avoid 1:length(...), 1:nrow(...), 1:ncol(...), 1:NROW(...) and 1:NCOL(...) expressions. They are error
  prone and result 1:0 if the expression on the right hand side is zero. Use seq_len() or seq_along()
  instead.

    R/classify.R:45:12
    R/extractor.R:70:12
```
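Both are easy to illustrate with toy examples (not the package's code):

```r
x <- list("a", "bb", "ccc")
sapply(x, nchar)              # a vector here, but a list for empty input
vapply(x, nchar, integer(1))  # always an integer vector; errors on type mismatch

df <- data.frame()
1:nrow(df)        # 1 0 -- loops backwards over an empty data.frame
seq_len(nrow(df)) # integer(0) -- the loop body is correctly skipped
```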
Oh nice I didn't know about this package, I'll soon make the corresponding corrections. Thank you!
@noamross now done! :-)
Thanks for your patience and help, @noamross, now I really got rid of `sapply`.
Accepted! Thank you @masalmon and @leeper for your review and follow-up. @masalmon, I believe you know the drill from here.
Awesome! Yes I do. Thanks again @leeper and @noamross !
@sckott could you please create the `monkeylearn` project in the ropenscilabs AppVeyor account and add me as a user? Thank you!
Thanks @masalmon, you are awesome and thanks for making this happen! @leeper, @noamross and @sckott thanks for your time and help :)
@masalmon the appveyor is set up
Wow this was quick, thanks a lot @sckott !
For info the package is on CRAN now. Thanks again for the review!
Well done!
This package provides access to the existing modules of the monkeylearn API, allowing for instance Named Entity Recognition, language detection, and the attribution of topics to texts.
https://github.com/masalmon/monkeylearn
It works with the monkeylearn API (http://monkeylearn.com/). Note that one can register using one's GitHub account.
Anyone who wants to use Natural Language Processing methods without, e.g., first installing Java, or someone who already uses monkeylearn and wants to integrate it in an R workflow (but only for text processing, not for module development).

Yes, the openNLP package provides several Natural Language Processing methods such as Named Entity Recognition, but it has a Java dependency that makes its installation difficult for some users.
Does the package install with the devtools install instructions? Does `devtools::check()` produce any errors or warnings? If so, paste them below.

It's not an explanation of any exception, but I wanted to underline that I'm not sure I've chosen the best solution for dealing with the throttle limit (I retry 5 times if there is a 429 API error, which is the throttle limit), so I'd very much appreciate feedback on this (and on anything else, obviously!).