Closed petermeissner closed 8 years ago
Thanks for the submission :rocket: - seeking reviewers
Reviewers: @richfitz @Ironholds
As you may notice, most of these quibbles are, well, quibbles. On the whole the package is excellent :). I'll try to put time into a more detailed review later today or early tomorrow.
Thanks for the ropensci infrastructure and thanks for the feedback - much-much-much appreciated.
1) Unless you own gmai.com there's a typo in your DESCRIPTION file ;p.
2) Do you need the alternate df implementations? They're fun but also a lot of extra code to replace stringsAsFactors = FALSE ;P.
3) The assignment here seems extraneous.
4) "did apply" to "applied" here
5) path_allowed() will respond to unforeseen permission constellations with a warning and return all data that led to the problem - returning FALSE for each path? I think that's totally fine behaviour, but I'd caution you, in that scenario, to use more formal/explanatory description text, noting that problematic data will be returned.
(On the subject of formality, working on a more formal review now)
On the whole: solid work! Some thoughts:
You have functions, such as rt_get_fields(), which are documented but not exported. What this means in practice is that the code can't be found or used, but the documentation does appear within the package index, which is pretty confusing. I'd suggest adding the "internal" roxygen2 keyword to the docs for those functions so that you don't index them (a sketch follows the list below).

1) drop unexported functions from the manual
2) vignette title / index entry
3) The vignette opens "The package ...
4) The vignette could really benefit from a brief explanation of what robots.txt files are and why they matter ...
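For the "internal" keyword suggestion above, a minimal sketch of the roxygen2 change (the title line is illustrative and the body is elided):

#' Extract fields from a parsed robots.txt (illustrative title)
#'
#' @keywords internal
rt_get_fields <- function(txt) {
  ## ... existing implementation unchanged ...
  NULL
}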
path_allowed() will now warn more informatively, describing what happened:
robotstxt::path_allowed() says: Encountered problems while trying to determine bot permissions, returning NA instead of TRUE/FALSE
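A sketch of how that could look inside the function (is_ambiguous() is a hypothetical stand-in for the package's internal consistency checks):

path_allowed <- function(permissions, paths = "/", bot = "*") {
  if (is_ambiguous(permissions, paths, bot)) {
    warning(
      "robotstxt::path_allowed() says: Encountered problems while trying ",
      "to determine bot permissions, returning NA instead of TRUE/FALSE"
    )
    return(NA)
  }
  ## ... regular TRUE/FALSE permission checking ...
}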
Dear Oliver (@Ironholds ),
thank you very much for taking the time to review my package and comment on it. I think the package gained a lot from your suggestions, and I learned a lot myself as well. I really appreciate the effort. I would like to add you as a contributor within DESCRIPTION and README, if that's OK with you - what is the rOpenSci policy on that matter?
I feel like I should say "the package looks really good now, sorry for not mentioning it earlier, this week has been nuts" before accepting or rejecting ;p.
I'm sorry, I have left this very late so I will just jot down the thoughts I have rather than a full review, in the hope that it is useful.
My biggest comment is that I don't see what R6 is adding here; it seems that the return value could simply be a list (which contains one function).
Don't get me wrong -- I love R6. But in cases where reference semantics aren't needed or being used it seems an unnecessary weirdness for most users (and in my experience users do find them weird). See the code below for a pure-list implementation of your function (purposefully left very similar) to show that there is no real difference in implementation. The only difference is that initialisation looks less weird (x <- robotstxt() rather than x <- robotstxt$new()).
robotstxt <- function(domain, text) {
  ## check input
  self <- list()
  if (missing(domain)) {
    self$domain <- NA
  }
  if (!missing(text)) {
    self$text <- text
    if (!missing(domain)) {
      self$domain <- domain
    }
  } else {
    if (!missing(domain)) {
      self$domain <- domain
      self$text <- get_robotstxt(domain)
    } else {
      stop("robotstxt: I need text to initialize.")
    }
  }
  ## fill fields with default data
  tmp <- parse_robotstxt(self$text)
  self$bots <- tmp$useragents
  self$comments <- tmp$comments
  self$permissions <- tmp$permissions
  self$crawl_delay <- tmp$crawl_delay
  self$host <- tmp$host
  self$sitemap <- tmp$sitemap
  self$other <- tmp$other
  self$check <- function(paths = "/", bot = "*", permission = self$permissions) {
    paths_allowed(permissions = permission, paths = paths, bot = bot)
  }
  class(self) <- "robotstxt"
  self
}
(This passes checks in the package by deleting only the $new in the test file.)
I really like the idea of downloading the file once and testing repeatedly. I wonder if that could be extended with the assumption that the file does not change during an R session. Then one could memoise the calls and avoid having to have any sort of object for the users to worry about.
cache <- new.env(parent = emptyenv())

path_allowed <- function(url) {
  x <- httr::parse_url(url)
  obj <- cache[[x$hostname]]
  if (is.null(obj)) {
    ## download and parse once per hostname, reuse for the rest of the session
    cache[[x$hostname]] <- obj <- robotstxt(x$hostname)
  }
  obj$check(x$path)
}
Avoid httr::content(rtxt) in package code (see ?httr::content) in favour of checking that the return type was correct. In development versions of httr I find I also have to declare encoding = "UTF-8" to avoid a message on every request, too.
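A sketch of the suggested check (variable names illustrative):

rtxt <- httr::GET(url)
if (!grepl("text/", httr::http_type(rtxt), fixed = TRUE)) {
  warning("get_robotstxt: unexpected content type: ", httr::http_type(rtxt))
}
txt <- httr::content(rtxt, as = "text", encoding = "UTF-8")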
In get_robotstxt(), is warning() the appropriate behaviour? I would have thought that for 403 and 5xx errors it should be an error. For 404 errors could we not assume that there *is* no robots.txt and therefore it's all fair game?
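Put together, the behaviour suggested here might look like this (a sketch, not the package's actual code):

code <- httr::status_code(rtxt)
if (code == 404) {
  txt <- ""    # no robots.txt file: assume everything is allowed
} else if (code >= 400) {
  stop("get_robotstxt: HTTP error ", code)   # 403 / 5xx: fail loudly
} else {
  txt <- httr::content(rtxt, as = "text", encoding = "UTF-8")
}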
I find that missing() lends itself to hard-to-diagnose errors when functions are several layers deep; consider explicit NULL arguments and is.null() tests for missing-ness (purely style).
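A before/after sketch of that style point, using the constructor from above:

## missing(): hard to forward arguments through wrapper functions
robotstxt <- function(domain, text) {
  if (missing(domain)) domain <- NA
  ## ...
}

## explicit NULL defaults: wrappers can always pass both arguments
robotstxt <- function(domain = NULL, text = NULL) {
  if (is.null(domain)) domain <- NA
  ## ...
}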
You have a partial match of $useragent to $useragents in R/permissions.R and throughout tests/testthat/test_parser.R.
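To see why this is worth fixing (R silently partial-matches list names with $):

x <- list(useragents = c("*", "googlebot"))
x$useragent                             # silently returns x$useragents
options(warnPartialMatchDollar = TRUE)
x$useragent                             # now at least emits a warning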
Phew, some thoughts, aye? This all sounds very interesting and helpful, in the sense that I really have to think about it. Now, this will take me a while. Thank you very much - most appreciated!
@richfitz
I am working on implementing all of your suggestions. But I really would like to understand better what happens in https://github.com/petermeissner/robotstxt/issues/6 .
Do you maybe have a reference or a catchphrase for me that I can look up? What I specifically have a hard time wrapping my head around is how the function within the list knows where to find the data.
example
# function definition
robotstxt_dev <- function(domain, text) {
  ## check input
  self <- list()
  if (missing(domain)) {
    self$domain <- NA
  }
  if (!missing(text)) {
    self$text <- text
    if (!missing(domain)) {
      self$domain <- domain
    }
  } else {
    if (!missing(domain)) {
      self$domain <- domain
      self$text <- get_robotstxt(domain)
    } else {
      stop("robotstxt: I need text to initialize.")
    }
  }
  ## fill fields with default data
  tmp <- parse_robotstxt(self$text)
  self$bots <- tmp$useragents
  self$comments <- tmp$comments
  self$permissions <- tmp$permissions
  self$crawl_delay <- tmp$crawl_delay
  self$host <- tmp$host
  self$sitemap <- tmp$sitemap
  self$other <- tmp$other
  self$check <- function(paths = "/", bot = "*", permission = self$permissions) {
    paths_allowed(permissions = permission, paths = paths, bot = bot)
  }
  class(self) <- "robotstxt"
  self
}
# example
library(robotstxt)
blah <- robotstxt_dev("pmeissner.com")$check
blah("_layouts")
At the moment, that feels like pure black magic.
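For anyone else puzzling over this: check is a closure - it keeps a reference to the environment of the robotstxt_dev() call that created it, so self is still in scope when blah() runs later. A minimal, generic illustration of the mechanism (nothing robots.txt-specific):

make_counter <- function() {
  count <- 0
  function() {            # the returned function is a closure:
    count <<- count + 1   # it still sees `count` from make_counter()'s call
    count
  }
}
counter <- make_counter()
counter()  # 1
counter()  # 2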
@richfitz
I have now implemented all suggestions.
Thank you very, very much for taking the time to go through my code. The package gained a lot from that, I think, and I learned a lot myself. I really appreciate the effort. Thanks.
Just some further thoughts ...
In regard to discarding R6 because reference semantics are not used: I have done some research - ah, so that's what it is.

On robotstxt$new() instead of robotstxt(): I do not think that the first is more confusing than the latter. In the first case the user is explicitly told that he will use something that might be unusual. In the second case it is more like giving the user something that looks like something he understands, but then letting it do black magic (the black magic is called a closure and is very well explained in Wickham 2015: "Advanced R", p. 175ff, in particular p. 186).

Hey everyone (@sckott), so what is the next step? All suggestions have been gratefully accepted and incorporated. Is there another thumbs-up needed from @richfitz, or something else I should/can do?
Seems ready to merge in. I do see:
✔ checking for missing documentation entries
W checking for code/documentation mismatches
Codoc mismatches from documentation object 'robotstxt':
robotstxt
Code: function(domain = NULL, text = NULL)
Docs: function(domain = "mydomain.com")
Argument names in code not in docs:
text
Mismatches in argument default values:
Name: 'domain' Code: NULL Docs: "mydomain.com"
robotstxt
Code: function(domain = NULL, text = NULL)
Docs: function(text = "User-agent: *\nDisallow: /")
Argument names in code not in docs:
domain
Mismatches in argument names:
Position: 1 Code: domain Docs: text
Mismatches in argument default values:
Name: 'text' Code: NULL Docs: "User-agent: *\nDisallow: /"
W checking Rd \usage sections
Undocumented arguments in documentation object 'get_robotstxt'
‘warn’
Undocumented arguments in documentation object 'robotstxt'
‘domain’ ‘text’
can these be fixed first?
Sorry, I missed that. @sckott thanks for checking - it's all fixed.
All checks now pass locally (Ubuntu) without errors, warnings, or notes; the same is true for the winbuilder checks.
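For reference, the codoc fix amounts to documenting both arguments in a single roxygen2 block whose defaults match the code; a minimal sketch (param descriptions illustrative, signature taken from the check output above):

#' robotstxt object constructor
#'
#' @param domain domain for which to download and parse the robots.txt file
#' @param text   robots.txt file content, if it is already available
#' @export
robotstxt <- function(domain = NULL, text = NULL) {
  ## ... existing implementation unchanged ...
}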
Great! Thanks. Approved.
I added you to a team with admin access, and now you should be able to transfer the repo to ropenscilabs - we move pkgs to the ropensci GitHub org once they are more mature - your pkg is still a fully rOpenSci pkg even though it is in ropenscilabs.
Coooool.
Transfer repo? It is done via Settings > Danger Zone > Transfer ownership: "ropenscilabs" - right? How is pushing to CRAN handled thereafter? I would like to push the current thing to CRAN.
It is done via Settings > Danger Zone > Transfer ownership : "ropenscilabs"
yes
How is pushing to CRAN handled thereafter? I would like to push the current thing to CRAN.
same as before - you are still the maintainer of your pkg and can submit to CRAN as you like. You can push a new version to CRAN before transferring if you like
Transferred to ropenscilabs - everything works fine.
awesome, thanks for your work and contribution
Nah..., I am the one indebted for getting all that great feedback - thanks for letting me join the club.
Web scraping allows one to gather information of scientific value - mainly social-science related, in my experience. While scraping web pages one should respect the permissions declared in robots.txt files. The package provides functions to retrieve and parse robots.txt files. The core functionality is to check a bot's/user's permissions for one or more resources (paths) on a given domain. To ease checking, all functions have been bundled with the relevant data into an R6 robotstxt class, but everything works functionally or object-oriented, depending on the user's preferences.

https://github.com/petermeissner/robotstxt
The package parses robots.txt files like:
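(an illustrative example, not taken from any particular site)

User-agent: *
Disallow: /private/
Crawl-delay: 10
Sitemap: http://example.com/sitemap.xml

and checking permissions against such a file might look like this (a sketch using the constructor and check method described above):

library(robotstxt)
rt <- robotstxt(text = "User-agent: *\nDisallow: /private/")
rt$check(paths = c("/", "/private/index.html"), bot = "*")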
Package developers and users who want an easy way to be nice while gathering data from the web.
None that I know of.
Yes, good guidelines!
No.
With or without ropensci.
yes, MIT
Does devtools::check() produce any errors or warnings? If so, paste them below. No:
Does not apply.
No, no resubmit.