ropensci / software-review

rOpenSci Software Peer Review.

robotstxt #25

Closed petermeissner closed 8 years ago

petermeissner commented 8 years ago

Web scraping allows one to gather information of scientific value - mainly social-science related, in my experience. While scraping web pages, one should respect the permissions declared in robots.txt files. The package provides functions to retrieve and parse robots.txt files. The core functionality is to check a bot's/user's permission for one or more resources (paths) on a given domain. To ease checking, all functions have been bundled with the relevant data into an R6 robotstxt class, but everything works in a functional or an object-oriented style, depending on the user's preferences.
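
A quick usage sketch of the object-oriented interface (the check() method and its arguments appear in the review code further down in this thread; treat the details as illustrative):

library(robotstxt)

# fetch and parse the robots.txt of a domain once ...
rtxt <- robotstxt$new(domain = "pmeissner.com")

# ... then check permissions repeatedly against the cached data
rtxt$check(paths = c("/", "images/"), bot = "*")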

Package: robotstxt
Type: Package
Title: A 'robots.txt' Parser and Webbot/Webspider/Webcrawler Permissions Checker
Version: 0.1.0
Author: Peter Meissner
Maintainer: Peter Meissner <retep.meissner@gmai.com>
Description: Class ('R6') and accompanying methods to
    parse and check 'robots.txt' files. Data fields are provided as
    data frames and vectors. Permissions can be checked by providing
    path character vectors and optional bot names.
License: MIT + file LICENSE
LazyData: TRUE
BugReports: https://github.com/petermeissner/robotstxt/issues
URL: https://github.com/petermeissner/robotstxt
Imports:
    R6 (>= 2.1.1),
    stringr (>= 1.0.0),
    httr (>= 1.0.0)
Suggests:
    knitr,
    rmarkdown,
    dplyr,
    testthat
Depends:
    R (>= 3.0.0)
VignetteBuilder: knitr
RoxygenNote: 5.0.1

https://github.com/petermeissner/robotstxt

robots.txt files like:
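
For example, a minimal illustrative file:

User-agent: *
Disallow: /private/
Crawl-delay: 10
Sitemap: http://example.com/sitemap.xml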

Package developers and users that want an easy way to be nice while gathering data from the web.

None that I know of.

Yes, good guidelines!

No.

With or without ropensci.

yes, MIT

no:

* DONE
Status: OK

R CMD check succeeded

Does not apply.

No, no resubmit.

sckott commented 8 years ago

Thanks for the submission :rocket: - seeking reviewers

sckott commented 8 years ago

Reviewers: @richfitz @Ironholds

Ironholds commented 8 years ago
  1. Unless you own gmai.com, there's a typo in your DESCRIPTION file ;p.
  2. Do you need the alternate df implementations? They're fun but also a lot of extra code to replace stringsAsFactors = FALSE ;P.
  3. The assignment here seems extraneous.
  4. "did apply" to "applied" here

As you may notice, most of these quibbles are, well, quibbles. On the whole the package is excellent :). I'll try to put time into a more detailed review later today or early tomorrow.

petermeissner commented 8 years ago

Thanks for the ropensci infrastructure and thanks for the feedback - much-much-much appreciated.

in response to @Ironholds

1) Unless you own gmai.com there's a typo in your DESCRIPTION file ;p.

2) Do you need the alternate df implementations? They're fun but also a lot of extra code to replace stringsAsFactors = FALSE ;P.

3) The assignment here seems extraneous.

4) "did apply" to "applied" here

A question of my own:

path_allowed() (link) will respond to unforeseen permission constellations with a warning and return all the data that led to the problem. Is that reasonable behaviour?

Ironholds commented 8 years ago

I think that's totally fine behaviour, but I'd caution you, in that scenario, to use more formal/explanatory description text, noting that problematic data will be returned.

(On the subject of formality, working on a more formal review now)

Ironholds commented 8 years ago

On the whole: solid work! Some thoughts:

  1. There are a lot of functions, such as rt_get_fields, which are documented but not exported. What this means in practice is that the code can't be found or used, but the documentation does appear within the package index, which is pretty confusing. I'd suggest adding the "internal" roxygen2 keyword to the docs for those functions so that you don't index them (see the sketch after this list).
  2. The vignette is still using "Vignette Title" for indexing, so that's what CRAN will list it as (do not feel bad, I have made this mistake 150m times and never notice until I've already submitted it to CRAN and they've accepted >.>)
  3. The vignette opens "The package provides a class (‘R6’)". This makes it sound like the class is called R6, at least to me.
  4. The vignette could really benefit from a brief explanation of what robots.txt files are and why they matter at the top, with some examples, rather than just links at the bottom.
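
A minimal sketch of the fix for point 1, assuming standard roxygen2 tags (the signature of rt_get_fields is made up for illustration):

#' Extract fields from a parsed robots.txt
#'
#' @keywords internal
rt_get_fields <- function(txt) {
  txt  # placeholder body; only the @keywords internal tag matters here
}

With @keywords internal the help page is still built, but the function no longer shows up in the package index.
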
petermeissner commented 8 years ago

in response to @Ironholds II

1) drop unexported from manual

2) vignette title / index entry

3) The vignette opens "The package ...

4) The vignette could really benefit from a brief explanation of what robots.txt files are and why they matter ...

petermeissner commented 8 years ago

In regard to how path_allowed() handles problems with permission deduction.

petermeissner commented 8 years ago

Dear Oliver (@Ironholds ),

thank you very much for taking the time to review my package and comment on it. I think the package gained a lot from your suggestions, and I learned a lot myself. I really appreciate the effort. I would like to add you as a contributor within DESCRIPTION and README, if that's OK with you - what is the rOpenSci policy on that matter?

Ironholds commented 8 years ago

I feel like I should say "the package looks really good now, sorry for not mentioning it earlier, this week has been nuts" before accepting or rejecting ;p.

richfitz commented 8 years ago

I'm sorry, I have left this very late so I will just jot down the thoughts I have rather than a full review, in the hope that it is useful.

My biggest comment is that I don't see what R6 is adding here; it seems that the return value could just be a list (one which contains a single function).

Don't get me wrong -- I love R6. But in cases where reference semantics aren't needed or being used, it seems an unnecessary weirdness for most users (and in my experience users do find R6 objects weird). See the code below for a pure-list implementation of your function (purposefully left very similar) to show that there is no real difference in implementation. The only difference is that initialisation looks less weird (x <- robotstxt() rather than x <- robotstxt$new()).

robotstxt <- function(domain, text) {
  ## check input
  self <- list()
  if (missing(domain)) {
    self$domain <- NA
  }
  if (!missing(text)) {
    self$text <- text
    if (!missing(domain)) {
      self$domain <- domain
    }
  } else {
    if (!missing(domain)) {
      self$domain <- domain
      self$text   <- get_robotstxt(domain)
    } else {
      stop("robotstxt: I need text to initialize.")
    }
  }

  ## fill fields with default data
  tmp <- parse_robotstxt(self$text)
  self$bots        <- tmp$useragents
  self$comments    <- tmp$comments
  self$permissions <- tmp$permissions
  self$crawl_delay <- tmp$crawl_delay
  self$host        <- tmp$host
  self$sitemap     <- tmp$sitemap
  self$other       <- tmp$other

  self$check <- function(paths = "/", bot = "*", permission = self$permissions) {
    paths_allowed(permissions = permission, paths = paths, bot = bot)
  }

  class(self) <- "robotstxt"
  self
}

(This passes the package's checks after deleting only the $new in the test file.)

I really like the idea of downloading the file once and testing repeatedly. I wonder if that could be extended with the assumption that the file does not change during an R session. Then one could memoise the calls and avoid having any sort of object for users to worry about.

## one cache per R session; we assume robots.txt files do not change
## while the session is running
cache <- new.env(parent = emptyenv())

path_allowed <- function(url) {
  x <- httr::parse_url(url)
  obj <- cache[[x$hostname]]
  if (is.null(obj)) {
    ## first request for this host: fetch, parse, and memoise
    cache[[x$hostname]] <- obj <- robotstxt(x$hostname)
  }
  obj$check(x$path)
}

Minor comments

Avoid httr::content(rtxt) in package code (see ?httr::content) in favour of requesting an explicit return type and checking that it is what you expect. In development versions of httr I find I also have to declare encoding = "UTF-8" to avoid a message on every request, too.
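
A sketch of the explicit call I mean, assuming rtxt is a response object from httr::GET():

txt <- httr::content(rtxt, as = "text", encoding = "UTF-8")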

In get_robotstxt, is warning the appropriate behaviour? I would have thought that for 403 and 5xx errors it should be an error. For 404 errors, could we not assume that there _is_ no robots.txt and therefore it's all fair game?
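
A sketch of the branching I have in mind (the URL construction is a guess at the package's internals; the status handling is only a suggestion):

get_robotstxt <- function(domain) {
  response <- httr::GET(paste0("http://", domain, "/robots.txt"))
  status   <- httr::status_code(response)
  if (status == 404) {
    ""  # no robots.txt: assume everything is allowed
  } else if (status >= 400) {
    stop("get_robotstxt(): HTTP status ", status, " for ", domain)
  } else {
    httr::content(response, as = "text", encoding = "UTF-8")
  }
}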

I find that missing() lends itself to hard-to-diagnose errors when functions are several layers deep; consider explicit NULL arguments and is.null() tests for missing-ness (purely style).
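
Illustrated on the constructor (this happens to be the signature the package adopted later, per the R CMD check output further down):

robotstxt <- function(domain = NULL, text = NULL) {
  if (is.null(text)) {
    if (is.null(domain)) {
      stop("robotstxt: I need text to initialize.")
    }
    text <- get_robotstxt(domain)
  }
  ## ... continue parsing `text` as before ...
}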

You have a partial match of $useragent to $useragents in R/permissions.R and throughout tests/testthat/test_parser.R

petermeissner commented 8 years ago

Pfuh, some thoughts, aye? This all sounds very interesting and helpful, in the sense that I really have to think about it. Now, this will take me a while. Thank you very much - most appreciated!

petermeissner commented 8 years ago

@richfitz

I am working on implementing all of your suggestions. But I would really like to understand better what happens in https://github.com/petermeissner/robotstxt/issues/6.

Do you maybe have a reference or a catchphrase for me that I can look up? What I specifically have a hard time wrapping my head around is how the function within the list knows where to find the data.

An example:

# function definition

robotstxt_dev <- function(domain, text) {
  ## check input
  self <- list()
  if (missing(domain)) {
    self$domain <- NA
  }
  if (!missing(text)) {
    self$text <- text
    if (!missing(domain)) {
      self$domain <- domain
    }
  } else {
    if (!missing(domain)) {
      self$domain <- domain
      self$text   <- get_robotstxt(domain)
    } else {
      stop("robotstxt: I need text to initialize.")
    }
  }

  ## fill fields with default data
  tmp <- parse_robotstxt(self$text)
  self$bots        <- tmp$useragents
  self$comments    <- tmp$comments
  self$permissions <- tmp$permissions
  self$crawl_delay <- tmp$crawl_delay
  self$host        <- tmp$host
  self$sitemap     <- tmp$sitemap
  self$other       <- tmp$other

  self$check <- function(paths = "/", bot = "*", permission = self$permissions) {
    paths_allowed(permissions = permission, paths = paths, bot = bot)
  }

  class(self) <- "robotstxt"
  self
}

# example
library(robotstxt)
blah <- robotstxt_dev("pmeissner.com")$check
blah("_layouts")

At the moment, that feels like pure black magic.
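
My best guess at the magic word is "closure": an R function carries the environment it was created in, so check can still see the self from inside robotstxt_dev() after the call has returned. A tiny self-contained illustration of what I mean:

make_counter <- function() {
  n <- 0                # lives in make_counter()'s local environment
  function() {
    n <<- n + 1         # the returned function still sees (and updates) n
    n
  }
}

count <- make_counter()
count()  # 1
count()  # 2 - the environment survived the original call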

petermeissner commented 8 years ago

@richfitz

I have now implemented all suggestions.

Thank you very, very much for taking the time to go through my code. The package gained a lot from that, I think, and I learned a lot myself. I really appreciate the effort. Thanks.

petermeissner commented 8 years ago

Just some further thoughts ...

In regard to discarding R6 because reference semantics are not being used, I have done some research - ah, so that's what it is.
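
A small experiment I used to convince myself of the difference (plain R6, nothing package-specific):

library(R6)

Counter <- R6Class("Counter", public = list(
  n    = 0,
  bump = function() self$n <- self$n + 1
))

a <- Counter$new()
b <- a       # b and a refer to the SAME object (reference semantics)
b$bump()
a$n          # 1

l <- list(n = 0)
m <- l       # m is an independent copy (value semantics)
m$n <- 1
l$n          # still 0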

petermeissner commented 8 years ago

Hey everyone (@sckott), so what is the next step? All suggestions have been gratefully accepted and incorporated. Is another thumbs up needed from @richfitz, or is there something else I should/can do?

sckott commented 8 years ago

seems ready to merge in. I do see

✔  checking for missing documentation entries
W  checking for code/documentation mismatches
   Codoc mismatches from documentation object 'robotstxt':
   robotstxt
     Code: function(domain = NULL, text = NULL)
     Docs: function(domain = "mydomain.com")
     Argument names in code not in docs:
       text
     Mismatches in argument default values:
       Name: 'domain' Code: NULL Docs: "mydomain.com"
   robotstxt
     Code: function(domain = NULL, text = NULL)
     Docs: function(text = "User-agent: *\nDisallow: /")
     Argument names in code not in docs:
       domain
     Mismatches in argument names:
       Position: 1 Code: domain Docs: text
     Mismatches in argument default values:
       Name: 'text' Code: NULL Docs: "User-agent: *\nDisallow: /"

W  checking Rd \usage sections
   Undocumented arguments in documentation object 'get_robotstxt'
     ‘warn’

   Undocumented arguments in documentation object 'robotstxt'
     ‘domain’ ‘text’

can these be fixed first?

petermeissner commented 8 years ago

Sorry, I missed that. @sckott thanks for checking - it's all fixed.

All checks now pass locally (Ubuntu) without errors, warnings, or notes; the same is true for the win-builder checks.

sckott commented 8 years ago

Great! Thanks. Approved.

I added you to a team with admin access, and now you should be able to transfer the repo to ropenscilabs - we move packages to the ropensci GitHub org once they're more mature - your package is still a fully rOpenSci package even though it lives in ropenscilabs

petermeissner commented 8 years ago

Coooool.

Transfer repo? It is done via Settings > Danger Zone > Transfer ownership : "ropenscilabs" - right? How is pushing to CRAN handled thereafter? I would like to push the current version to CRAN.

sckott commented 8 years ago

It is done via Settings > Danger Zone > Transfer ownership : "ropenscilabs"

yes

How is pushing to CRAN handled thereafter? I would like to push the current thing to CRAN.

same as before - you are still the maintainer of your pkg, and can submit to CRAN as you like. You can push new version to CRAN before transferring if you like

petermeissner commented 8 years ago

Transferred to ropenscilabs - everything works fine.

sckott commented 8 years ago

awesome, thanks for your work and contribution

petermeissner commented 8 years ago

Nah..., I am the one indebted for all the great feedback - thanks for letting me join the club.