ropensci / robotstxt

R 📦 for parsing and checking robots.txt files 🤖
https://docs.ropensci.org/robotstxt

Possible logic/regex error with `paths_allowed()` #18

Closed hrbrmstr closed 7 years ago

hrbrmstr commented 7 years ago

(explanation follows)

library(robotstxt)

get_robotstxt("www.cdc.gov")
## # Ignore FrontPage files
## User-agent: *
## Disallow: /_borders
## Disallow: /_derived
## Disallow: /_fpclass
## Disallow: /_overlay
## Disallow: /_private
## Disallow: /_themes
## Disallow: /_vti_bin
## Disallow: /_vti_cnf
## Disallow: /_vti_log
## Disallow: /_vti_map
## Disallow: /_vti_pvt
## Disallow: /_vti_txt
## 
## # Do not index the following URLs
## Disallow: /travel/
## Disallow: /flu/espanol/
## Disallow: /migration/
## Disallow: /Features/SpinaBifidaProgram/
## Disallow: /concussion/HeadsUp/training/
## 
## # Don't spider search pages
## Disallow: /search.do
## 
## # Don't spider email-this-page pages
## Disallow: /email.do
##  
## # Don't spider printer-friendly versions of pages
## Disallow: /print.do
## 
## # Rover is a bad dog
## User-agent: Roverbot
## Disallow: /
## 
## # EmailSiphon is a hunter/gatherer which extracts email addresses for spam-mailers to use
## User-agent: EmailSiphon
## Disallow: /
## 
## # Exclude MindSpider since it appears to be ill-behaved
## User-agent: MindSpider
## Disallow: /
## 
## # Sitemap link per CR14586
## Sitemap: http://www.cdc.gov/niosh/sitemaps/sitemapsNIOSH.xml
## 
paths_allowed("/asthma/asthma_stats/default.htm", "www.cdc.gov")
## [1] FALSE

Via: https://technicalseo.com/seo-tools/robots-txt/

[screenshot from the technicalseo.com robots.txt testing tool]

And:

import urllib.robotparser as robotparser

parser = robotparser.RobotFileParser()
parser.set_url('http://www.cdc.gov/robots.txt')
parser.read()
print(parser.can_fetch("*", "/asthma/asthma_stats/default.htm"))
## True

I was prepping a blog post to introduce a function that would prefix an httr::GET() request with a robots.txt path check (it's part of a larger personal project I'm working on). As you can see, it did not work properly on the CDC site, which does, indeed, allow scraping of that path.
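
For context, here is roughly the shape of the wrapper I was prepping (a minimal sketch; robots_aware_GET() is a hypothetical name, not the function from the post):

library(robotstxt)
library(httr)

robots_aware_GET <- function(url, bot = "*", ...) {
  parsed <- httr::parse_url(url)
  # parse_url() returns the path without a leading slash
  path <- paste0("/", parsed$path)
  if (!robotstxt::paths_allowed(paths = path, domain = parsed$hostname, bot = bot)) {
    stop("robots.txt disallows fetching: ", url)
  }
  httr::GET(url, ...)
}

robots_aware_GET("https://www.cdc.gov/asthma/asthma_stats/default.htm")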

It works fine on others in the examples I was using, such as https://finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC.

I haven't poked at the code yet to see what may be causing the disconnect, but I'll see if I can figure out what's going on.

I filed the issue first since this might just trigger an "Aha!" on your end with a quick fix :-)

petermeissner commented 7 years ago

Hey, thanks for reporting ...

I just had a quick look at your example and I think there is a bug in robotstxt, but I can't quite put my finger on it. Retrieving the robots.txt seems to work, but far too many restrictions come out of the parse:

rtxt <- robotstxt(
  domain = "www.cdc.gov"
)

rtxt$text
rtxt$permissions
## field   useragent                         value
## 1  Disallow           *                     /_borders
## 2  Disallow    Roverbot                     /_derived
## 3  Disallow EmailSiphon                     /_fpclass
## 4  Disallow  MindSpider                     /_overlay
## 5  Disallow           *                     /_private
## 6  Disallow    Roverbot                      /_themes
## 7  Disallow EmailSiphon                     /_vti_bin
## 8  Disallow  MindSpider                     /_vti_cnf
## 9  Disallow           *                     /_vti_log
## 10 Disallow    Roverbot                     /_vti_map
## 11 Disallow EmailSiphon                     /_vti_pvt
## 12 Disallow  MindSpider                     /_vti_txt
## 13 Disallow           *                      /travel/
## 14 Disallow    Roverbot                 /flu/espanol/
## 15 Disallow EmailSiphon                   /migration/
## 16 Disallow  MindSpider /Features/SpinaBifidaProgram/
## 17 Disallow           * /concussion/HeadsUp/training/
## 18 Disallow    Roverbot                    /search.do
## 19 Disallow EmailSiphon                     /email.do
## 20 Disallow  MindSpider                     /print.do
## 21 Disallow           *                             /
## 22 Disallow    Roverbot                             /
## 23 Disallow EmailSiphon                             /
## 24 Disallow  MindSpider                     /_borders
## 25 Disallow           *                     /_derived
## 26 Disallow    Roverbot                     /_fpclass
## 27 Disallow EmailSiphon                     /_overlay
## 28 Disallow  MindSpider                     /_private
## 29 Disallow           *                      /_themes
## 30 Disallow    Roverbot                     /_vti_bin
## 31 Disallow EmailSiphon                     /_vti_cnf
## 32 Disallow  MindSpider                     /_vti_log
## 33 Disallow           *                     /_vti_map
## 34 Disallow    Roverbot                     /_vti_pvt
## 35 Disallow EmailSiphon                     /_vti_txt
## 36 Disallow  MindSpider                      /travel/
## 37 Disallow           *                 /flu/espanol/
## 38 Disallow    Roverbot                   /migration/
## 39 Disallow EmailSiphon /Features/SpinaBifidaProgram/
## 40 Disallow  MindSpider /concussion/HeadsUp/training/
## 41 Disallow           *                    /search.do
## 42 Disallow    Roverbot                     /email.do
## 43 Disallow EmailSiphon                     /print.do
## 44 Disallow  MindSpider                             /
## 45 Disallow           *                             /
## 46 Disallow    Roverbot                             /
## 47 Disallow EmailSiphon                     /_borders
## 48 Disallow  MindSpider                     /_derived
## 49 Disallow           *                     /_fpclass
## 50 Disallow    Roverbot                     /_overlay
## 51 Disallow EmailSiphon                     /_private
## 52 Disallow  MindSpider                      /_themes
## 53 Disallow           *                     /_vti_bin
## 54 Disallow    Roverbot                     /_vti_cnf
## 55 Disallow EmailSiphon                     /_vti_log
## 56 Disallow  MindSpider                     /_vti_map
## 57 Disallow           *                     /_vti_pvt
## 58 Disallow    Roverbot                     /_vti_txt
## 59 Disallow EmailSiphon                      /travel/
## 60 Disallow  MindSpider                 /flu/espanol/
## 61 Disallow           *                   /migration/
## 62 Disallow    Roverbot /Features/SpinaBifidaProgram/
## 63 Disallow EmailSiphon /concussion/HeadsUp/training/
## 64 Disallow  MindSpider                    /search.do
## 65 Disallow           *                     /email.do
## 66 Disallow    Roverbot                     /print.do
## 67 Disallow EmailSiphon                             /
## 68 Disallow  MindSpider                             /
## 69 Disallow           *                             /
## 70 Disallow    Roverbot                     /_borders
## 71 Disallow EmailSiphon                     /_derived
## 72 Disallow  MindSpider                     /_fpclass
## 73 Disallow           *                     /_overlay
## 74 Disallow    Roverbot                     /_private
## 75 Disallow EmailSiphon                      /_themes
## 76 Disallow  MindSpider                     /_vti_bin
## 77 Disallow           *                     /_vti_cnf
## 78 Disallow    Roverbot                     /_vti_log
## 79 Disallow EmailSiphon                     /_vti_map
## 80 Disallow  MindSpider                     /_vti_pvt
## 81 Disallow           *                     /_vti_txt
## 82 Disallow    Roverbot                      /travel/
## 83 Disallow EmailSiphon                 /flu/espanol/
## 84 Disallow  MindSpider                   /migration/
## 85 Disallow           * /Features/SpinaBifidaProgram/
## 86 Disallow    Roverbot /concussion/HeadsUp/training/
## 87 Disallow EmailSiphon                    /search.do
## 88 Disallow  MindSpider                     /email.do
## 89 Disallow           *                     /print.do
## 90 Disallow    Roverbot                             /
## 91 Disallow EmailSiphon                             /
## 92 Disallow  MindSpider                             /

... looks like some recycling of values gone wild when building the permissions data.frame -- maybe.
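
For illustration, silent recycling in data.frame() produces exactly this kind of interleaving when the user-agent vector is shorter than the value vector (my own sketch, not the package code):

useragents <- c("*", "Roverbot", "EmailSiphon", "MindSpider")
values     <- c("/_borders", "/_derived", "/_fpclass", "/_overlay",
                "/_private", "/_themes", "/_vti_bin", "/_vti_cnf")

# length 4 is recycled against length 8 without a warning, pairing each
# Disallow value with the "next" user agent in turn -- the same cycling
# pattern visible in the permissions table above
data.frame(field = "Disallow", useragent = useragents, value = values)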

hrbrmstr commented 7 years ago

I threw together https://github.com/hrbrmstr/rep since the library it wraps is a pretty decent one insofar as parsing & testing goes. (I wasn't sure if you'd want a C++11 dependency in robotstxt… if so, I can add some error checking and see about making this an invisible addition to robotstxt proper.)

hrbrmstr commented 7 years ago

(Just a follow-up.) I had some cycles, so I added AppVeyor checks and, lo and behold, it works on Windows as well as macOS and Linux (so cross-platform support won't be an issue).

petermeissner commented 7 years ago

1) I had a deeper look and could reproduce / isolate the error: there is an issue with "\r\n" line endings, although I do not really get why yet.

2) The cdc.gov robots.txt is included within the package and has two simple tests.
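
Roughly, the kind of check involved (a sketch only; the file path is hypothetical and the actual tests may differ):

library(testthat)
library(robotstxt)

# parse the bundled copy of the cdc.gov robots.txt (real location in the package may differ)
cdc_txt <- paste(readLines("robotstxts/robots_cdc.txt"), collapse = "\n")
rtxt    <- robotstxt(text = cdc_txt)

test_that("cdc.gov robots.txt parses without cross-user-agent recycling", {
  star <- subset(rtxt$permissions, useragent == "*")
  expect_true("/travel/" %in% star$value)   # listed under User-agent: *
  expect_false("/" %in% star$value)         # "/" belongs only to the named bots
})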

hrbrmstr commented 7 years ago

oh, cool! #ty for the fix (and I'm glad it wasn't too gnarly)

petermeissner commented 7 years ago

3) I do not mind a C++ dependency IF it's worth the additional dependency ...

If you think this is solid, reasonably easy to integrate, and a little more battle-tested, then go ahead. One solid solution is better than two or three brittle ones.

petermeissner commented 7 years ago

I reopened because it is still not fixed and I will not have time to do so today.

hrbrmstr commented 7 years ago

I'll whip up a proposed implementation for it in an updated fork and shoot a link when it's done.

hrbrmstr commented 7 years ago

Just a note on why I started looking at a C++ implementation:

library(robotstxt)
library(rep)
library(microbenchmark)

rt_raw <- get_robotstxt("https://cdc.gov")
rep <- robxp(rt_raw)

microbenchmark(
  fetchable = can_fetch(rep, "/asthma/asthma_stats/default.htm", "*")
) -> mb

mb
## Unit: microseconds
##       expr   min    lq    mean median     uq     max neval
##  fetchable 5.713 6.253 9.12847  6.447 6.7455 220.458   100

~6 microseconds is kinda sweet and if I vectorize at C++-level then it'll still be pretty fast.
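
In the meantime, a naive R-level vectorization over the same interface gets the job done (sketch; can_fetch_many() is a hypothetical helper, not part of rep):

# loop in R for now; a C++-level loop would avoid the per-call overhead
can_fetch_many <- function(rx, paths, user_agent = "*") {
  vapply(paths, function(p) can_fetch(rx, p, user_agent), logical(1))
}

can_fetch_many(rep, c("/asthma/asthma_stats/default.htm", "/travel/"))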

petermeissner commented 7 years ago

nice - also, robotstxt depends on stringr and httr, so it is not exactly lightweight anyway

petermeissner commented 7 years ago

I am 99.99 percent sure that www.cdc.gov's robots.txt is actually not valid - there should be no empty lines except before user-agent fields.

I'm still not sure how to best handle this edge case.

petermeissner commented 7 years ago

I found the problem ... the file is valid, but my function did not handle '\r\n' line endings and got confused.
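
A minimal illustration of the confusion (my own sketch, not the package internals): splitting on "\n" alone leaves a trailing "\r" on every field, which then breaks exact matching of field names and values.

txt <- "User-agent: *\r\nDisallow: /travel/\r\n"

strsplit(txt, "\n", fixed = TRUE)[[1]]
## [1] "User-agent: *\r"      "Disallow: /travel/\r"

# normalizing the line endings first removes the stray "\r"
strsplit(gsub("\r\n", "\n", txt, fixed = TRUE), "\n", fixed = TRUE)[[1]]
## [1] "User-agent: *"      "Disallow: /travel/"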