Closed hrbrmstr closed 7 years ago
Hey, thanks for reporting ...
I just had a quick look at your example and I think there is a bug in robotstxt, but I cannot put my finger on it. Robots.txt retrieval seems to work but there are too many restrictions:
rtxt <- robotstxt(
domain = "www.cdc.gov"
)
rtxt$text
rtxt$permissions
## field useragent value
## 1 Disallow * /_borders
## 2 Disallow Roverbot /_derived
## 3 Disallow EmailSiphon /_fpclass
## 4 Disallow MindSpider /_overlay
## 5 Disallow * /_private
## 6 Disallow Roverbot /_themes
## 7 Disallow EmailSiphon /_vti_bin
## 8 Disallow MindSpider /_vti_cnf
## 9 Disallow * /_vti_log
## 10 Disallow Roverbot /_vti_map
## 11 Disallow EmailSiphon /_vti_pvt
## 12 Disallow MindSpider /_vti_txt
## 13 Disallow * /travel/
## 14 Disallow Roverbot /flu/espanol/
## 15 Disallow EmailSiphon /migration/
## 16 Disallow MindSpider /Features/SpinaBifidaProgram/
## 17 Disallow * /concussion/HeadsUp/training/
## 18 Disallow Roverbot /search.do
## 19 Disallow EmailSiphon /email.do
## 20 Disallow MindSpider /print.do
## 21 Disallow * /
## 22 Disallow Roverbot /
## 23 Disallow EmailSiphon /
## 24 Disallow MindSpider /_borders
## 25 Disallow * /_derived
## 26 Disallow Roverbot /_fpclass
## 27 Disallow EmailSiphon /_overlay
## 28 Disallow MindSpider /_private
## 29 Disallow * /_themes
## 30 Disallow Roverbot /_vti_bin
## 31 Disallow EmailSiphon /_vti_cnf
## 32 Disallow MindSpider /_vti_log
## 33 Disallow * /_vti_map
## 34 Disallow Roverbot /_vti_pvt
## 35 Disallow EmailSiphon /_vti_txt
## 36 Disallow MindSpider /travel/
## 37 Disallow * /flu/espanol/
## 38 Disallow Roverbot /migration/
## 39 Disallow EmailSiphon /Features/SpinaBifidaProgram/
## 40 Disallow MindSpider /concussion/HeadsUp/training/
## 41 Disallow * /search.do
## 42 Disallow Roverbot /email.do
## 43 Disallow EmailSiphon /print.do
## 44 Disallow MindSpider /
## 45 Disallow * /
## 46 Disallow Roverbot /
## 47 Disallow EmailSiphon /_borders
## 48 Disallow MindSpider /_derived
## 49 Disallow * /_fpclass
## 50 Disallow Roverbot /_overlay
## 51 Disallow EmailSiphon /_private
## 52 Disallow MindSpider /_themes
## 53 Disallow * /_vti_bin
## 54 Disallow Roverbot /_vti_cnf
## 55 Disallow EmailSiphon /_vti_log
## 56 Disallow MindSpider /_vti_map
## 57 Disallow * /_vti_pvt
## 58 Disallow Roverbot /_vti_txt
## 59 Disallow EmailSiphon /travel/
## 60 Disallow MindSpider /flu/espanol/
## 61 Disallow * /migration/
## 62 Disallow Roverbot /Features/SpinaBifidaProgram/
## 63 Disallow EmailSiphon /concussion/HeadsUp/training/
## 64 Disallow MindSpider /search.do
## 65 Disallow * /email.do
## 66 Disallow Roverbot /print.do
## 67 Disallow EmailSiphon /
## 68 Disallow MindSpider /
## 69 Disallow * /
## 70 Disallow Roverbot /_borders
## 71 Disallow EmailSiphon /_derived
## 72 Disallow MindSpider /_fpclass
## 73 Disallow * /_overlay
## 74 Disallow Roverbot /_private
## 75 Disallow EmailSiphon /_themes
## 76 Disallow MindSpider /_vti_bin
## 77 Disallow * /_vti_cnf
## 78 Disallow Roverbot /_vti_log
## 79 Disallow EmailSiphon /_vti_map
## 80 Disallow MindSpider /_vti_pvt
## 81 Disallow * /_vti_txt
## 82 Disallow Roverbot /travel/
## 83 Disallow EmailSiphon /flu/espanol/
## 84 Disallow MindSpider /migration/
## 85 Disallow * /Features/SpinaBifidaProgram/
## 86 Disallow Roverbot /concussion/HeadsUp/training/
## 87 Disallow EmailSiphon /search.do
## 88 Disallow MindSpider /email.do
## 89 Disallow * /print.do
## 90 Disallow Roverbot /
## 91 Disallow EmailSiphon /
## 92 Disallow MindSpider /
... looks like some recycling of values when creating the permission data.frame gone wild -- maybe.
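For what it's worth, the 92 rows above are exactly 4 user agents times 23 rule values, which is the pattern you would get if both columns were cycled to a common length. Below is a minimal sketch of how that can happen; the vectors are shortened, made-up inputs and this is not the actual robotstxt parsing code.

```r
# Hypothetical inputs: a few user agents and rule paths (not the real parser state)
useragents <- c("*", "Roverbot", "EmailSiphon", "MindSpider")
values     <- c("/_borders", "/_derived", "/_fpclass")

# Cycle both vectors to a common length, so every agent ends up
# paired with every path instead of only with its own rules
n <- length(useragents) * length(values)
permissions <- data.frame(
  field            = "Disallow",
  useragent        = rep(useragents, length.out = n),
  value            = rep(values, length.out = n),
  stringsAsFactors = FALSE
)

print(permissions)
```

With 4 agents and 3 paths this gives 12 rows, with the agents repeating every 4 rows and the paths every 3, just like the listing above repeats agents every 4 rows and paths every 23.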
I threw this (https://github.com/hrbrmstr/rep) together since the library it wraps is a pretty decent one, insofar as parsing & testing goes. (I wasn't sure if you'd want a C++11 dependency in robotstxt… if so, I can add some error checking and see about making this an invisible addition to robotstxt-proper.)
(just a follow-up) I had some cycles so I added appveyor checks and lo and behold it works on Windows as well as macOS and Linux (so x-platform won't be an issue).
1) I had a deeper look and could reproduce / isolate the error: there is an issue with "\r\n" line endings, although I do not really get why yet.
2) The cdc.gov/robots.txt is included within the package and has two simple tests.
oh, cool! #ty for the fix (and I'm glad it wasn't too gnarly)
3) I do not mind a C++-dependency IF it's worth the additional dependency ...
If you think this is solid, reasonably easy to integrate and a little more battle tested, then go ahead. One solid solution is better than two or three brittle ones.
I reopened because it is still not fixed and I will not have time to do so today.
I'll whip up a proposed implementation for it in an updated fork and shoot a link when it's done.
Just a note why I started looking at a C++ impl:
library(robotstxt)
library(rep)
library(microbenchmark)
rt_raw <- get_robotstxt("https://cdc.gov")
rep <- robxp(rt_raw)
microbenchmark(
fetchable = can_fetch(rep, "/asthma/asthma_stats/default.htm", "*")
) -> mb
mb
## Unit: microseconds
## expr min lq mean median uq max neval
## fetchable 5.713 6.253 9.12847 6.447 6.7455 220.458 100
~6 microseconds is kinda sweet and if I vectorize at C++-level then it'll still be pretty fast.
nice - also robotstxt depends on stringr and httr, so it is not exactly lightweight - anyways
I am 99.99 percent sure that www.cdc.gov's robots.txt is actually not valid: there should be no empty lines except above User-agent fields.
I'm still not sure how to best handle this edge case.
I found the problem ... the file is valid but my function did not handle '\r\n' line endings and got confused.
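A minimal base R sketch of the line-ending issue, assuming the parser splits the file on "\n". The robots.txt content here is made up for illustration, and this is not necessarily how the fix in robotstxt is implemented.

```r
# Hypothetical robots.txt content with Windows-style "\r\n" line endings
txt <- "User-agent: *\r\nDisallow: /private/\r\n\r\nUser-agent: Roverbot\r\nDisallow: /\r\n"

# Splitting on "\n" alone leaves a trailing "\r" on every line, so a
# "blank" separator line is really "\r" and no longer looks empty
naive <- strsplit(txt, "\n", fixed = TRUE)[[1]]

# Normalizing "\r\n" (and lone "\r") to "\n" first gives clean lines
normalized <- gsub("\r\n?", "\n", txt)
clean <- strsplit(normalized, "\n", fixed = TRUE)[[1]]

print(naive == "")  # no element is recognized as an empty line
print(clean == "")  # the separator between the two records is found
```

If records are separated by looking for empty lines, the naive split would never find a separator, which could explain every rule being attached to every user agent.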
(explanation follows)
Via: https://technicalseo.com/seo-tools/robots-txt/
And:
I was prepping a blog post to introduce a function that would prefix an httr::GET() request with a robots.txt path check (which is part of a larger personal project I'm working on). As you can see, it did not work properly on the CDC site which does, indeed, allow scraping. It works fine on others in the examples I was using, such as https://finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC.
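Roughly, the idea looks like the sketch below. The name `polite_get` and its `checker` / `fetcher` arguments are hypothetical and only here for illustration; the defaults assume robotstxt::paths_allowed() and httr::GET(), and the real function in the blog post may differ.

```r
# Sketch only: `polite_get` is a hypothetical name. The `checker` and
# `fetcher` defaults assume robotstxt::paths_allowed() and httr::GET().
polite_get <- function(url,
                       bot = "*",
                       checker = robotstxt::paths_allowed,
                       fetcher = httr::GET,
                       ...) {
  # Split "scheme://host/path" into host and path using base R only
  m <- regmatches(url, regexec("^https?://([^/]+)(.*)$", url))[[1]]
  if (length(m) == 0) stop("not an http(s) URL: ", url)
  domain <- m[2]
  path <- if (nzchar(m[3])) m[3] else "/"

  # Refuse to fetch when robots.txt disallows the path for this bot
  if (!isTRUE(checker(paths = path, domain = domain, bot = bot))) {
    stop("robots.txt disallows ", path, " on ", domain, " for bot ", bot)
  }

  # Only reached when the path check passed
  fetcher(url, ...)
}
```

Passing the checker and fetcher in as arguments keeps the path check easy to try out with stand-in functions, without hitting the network.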
I haven't poked at the code yet to see what may be causing the disconnect but I'll see if I can figure out what's going on.
I filed the issue first since this might just trigger an "Aha!" on your end with a quick fix :-)