JohnMTrimbleIII opened this issue 6 years ago
+1 I'm having the same problem. The spider looks for http://localhost:8050/robots.txt, which does not exist, and I'm having trouble getting it to apply the robots.txt rules of my target site.
Same problem here. The spider first downloads the correct robots.txt and then tries to download the localhost one:
2019-02-16 21:51:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://testwebsite.de/robots.txt> (referer: None)
2019-02-16 21:51:02 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://localhost:8050/robots.txt> (referer: None)
Robots.txt is read at the start of crawling. You could disable that feature in the settings, or write a downloader middleware to handle robots.txt yourself:
https://docs.scrapy.org/en/latest/topics/settings.html
https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#topics-dlmw-robots
https://stackoverflow.com/questions/37274835/getting-forbidden-by-robots-txt-scrapy
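If obeying robots.txt is not actually required, the first option above is a one-line change. `ROBOTSTXT_OBEY` defaults to `False` in Scrapy itself, but the `startproject` template turns it on, so it usually needs to be set explicitly:

```python
# settings.py -- stop Scrapy from fetching and obeying robots.txt
ROBOTSTXT_OBEY = False
```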
On Sat, Feb 16, 2019, 21:53, Tobias Keller notifications@github.com wrote:
same problem here
I disabled the robots.txt middleware, subclassed it, and changed the line that builds the robots.txt URL in the first place, so it fetched the right URL and worked. In my case I wanted to obey robots.txt; just turning it off was not a solution.
Can you share this? Disabling robots.txt handling entirely is not an option for me.
from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware
from scrapy.http import Request
from scrapy.utils.httpobj import urlparse_cached
from twisted.internet.defer import Deferred


class MyRobotsTxtMiddleware(RobotsTxtMiddleware):
    def robot_parser(self, request, spider):
        url = urlparse_cached(request)
        netloc = url.netloc
        if netloc not in self._parsers:
            self._parsers[netloc] = Deferred()
            # This is the changed line: fetch robots.txt from the target
            # site instead of deriving it from the (Splash) request URL.
            robotsurl = "https://www.example.com/robots.txt"
            robotsreq = Request(
                robotsurl,
                priority=self.DOWNLOAD_PRIORITY,
                meta={'dont_obey_robotstxt': True},
            )
            dfd = self.crawler.engine.download(robotsreq, spider)
            dfd.addCallback(self._parse_robots, netloc)
            dfd.addErrback(self._logerror, robotsreq, spider)
            dfd.addErrback(self._robots_error, netloc)
            self.crawler.stats.inc_value('robotstxt/request_count')

        if isinstance(self._parsers[netloc], Deferred):
            # robots.txt is still being fetched; chain onto the pending
            # Deferred so this request waits for the parser.
            d = Deferred()

            def cb(result):
                d.callback(result)
                return result

            self._parsers[netloc].addCallback(cb)
            return d
        else:
            return self._parsers[netloc]
Note that RobotsTxtMiddleware is a downloader middleware, not a spider middleware, so the subclass has to be registered under DOWNLOADER_MIDDLEWARES for the stock one to actually be replaced:

DOWNLOADER_MIDDLEWARES = {
    'mycrawler.middlewares.MyRobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None,
}
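Rather than hardcoding one site's robots URL, the subclass could derive it per request. This is only a sketch under one assumption (the helper names `real_target_url` and `robots_url_for` are mine, not part of Scrapy or scrapy-splash): with scrapy-splash the outgoing request points at the Splash endpoint, while the page actually being rendered is stored under `request.meta['splash']['args']['url']`, which is why the stock middleware ends up asking localhost:8050 for robots.txt.

```python
from urllib.parse import urlparse, urlunparse


def real_target_url(request):
    # scrapy-splash keeps the rendered page's URL in the request meta;
    # fall back to request.url for plain (non-Splash) requests.
    splash_args = request.meta.get('splash', {}).get('args', {})
    return splash_args.get('url', request.url)


def robots_url_for(url):
    # Build "<scheme>://<netloc>/robots.txt" for the target site.
    parts = urlparse(url)
    return urlunparse((parts.scheme, parts.netloc, '/robots.txt', '', '', ''))
```

Inside `robot_parser`, `robotsurl = robots_url_for(real_target_url(request))` would then replace the hardcoded line.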
@ArthurJ where did you add this code, though? I'm quite new to web crawling and I've been having huge trouble with my crawler not returning what it should. Thanks.
The same thing happens to me: the spider first downloads the correct robots.txt and then tries to download the localhost one. However, I still see in my logs that some links are "Forbidden by robots.txt", so I'm a bit confused about whether the spider really obeys robots.txt or not.
Is scrapy-splash not compatible with obeying robots.txt? Every time I make a request it attempts to download robots.txt from the Docker instance running Splash. Below is my settings file. I'm thinking it may be a misordering of the middlewares, but I'm not sure what it should look like.
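For comparison, the downloader/spider middleware ordering recommended in the scrapy-splash README (none of which touches RobotsTxtMiddleware, which is why the localhost fetch still happens) is:

```python
# settings.py -- middleware setup from the scrapy-splash README
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```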