RajatGoyal opened this issue 9 years ago
Good finding, actually. This could happen because of redirects: when a redirect happens, Frontera gets a response object with the last (already redirected) URL and will not match it with the record in the database. Therefore, it creates a new record and marks it as `CRAWLED`, while the old one remains `QUEUED`.
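To make the mismatch concrete, here is a minimal sketch, assuming Frontera's default behaviour of keying records on a sha1 fingerprint of the URL (the helper below is a simplified stand-in, not Frontera's actual fingerprinting code):

```python
from hashlib import sha1


def url_fingerprint(url):
    # Simplified stand-in for Frontera's URL fingerprinting.
    return sha1(url.encode('utf-8')).hexdigest()


# The backend stored the original URL as QUEUED...
queued_fp = url_fingerprint('http://example.com/old-page')
# ...but after a redirect the response carries the final URL, so its
# fingerprint matches no existing record and a fresh CRAWLED record
# is created instead.
crawled_fp = url_fingerprint('http://example.com/new-page')
assert queued_fp != crawled_fp
```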
There is a canonical solvers mechanism which should return the canonical URL, https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/canonicalsolvers/basic.py; based on that we could mark the old record as `CRAWLED` too. But that needs to be coded; a PR is welcome, as usual.
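For context, a rough sketch of what such a solver could look like, modelled on the linked basic.py; treat the class and method signatures as assumptions, since the actual `CanonicalSolver` interface may differ between versions:

```python
class RedirectCanonicalSolver(object):
    # Hypothetical solver sketched after basic.py; not the shipped class.

    def page_crawled(self, response, links):
        self._solve(response)
        for link in links:
            self._solve(link)

    def _solve(self, obj):
        # Scrapy records the redirect chain in meta['redirect_urls'].
        # Folding the object back to the first URL in that chain lets
        # the backend match the existing QUEUED record and flip it to
        # CRAWLED, instead of inserting a second record. Rewriting the
        # private _url attribute mirrors what basic.py does.
        redirect_urls = obj.meta.get('redirect_urls')
        if redirect_urls:
            obj._url = redirect_urls[0]
```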
This is happening in every case, not just for redirects. To test, I wrote a simple spider with the sqlalchemy backend:
```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # Deliberately fail, so the frontier should record an error state.
        raise Exception("Test Exception")
```
If I run this, the error caused by the exception never gets recorded in the database.
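For reference, this is roughly how the stored state can be inspected; a rough snippet assuming a SQLite database file and a `metadata` table with `url`, `state`, and `error` columns (the actual file name and schema depend on your settings and Frontera version):

```python
import sqlite3

# 'frontera.db' and the 'metadata' table layout are assumptions here;
# check SQLALCHEMYBACKEND_ENGINE and your models for the real names.
conn = sqlite3.connect('frontera.db')
for url, state, error in conn.execute('SELECT url, state, error FROM metadata'):
    print(url, state, error)
conn.close()
```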
This is again a good finding, @RajatGoyal! We could solve that by handling exceptions in a spider middleware and propagating them to the backend. If you could make a PR, that would be awesome!
Any idea how to propagate it to the backend? We can't get the manager from the spider middleware.
We need to adapt the interfaces in `FronteraManagerWrapper`, `FronteraManager`, and `Backend`. I think we need to propagate the event type (an error happened during response processing) and the response itself, along with the error structure.
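A rough sketch of the Scrapy-side half, assuming the middleware can be handed the frontier manager and that a `page_error()` method gets added down the chain (both are hypothetical; settling those names is exactly what the PR would do):

```python
class FrontierExceptionMiddleware(object):
    # Hypothetical spider middleware; `frontier` and `page_error` are
    # names to be defined by the PR, not existing Frontera API.

    def __init__(self, frontier):
        self.frontier = frontier

    def process_spider_exception(self, response, exception, spider):
        # Scrapy invokes this hook when a callback such as parse()
        # raises. Propagate the failed response and the error through
        # FronteraManagerWrapper -> FronteraManager -> Backend so the
        # record can leave the QUEUED state.
        self.frontier.page_error(response, exception)
        return []  # produce no further results for this response
```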
I have fixed this, but I don't have write permission to push a branch; it gives a 403 response.
Can you fork Frontera to your own account and push your local branch there?
@sibiryakov Take a look at the above branch.
Hi, if there is an exception during response parsing in Scrapy, the request remains marked as `QUEUED` and no error is logged on the frontier.