scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

exception during scrapy callback marked as queued #63

Open RajatGoyal opened 9 years ago

RajatGoyal commented 9 years ago

Hi, if any exception occurs while parsing a response in Scrapy, the request remains marked as QUEUED and no error is logged in the frontier.

sibiryakov commented 9 years ago

Good finding, actually. This could happen because of redirects. When a redirect happens, Frontera gets a response object with the last (already redirected) URL and cannot match it with the record in the database. Therefore, it creates a new record and marks it as CRAWLED, while the old one remains QUEUED.

There is a canonical solver mechanism that should return the canonical URL: https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/canonicalsolvers/basic.py

Depending on that, we could also mark the old record as CRAWLED. But that needs to be coded; a PR is welcome as usual.
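To make the redirect case concrete, here is a minimal sketch of the idea using a toy in-memory state table. The `page_crawled` function, the `states` dict, and the `redirect_chain` parameter are illustrative assumptions, not Frontera's actual storage or API. In a real crawl the chain could plausibly be taken from `request.meta['redirect_urls']`, which Scrapy's RedirectMiddleware fills in.

```python
# Toy model of frontier state; NOT Frontera's actual storage or API.
QUEUED, CRAWLED = "QUEUED", "CRAWLED"


def page_crawled(states, response_url, redirect_chain=()):
    """Mark the final URL as CRAWLED, plus any queued URLs that redirected to it."""
    states[response_url] = CRAWLED
    for url in redirect_chain:
        if states.get(url) == QUEUED:
            states[url] = CRAWLED
    return states


# Without redirect information, the originally queued record stays QUEUED.
naive = page_crawled({"http://example.com/": QUEUED}, "http://www.example.com/")

# With the redirect chain, both records end up CRAWLED.
fixed = page_crawled(
    {"http://example.com/": QUEUED},
    "http://www.example.com/",
    redirect_chain=["http://example.com/"],
)
```

The first call reproduces the symptom described above; the second shows the old record being closed out once the frontier knows which URLs led to the final response.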

RajatGoyal commented 9 years ago

This happens in every case, not just for redirects. To test, I wrote a simple spider with the SQLAlchemy backend:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        # Simulate a failure during response parsing.
        raise Exception("Test Exception")

If I run this, the error caused by the exception is not recorded in the database.

sibiryakov commented 9 years ago

This is again a good finding, @RajatGoyal! We could solve that by handling exceptions in a spider middleware and propagating them to the backend. If you could make a PR, that would be awesome!
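A rough sketch of the middleware side, assuming the error is handed to some callback: `process_spider_exception` is the hook Scrapy calls on spider middlewares when a callback raises, while the `ParseErrorMiddleware` class and its `on_error` parameter are hypothetical names used only for illustration. The response is faked with `SimpleNamespace` so the snippet runs without Scrapy installed.

```python
from types import SimpleNamespace


class ParseErrorMiddleware:
    """Spider middleware sketch that reports callback exceptions."""

    def __init__(self, on_error):
        # on_error is a hypothetical hook standing in for the frontier backend.
        self.on_error = on_error

    def process_spider_exception(self, response, exception, spider):
        # Scrapy calls this when a spider callback raises an exception.
        self.on_error(response.url, str(exception))
        # Returning None lets Scrapy continue its normal exception handling.
        return None


# Usage with a stand-in response object instead of a real scrapy Response.
errors = []
middleware = ParseErrorMiddleware(lambda url, err: errors.append((url, err)))
result = middleware.process_spider_exception(
    SimpleNamespace(url="http://www.example.com/"),
    Exception("Test Exception"),
    spider=None,
)
```

This only shows where the exception can be intercepted; how the callback reaches the backend is a separate question.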

RajatGoyal commented 9 years ago

Any idea how to propagate it to the backend? We can't get the manager from the spider middleware.

sibiryakov commented 9 years ago

We need to adapt the interfaces in FronteraManagerWrapper, FronteraManager and Backend. I think we need to propagate the error type (an error that happened during response processing) and the response itself, along with the error structure.
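One way to picture that interface change is a call passed down through three layers. The classes below are toy stand-ins, not the real FronteraManagerWrapper, FronteraManager or Backend, and the `response_error` method name, its arguments, and the `"response_processing"` type label are all assumptions made for this sketch.

```python
from types import SimpleNamespace


class ToyBackend:
    """Stand-in for a backend; stores url -> (state, error type, error)."""

    def __init__(self):
        self.records = {}

    def response_error(self, url, error_type, error):
        # Hypothetical method: record why processing the response failed.
        self.records[url] = ("ERROR", error_type, error)


class ToyManager:
    """Stand-in for the manager; forwards the error to the backend."""

    def __init__(self, backend):
        self.backend = backend

    def response_error(self, url, error_type, error):
        self.backend.response_error(url, error_type, error)


class ToyManagerWrapper:
    """Stand-in for the Scrapy-facing wrapper; converts Scrapy objects."""

    def __init__(self, manager):
        self.manager = manager

    def response_error(self, response, exception):
        # Propagate the error type, the response URL, and the error details.
        self.manager.response_error(
            response.url, "response_processing", str(exception)
        )


backend = ToyBackend()
wrapper = ToyManagerWrapper(ToyManager(backend))
wrapper.response_error(
    SimpleNamespace(url="http://www.example.com/"), Exception("Test Exception")
)
```

With a chain like this, the failed request would no longer be left as QUEUED, because the backend receives an explicit error record for it.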

RajatGoyal commented 9 years ago

I have fixed this, but I don't have write permission to push a branch; the push returns a 403 response.

sibiryakov commented 9 years ago

Can you fork Frontera to your own account and push your local branch there?

RajatGoyal commented 9 years ago

@sibiryakov Take a look at the above branch.