scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

setting to switch off exception when encountering same url fingerprint #66

Open RajatGoyal opened 9 years ago

RajatGoyal commented 9 years ago

I am trying to run multiple spiders with the RDBMS backend. The spiders may find a URL that was already visited by another spider, and in that case Frontera raises an exception. Is this the expected behaviour?

I don't want an error to be thrown when the same URL is encountered; I just want it not to be crawled again, without any exception being raised.

Error:

    sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back due to a
    previous exception during flush. To begin a new transaction with this Session, first issue
    Session.rollback(). Original exception was: (raised as a result of Query-invoked autoflush;
    consider using a session.no_autoflush block if this flush is occurring prematurely)
    (psycopg2.IntegrityError) duplicate key value violates unique constraint "play_store_pkey"
    DETAIL: Key (fingerprint)=(f259306fa30657ab28ffa1c322d843d0cdceee41) already exists.
    [SQL: 'INSERT INTO play_store (url, fingerprint, depth, created_at, status_code, state, error, meta, headers, cookies, method, body) VALUES (%(url)s, %(fingerprint)s, %(depth)s, %(created_at)s, %(status_code)s, %(state)s, %(error)s, %(meta)s, %(headers)s, %(cookies)s, %(method)s, %(body)s)']
    [parameters: {'body': None, 'cookies': <psycopg2.extensions.Binary object at 0x7f3e8a02c5a8>, 'url': 'https://play.google.com/store/apps/details?id=com.dic_o.dico_eng_fra', 'status_code': None, 'created_at': '20150909151342577212', 'error': None, 'state': 'NOT CRAWLED', 'headers': <psycopg2.extensions.Binary object at 0x7f3e89d6c238>, 'depth': 6, 'meta': <psycopg2.extensions.Binary object at 0x7f3e89d6c440>, 'fingerprint': 'f259306fa30657ab28ffa1c322d843d0cdceee41', 'method': 'GET'}]
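For illustration, a minimal sketch of the behaviour being asked for: catch the duplicate-key failure, roll back, and skip the URL instead of propagating the exception. Here `session` is a SQLAlchemy session and `page` a mapped row object; both are placeholder names, not Frontera internals.

    from sqlalchemy.exc import IntegrityError

    def add_page_if_new(session, page):
        """Store a page; return False instead of raising if its fingerprint already exists."""
        try:
            session.add(page)
            session.commit()
            return True           # new URL, schedule it for crawling
        except IntegrityError:
            session.rollback()    # another spider stored this fingerprint first
            return False          # known URL, do not crawl it again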

sibiryakov commented 9 years ago

@RajatGoyal are you trying to do that in the same process, or in a few different processes using the same database? Actually, Frontera wasn't tested in either of these configurations and isn't expected to work there.

sibiryakov commented 9 years ago

@RajatGoyal Please tell me more about the overall problem you're trying to solve, so I can suggest a better architecture.

RajatGoyal commented 9 years ago

@sibiryakov Thanks for the quick reply :) I am trying to do that with different processes and the same database. Basically, I am trying to crawl the Play Store by visiting the "similar apps" section of every app. Many apps share the same similar apps, so the processes keep encountering pages that another process has already seen, and the process that encounters the link second raises the exception.

sibiryakov commented 9 years ago

SQLiteBackend isn't designed for parallel access from different processes. During intensive writes there's a high probability that a writing process will have an outdated state and, based on that, will try to insert a row instead of updating it. I would recommend staying within one process and tuning Scrapy delays and auto-throttling to get the necessary request rate.

If this isn't a suitable solution in your case, let me know and I'll try to figure out something else.
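A minimal sketch of the kind of single-process tuning meant here, using standard Scrapy settings; the values are illustrative only and would need to be adjusted to the desired request rate.

    # settings.py (illustrative values)
    CONCURRENT_REQUESTS = 16         # parallelism inside the single process
    DOWNLOAD_DELAY = 0.25            # base delay between requests
    AUTOTHROTTLE_ENABLED = True      # adapt the delay to observed latencies
    AUTOTHROTTLE_START_DELAY = 1.0
    AUTOTHROTTLE_MAX_DELAY = 10.0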

RajatGoyal commented 9 years ago

I am not using SQLite, I am using Postgres. The two changes I have made to keep the database consistent and to reduce write contention between parallel processes are:

  1. Locked the selected rows while fetching the next requests:

    def get_next_requests(self, max_next_requests, **kwargs):
        query = self.page_model.query(self.session).with_lockmode('update')
        query = query.filter(self.page_model.state == PageMixin.State.NOT_CRAWLED)
        # ... rest of the method unchanged
  2. Introduced a new variable UPDATE_STATUS_AFTER so that commits happen in batches:
    def page_crawled(self, response, links):
        db_page, _ = self._get_or_create_db_page(response)
        db_page.state = PageMixin.State.CRAWLED
        db_page.status_code = response.status_code
        self.pages_crawled_in_current_batch += 1
        for link in links:
            db_page_from_link, created = self._get_or_create_db_page(link)
            if created:
                db_page_from_link.depth = db_page.depth + 1
            self.pages_crawled_in_current_batch += 1

        if self.pages_crawled_in_current_batch and \
                self.pages_crawled_in_current_batch > UPDATE_STATUS_AFTER:
            self.session.commit()
            self.pages_crawled_in_current_batch = 0

What do you think about these changes? I don't think one process would be enough for me, and I still haven't been able to solve the above-mentioned error.
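A note on change 1: with_lockmode('update') was later deprecated in SQLAlchemy (0.9+) in favour of with_for_update(); both issue a SELECT ... FOR UPDATE so that concurrent workers don't hand out the same NOT CRAWLED rows. A sketch of the equivalent query, with `session` and `PageModel` as placeholder names rather than Frontera internals:

    def lock_next_batch(session, PageModel, max_next_requests):
        # Lock a batch of not-yet-crawled rows so another process
        # cannot select the same ones until this transaction ends.
        return (session.query(PageModel)
                .with_for_update()                        # SELECT ... FOR UPDATE
                .filter(PageModel.state == 'NOT CRAWLED')
                .limit(max_next_requests)
                .all())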

sibiryakov commented 9 years ago

  1. @RajatGoyal Tell me about your desired RPS (you can do it privately, or by jabber).
  2. _get_or_create_db_page(response) seems to be the hot spot. When several processes first query for the presence of a row and, depending on the answer, try to insert a new record, one of them will do it first and the rest will get a duplicate-record exception. I think UPDATE_STATUS_AFTER will not help in this situation.
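One common way to make that hot spot race-tolerant is to attempt the insert inside a savepoint and, if another process wins the race, fall back to selecting the row it just created. A minimal sketch with placeholder names (`session`, `PageModel`), not Frontera's actual _get_or_create_db_page:

    from sqlalchemy.exc import IntegrityError

    def get_or_create_page(session, PageModel, fingerprint, **fields):
        page = session.query(PageModel).filter_by(fingerprint=fingerprint).first()
        if page is not None:
            return page, False                 # already known
        try:
            with session.begin_nested():       # SAVEPOINT around the insert
                page = PageModel(fingerprint=fingerprint, **fields)
                session.add(page)
            return page, True                  # this process inserted it first
        except IntegrityError:
            # a concurrent process inserted the same fingerprint; the savepoint
            # was rolled back, so the session is still usable -- re-read the row
            return session.query(PageModel).filter_by(fingerprint=fingerprint).one(), False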
RajatGoyal commented 9 years ago

  1. @sibiryakov What is your ID?
  2. Agreed.

I have tested my solution when the two processes don't come across the same URL; in that case it works perfectly with locking. But we still have to find a solution for this use case.
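A database-level alternative for the shared-URL case is to let Postgres silently ignore the duplicate insert. A sketch assuming PostgreSQL 9.5+ and SQLAlchemy 1.1+, with `page_table` standing in for the play_store table from the error above:

    from sqlalchemy.dialects.postgresql import insert

    def insert_page_ignoring_duplicates(session, page_table, values):
        # INSERT ... ON CONFLICT (fingerprint) DO NOTHING: a second process that
        # hits an existing fingerprint simply skips the row instead of raising.
        stmt = (insert(page_table)
                .values(**values)
                .on_conflict_do_nothing(index_elements=['fingerprint']))
        session.execute(stmt)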

sibiryakov commented 9 years ago

Try sibiryakov [!!] scrapinghub.com