rmax / scrapy-redis

Redis-based components for Scrapy.
http://scrapy-redis.readthedocs.io
MIT License
5.54k stars 1.59k forks

RFPDupeFilter: It seems this doesn't work #162

Closed forgeries closed 2 years ago

forgeries commented 4 years ago
    def close(self, reason=''):
        """Delete data on close. Called by Scrapy's scheduler.

        Parameters
        ----------
        reason : str, optional

        """
        self.clear()

    def clear(self):
        """Clears fingerprints data."""
        self.server.delete(self.key)

How can I delete the duplicate-filter data from Redis once the crawler task is completely finished?

rmax commented 4 years ago

You can delete the data directly using the redis CLI.
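For example, assuming the default scrapy-redis dupefilter key format (`<spider>:dupefilter`) and a hypothetical spider named `myspider`:

```shell
# Delete the dupefilter fingerprint set for the "myspider" spider.
# The key name is the scrapy-redis default; adjust it if you changed
# SCHEDULER_DUPEFILTER_KEY in your settings.
redis-cli del myspider:dupefilter
```

`DEL` returns the number of keys removed, so `1` means the fingerprints were cleared and `0` means the key did not exist.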

LuckyPigeon commented 2 years ago

Close this as solved

rpocase commented 2 years ago

For a more automatic solution, you could use a Scrapy extension. Unfortunately, I can't share our exact solution, but it's likely a generic extension could be implemented and provided by the base library.

At a high level:
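A minimal, dependency-free sketch of what the core of such an extension could look like (the class and key template below are hypothetical; in a real Scrapy extension you would connect `spider_closed` to `signals.spider_closed` inside `from_crawler` and build the Redis client from the crawler settings):

```python
class DupeFilterCleaner:
    """Sketch of an extension that deletes the scrapy-redis
    dupefilter key when a spider closes.

    `server` is any redis.Redis-like client exposing `delete()`.
    The default key template matches scrapy-redis's
    "%(spider)s:dupefilter" convention.
    """

    def __init__(self, server, key_template="%(spider)s:dupefilter"):
        self.server = server
        self.key_template = key_template

    def spider_closed(self, spider, reason="finished"):
        # Build the concrete key for this spider and drop it,
        # so the next run starts with a clean fingerprint set.
        key = self.key_template % {"spider": spider.name}
        self.server.delete(key)
        return key
```

Hooking this into Scrapy would then just be a matter of registering the extension in `EXTENSIONS` and wiring the signal in `from_crawler`.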