scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License
53.24k stars 10.57k forks

Base classes for the item pipeline and middleware #2633

Open jorenham opened 7 years ago

jorenham commented 7 years ago

In the pipeline docs it says:

Each item pipeline component is a Python class that must implement the following method

I believe it would be easier for users, and better for documentation, to provide an abstract base class, e.g. ItemPipeline, that raises a NotImplementedError if the user forgets to implement the required methods. The class could also declare the optional methods, so users can see directly from the code which methods are supported.

This could also be applied to the downloader middleware, spider middleware and extensions.
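A minimal sketch of what such a base class might look like. The class name ItemPipeline comes from the proposal above; it is hypothetical and not part of Scrapy's API. The example subclasses and the choice of which hooks to list are assumptions for illustration only.

```python
class ItemPipeline:
    """Hypothetical base class sketching the proposal; not part of Scrapy."""

    def process_item(self, item, spider):
        # Fails loudly if a subclass forgets to override this method.
        raise NotImplementedError(
            f"{type(self).__name__} must implement process_item()"
        )

    # Optional hooks, declared as no-ops so the supported methods
    # are visible directly in the code.
    def open_spider(self, spider):
        pass

    def close_spider(self, spider):
        pass


class PricePipeline(ItemPipeline):
    """Example subclass (hypothetical) that rounds a price field."""

    def process_item(self, item, spider):
        item["price"] = round(item["price"], 2)
        return item


class BrokenPipeline(ItemPipeline):
    """Example subclass (hypothetical) that forgot process_item."""
```

With this sketch, calling `BrokenPipeline().process_item(...)` raises NotImplementedError, while PricePipeline works as usual.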

jorenham commented 7 years ago

If this feature would be appreciated, I'll get started on a pull request.

kmike commented 7 years ago

Hey @jorenham,

There are no required methods in Scrapy pipelines, middlewares and extensions; all methods are optional. So the tricky part is figuring out what these base classes should contain. We also need to make sure that pipelines/middlewares/extensions which are not subclasses of these base classes still work.
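To illustrate that constraint, here is a simplified sketch (not Scrapy's actual implementation) of how a manager could discover optional hooks by duck typing, so that plain classes keep working with or without a base class. The names `collect_hooks`, `PIPELINE_HOOKS` and `PlainPipeline` are hypothetical.

```python
# Hypothetical list of hook names a pipeline may define; all are optional.
PIPELINE_HOOKS = ("open_spider", "close_spider", "process_item")


def collect_hooks(component, hook_names=PIPELINE_HOOKS):
    """Return a dict of the hooks that `component` actually defines."""
    hooks = {}
    for name in hook_names:
        method = getattr(component, name, None)
        if callable(method):
            hooks[name] = method
    return hooks


class PlainPipeline:
    """Does not inherit from any base class and defines only one hook."""

    def process_item(self, item, spider):
        return item


hooks = collect_hooks(PlainPipeline())
print(sorted(hooks))
```

Under this kind of lookup, a base class that defines no-op versions of every hook would make every hook look implemented, which is one reason the contents of such base classes need care.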

The current solution to the documentation problem is to generate extension/middleware/pipeline templates with all these methods; each new Scrapy project has them.

But I also find myself looking at the docs each time I'm implementing a middleware or a pipeline, so I feel the pain. Also, sometimes you have a middlewares.py file with a middleware and it is not immediately clear whether it is a downloader or a spider middleware; inheriting from a certain class could help with readability.

jorenham commented 7 years ago

As @eLRuLL mentioned in #2657, it would be a nice feature to have the base classes of spider/downloader middlewares and extensions self-register in the settings, with a priority defined in the subclass. I see two ways of implementing this:

  1. Create a method in the base classes that registers the component in the default settings (e.g. def register(self, priority)). This method should in turn be called from the subclasses.
  2. Metaclass hacking: register in the default settings once the base classes get subclassed, with priority as an attribute of the subclasses.
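The two options can be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed Scrapy API: `REGISTRY` stands in for the default settings, `DownloaderMiddleware` and the two subclasses are hypothetical, `register` is written as a classmethod rather than an instance method, and option 2 uses `__init_subclass__` (Python 3.6+) instead of a full metaclass.

```python
# Hypothetical stand-in for the default settings: maps class -> priority.
REGISTRY = {}


class DownloaderMiddleware:
    """Hypothetical base class; not part of Scrapy."""

    priority = None  # subclasses may set this to opt in to option 2

    @classmethod
    def register(cls, priority):
        # Option 1: explicit registration, called by or for the subclass.
        REGISTRY[cls] = priority

    def __init_subclass__(cls, **kwargs):
        # Option 2: automatic registration when a subclass is defined
        # and declares a priority attribute.
        super().__init_subclass__(**kwargs)
        if cls.priority is not None:
            cls.register(cls.priority)


class ManualMiddleware(DownloaderMiddleware):
    pass  # no priority attribute, so it is not registered automatically


class AutoMiddleware(DownloaderMiddleware):
    priority = 543  # registered as soon as the class body is executed


# Option 1 in use: the explicit call.
ManualMiddleware.register(600)
```

Option 1 keeps registration visible at the call site; option 2 removes the call but makes importing a module enough to change the registry.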

What would be the best solution in your opinion? @eLRuLL @kmike

djunzu commented 7 years ago

@jorenham, I would suggest handling this self-registration in a separate issue/PR.

xPi2 commented 4 years ago

Hi, why is this still open? No decision taken?