scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Add item_processors feature to Scrapy FeedExporter extension #5905

Open VMRuiz opened 1 year ago

VMRuiz commented 1 year ago

Summary

Create a new feature for the Scrapy FeedExporter extension that allows adding methods to modify the content of items right before they are exported into the feeds. This would make it possible to export a single item as multiple entries, independently for each feed, and to modify items depending on the specific feed.

Motivation

The schema of the crawled data may differ from the schema used in the exported data, and users may need to normalize certain attributes before exporting them to certain feeds. For example, a single crawled item may contain attributes that need to be normalized before being exported to CSV, such as a list of prices or variants. In such cases, a flexible and customizable way to modify items right before they are exported into the feeds would be very useful, letting users tailor the exported data to their specific use cases. The proposed feature would address this need and give users more control over the exported data, without changing core Scrapy logic or their items' schemas.

Describe alternatives you've considered

Currently, there are several alternatives to achieve this functionality, but all of them have certain disadvantages:

If you use a spider middleware to convert one item into multiple items, all of them go through the item pipelines individually. This requires altering the schema if Spidermon schema validation is used, and it breaks the per-feed customization requirement.

If you use item pipelines, you can avoid the schema issue, as long as the pipeline runs after the schema validation one. However, you cannot convert a single item into multiple entries in the feed. Even if we extended the pipelines, it would still be difficult to redirect each item to its corresponding feed. Finally, multiple item_scraped signals would be emitted for each item, which may have unexpected side effects.

You could also implement this logic in a custom BaseItemExporter, but that would require reimplementing it for every single format, breaking the DRY principle.

Additional context

The proposed feature would add a new setting named item_processors to each feed. This setting would let users define a list of chained functions, similar to how ItemLoader processors work. Each function takes a list of n items and returns a list of m items. The resulting list of items is then passed to the feed's ItemFilter and ItemExporter and processed following the current workflow. This approach would let users export a single item as multiple items and enable or disable the functionality per feed.
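As an illustrative sketch only (item_processors is the proposed setting, not an existing Scrapy option, and split_variants is a hypothetical processor, assuming dict items for simplicity), the per-feed configuration could look like this:

# Hypothetical processor: takes a list of items, returns a list of items.
def split_variants(items):
    # Flatten each item's "variants" list into one entry per variant.
    result = []
    for item in items:
        for variant in item.get("variants") or [None]:
            flat = dict(item)
            flat.pop("variants", None)
            if variant:
                flat.update(variant)
            result.append(flat)
    return result

FEEDS = {
    "variants.csv": {
        "format": "csv",
        # Proposed per-feed setting: processors applied right before the
        # items reach this feed's ItemFilter and ItemExporter.
        "item_processors": [split_variants],
    },
    "items.jl": {
        # This feed would receive the original, unprocessed items.
        "format": "jsonlines",
    },
}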

Implementation Details

To implement this feature, we can redirect the item_scraped signal to a new item_processors method instead of the current item_scraped handler. This can be achieved by connecting the Scrapy signal to the new method using crawler.signals.connect():

# Redirect the item_scraped signal to the new item_processors method
crawler.signals.connect(exporter.item_processors, signals.item_scraped)

Next, we need to define the item_processors method. This method will take an item and a spider as input, run the item through the item_processors pipeline, and call the item_scraped method for each resulting item. Here's a possible implementation:

def item_processors(self, item, spider):
    # Run the item through the configured item_processors methods
    exported_items = self._run_processors([item], *self.item_processors_methods)

    # Call the existing item_scraped method with each resulting item
    for exported_item in exported_items:
        self.item_scraped(exported_item, spider)

The item_processors method first runs the item through the item_processors pipeline using the _run_processors method (to be implemented). This method takes a list of items and a list of processing functions, applies each function to the items in sequence, and returns the resulting list of items.
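For reference, a minimal sketch of what _run_processors could look like, assuming each processor is a callable that maps a list of items to a list of items:

def _run_processors(self, items, *processors):
    # Apply each processor in turn; every processor receives the full list
    # of items produced by the previous one and returns a new list.
    for processor in processors:
        items = processor(items)
    return items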

After running the item through the item_processors pipeline, the item_processors method calls the item_scraped method with each new item. This ensures that each item is processed and exported according to the current feed's configuration.

Gallaecio commented 1 year ago

I like the idea. I do wonder if:

VMRuiz commented 1 year ago

We should call these (feed) item pipelines, rather than item processors

I'm good with this as long as the overlap doesn't cause any confusion.

It would make sense to make regular item pipelines compatible with them, even if we do not do the same the other way around (since one of the points of these pipelines is to support splitting items, which regular item pipelines cannot do). In other words, make the API of feed item pipelines a superset of the API of item pipelines that also supports splitting an item.

We could even make both APIs the same, and allow returning multiple items from an Item Pipeline as well. That way, each one can process items wherever it makes the most sense.
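Purely as an illustration of that idea (this is not an existing Scrapy API), a unified contract could let process_item return either a single item or a list of items, assuming dict items:

class SplitVariantsPipeline:
    # Hypothetical pipeline under the unified API discussed above.
    def process_item(self, item, spider):
        variants = item.get("variants")
        if not variants:
            return item
        # Returning a list would split one scraped item into several entries.
        return [{**item, "variants": None, **variant} for variant in variants]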

eloidieme commented 9 months ago

Hello,

We're a team of software engineering students at KTH Royal Institute of Technology in Sweden. Our current assignment involves contributing to open-source projects, and we've chosen to tackle this issue.

We have one question before we proceed: Is this feature still requested and relevant at present?

Thank you for your time and consideration.

Gallaecio commented 9 months ago

It is still relevant indeed.