VMRuiz opened 1 year ago
I like the idea. I do wonder if:
We should call these (feed) item pipelines, rather than item processors
I'm good with this as long as the overlap doesn't cause any confusion.
It would make sense to make regular item pipelines compatible with them, even if we do not do the same the other way around (since one of the points of these pipelines is to support splitting items, which regular item pipelines cannot do). In other words, make the API of feed item pipelines a superset of the API of item pipelines that also supports splitting an item.
We could even make both APIs the same and also allow returning multiple items from an Item Pipeline, so everyone can process their items wherever it makes more sense for them.
Hello,
We're a team of software engineering students at KTH Royal Institute of Technology in Sweden. Our current assignment involves contributing to open-source projects, and we've chosen to tackle this issue.
We have one question before we proceed: is this feature still relevant and wanted?
Thank you for your time and consideration.
It is still relevant indeed.
Summary
Create a new feature for the Scrapy FeedExporter extension that allows adding methods to modify the content of items right before they are exported to the feeds. This would make it possible to export a single item as multiple entries, independently for each feed, and to modify the item depending on the specific feed.
Motivation
The schema of the crawled data may differ from the schema used in the exported data, and users may need to normalize certain attributes before exporting them to specific feeds. For example, a single crawled item may contain multiple attributes that need to be normalized before being exported to CSV, such as a list of prices or variants. In such cases, a flexible and customizable way to modify items right before they are exported to the feeds would be very useful, letting users tailor the exported data to their specific use cases. The proposed feature would address this need and give users more control over the exported data, without changing core Scrapy logic or forcing them to alter their items' schemas.
Describe alternatives you've considered
Currently, there are several alternatives to achieve this functionality, but all of them have certain disadvantages:
- If you use a Spider Middleware to convert one item into multiple items, all of them go through the item pipelines individually. This requires altering the schema if Spidermon schema validation is used, and it breaks the per-feed customization requirement (see the sketch after this list).
- If you use Item Pipelines, you can avoid the schema issue, as long as the pipeline runs after the schema validation one. However, you cannot convert a single item into multiple entries in the feed. Even if we extended the pipelines, it would still be difficult to redirect each item to its corresponding feed. Finally, we would be emitting multiple item_scraped signals for a single scraped item, which may have unexpected side effects.
- You could also use a custom BaseItemExporter to implement this logic, but that would require reimplementing it for every single format, breaking the DRY principle.
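For illustration, here is a minimal sketch of that Spider Middleware approach (the middleware name and the "variants" field are made up for this example):

```python
class SplitVariantsMiddleware:
    """Spider middleware that turns one scraped item into one item per variant."""

    def process_spider_output(self, response, result, spider):
        for obj in result:
            # Split plain-dict items that carry a "variants" list; requests
            # and other objects pass through untouched.
            if isinstance(obj, dict) and obj.get("variants"):
                for variant in obj["variants"]:
                    item = {k: v for k, v in obj.items() if k != "variants"}
                    item["variant"] = variant
                    yield item
            else:
                yield obj
```

Each yielded item fires its own item_scraped signal and is validated and exported separately, with no way to customize the result per feed, which is exactly the drawback described above.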
Additional context
The proposed feature would create a new setting named item_processors for each feed. This setting would allow users to define a list of chained functions, similar to how the ItemLoader processors work. Each function takes a list of n items and returns a list of m items. The resulting list of items is then passed to the ItemFilter and ItemExporter of the feed and processed following the current workflow. This approach would enable users to export a single item as multiple items and to enable or disable this functionality at the feed level.
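For illustration, a hypothetical FEEDS configuration using the proposed item_processors key might look like this (the key itself and the processor functions are assumptions of this proposal, not an existing Scrapy API):

```python
def split_variants(items):
    """Expand each item into one item per variant (1 item in, n items out)."""
    result = []
    for item in items:
        for variant in item.get("variants") or [None]:
            new_item = {k: v for k, v in item.items() if k != "variants"}
            new_item["variant"] = variant
            result.append(new_item)
    return result


def normalize_prices(items):
    """Coerce price strings such as "$1,299.00" into floats."""
    for item in items:
        price = item.get("price")
        if isinstance(price, str):
            item["price"] = float(price.replace("$", "").replace(",", ""))
    return items


FEEDS = {
    "products.csv": {
        "format": "csv",
        # Proposed per-feed setting: processors chained right before export.
        "item_processors": [split_variants, normalize_prices],
    },
    "products.jl": {
        "format": "jsonlines",
        # No processors: this feed receives the original items untouched.
    },
}
```

Here the CSV feed would get one flat, normalized row per variant, while the JSON Lines feed keeps the original nested items.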
Implementation Details
To implement this feature, we can redirect the current item_scraped signal from its existing handler to the new item_processors method. This can be achieved by connecting the Scrapy signal to the new method using crawler.signals.connect():
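A minimal sketch of that redirection inside FeedExporter.from_crawler (simplified; the real method connects additional signals and validates settings, so treat this as illustrative rather than the actual source):

```python
from scrapy import signals


class FeedExporter:
    # ... existing FeedExporter attributes and methods ...

    @classmethod
    def from_crawler(cls, crawler):
        exporter = cls(crawler)
        crawler.signals.connect(exporter.open_spider, signals.spider_opened)
        crawler.signals.connect(exporter.close_spider, signals.spider_closed)
        # Previously exporter.item_scraped was connected here directly; the
        # proposal routes scraped items through item_processors instead.
        crawler.signals.connect(exporter.item_processors, signals.item_scraped)
        return exporter
```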
Next, we need to define the item_processors method. This method will take an item and a spider as input, run the item through the item_processors pipeline, and call the item_scraped method for each resulting item. Here's a possible implementation:
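A possible shape for that method (how the list of processor functions is resolved per feed is left open here; the processors attribute below is just a placeholder):

```python
def item_processors(self, item, spider):
    # Processor functions configured for the feeds; resolving them per feed
    # is an open design question, so a placeholder attribute is used here.
    processors = getattr(self, "processors", [])
    # _run_processors (to be implemented) chains the processors: each one
    # receives a list of items and returns a new, possibly longer, list.
    items = self._run_processors([item], processors)
    # Re-enter the existing export path once per resulting item so that
    # item filters and exporters keep working exactly as they do today.
    for processed_item in items:
        self.item_scraped(processed_item, spider)
```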
The item_processors method first runs the item through the item_processors pipeline using the _run_processors method (to be implemented). This method takes a list of items and a list of processing functions, and applies each function to the items in sequence. The resulting list of items is then returned. After running the item through the item_processors pipeline, the item_processors method calls the item_scraped method with each new item. This ensures that each item is processed and exported according to the current feed's configuration.