scrapinghub / spidermon

Scrapy extension for monitoring the execution of spiders.
https://spidermon.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Use itemadapter #353

Closed kmike closed 1 year ago

kmike commented 2 years ago

I've seen some code in projects, with a pipeline like this:

# The pipeline needed to make Spidermon and attr items work nicely.
from itemadapter import ItemAdapter

class ItemToDictTransformPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        return adapter.asdict()

This pipeline is not a proper solution, because it discards information about the fields defined on the item, as well as field metadata. As a result, exporting could work differently, e.g. the CSV header may be set incorrectly.
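As a concrete illustration of what gets lost (a minimal stdlib sketch, using a dataclass in place of a Scrapy/attrs item; the `Product` item and its `serializer` metadata are hypothetical): once the item is flattened to a plain dict, the field metadata that exporters could otherwise consult is no longer reachable from the object that travels down the pipeline.

```python
from dataclasses import dataclass, field, fields, asdict
from typing import Optional

# A dataclass item with field metadata, standing in for a Scrapy/attrs item.
@dataclass
class Product:
    name: Optional[str] = None
    price: Optional[float] = field(
        default=None, metadata={"serializer": float}
    )

item = Product(name="widget")

# Flattening to a plain dict keeps only the values; the metadata attached
# to "price" is gone, so an exporter receiving this dict cannot consult it.
as_dict = asdict(item)
print(sorted(as_dict))            # ['name', 'price']

# Keeping the item object preserves the declared fields and their metadata.
declared = {f.name: f.metadata for f in fields(item)}
print(dict(declared["price"]))    # {'serializer': <class 'float'>}
```

The same applies to attrs and Scrapy `Item` classes: the metadata lives on the item class, not in the values, so converting early throws it away for every downstream component.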

I haven't checked why it's needed, but currently spidermon uses some scary code to convert items to dicts: https://github.com/scrapinghub/spidermon/blob/1838ec9e998c24fa64de38e50d5338ab4ce2b303/spidermon/contrib/scrapy/pipelines.py#L138

It seems it can be simplified, sped up and made more robust by using https://github.com/scrapy/itemadapter.

Other methods also don't seem to support all the item types that Scrapy supports, e.g. _add_errors_to_item only supports Item and dict. Again, this can be simplified and made more correct by using itemadapter.
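A hedged sketch of what an adapter-based `_add_errors_to_item` could look like. The `errors_field` name and the error format are illustrative, and `MiniAdapter` is a tiny stdlib stand-in for `itemadapter.ItemAdapter` (which additionally supports attrs and pydantic items) so the snippet is self-contained:

```python
from dataclasses import is_dataclass
from typing import Any, Dict, List


class MiniAdapter:
    """Tiny stand-in for itemadapter.ItemAdapter: uniform get/set access
    for dict items and dataclass items."""

    def __init__(self, item: Any):
        if not (isinstance(item, dict) or is_dataclass(item)):
            raise TypeError(f"Unsupported item type: {type(item)}")
        self.item = item

    def get(self, key: str, default: Any = None) -> Any:
        if isinstance(self.item, dict):
            return self.item.get(key, default)
        return getattr(self.item, key, default)

    def __setitem__(self, key: str, value: Any) -> None:
        if isinstance(self.item, dict):
            self.item[key] = value
        else:
            setattr(self.item, key, value)


def add_errors_to_item(item: Any, errors: Dict[str, List[str]],
                       errors_field: str = "_validation") -> Any:
    # One code path for every supported item type, instead of separate
    # isinstance(item, Item) / isinstance(item, dict) branches.
    adapter = MiniAdapter(item)
    existing = adapter.get(errors_field) or {}
    existing.update(errors)
    adapter[errors_field] = existing
    return item


item = {"name": "widget"}
add_errors_to_item(item, {"price": ["missing required field"]})
print(item["_validation"])   # {'price': ['missing required field']}
```

With the real `ItemAdapter`, the type dispatch disappears entirely: the adapter decides how to read and write fields, and the same function works for any item type Scrapy accepts.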

I think it makes sense to review the whole contrib.scrapy package; there are other issues like this, e.g. https://github.com/scrapinghub/spidermon/blob/1838ec9e998c24fa64de38e50d5338ab4ce2b303/spidermon/contrib/scrapy/extensions.py#L135

VMRuiz commented 1 year ago

@kmike Can we close this?

Gallaecio commented 1 year ago

Makes sense after https://github.com/scrapinghub/spidermon/pull/358