scrapinghub / spidermon

Scrapy Extension for monitoring spiders execution.
https://spidermon.readthedocs.io
BSD 3-Clause "New" or "Revised" License
526 stars 94 forks source link

Spidermon fails to add `_validation` field to the item. #379

Closed proway2 closed 3 months ago

proway2 commented 1 year ago

Ubuntu 22.04, Python 3.10.9, Spidermon 1.17.1

If an item defined with attrs library and this condition https://github.com/scrapinghub/spidermon/blob/master/spidermon/contrib/scrapy/pipelines.py#L140 is True, then it fails here https://github.com/scrapinghub/spidermon/blob/master/spidermon/contrib/scrapy/pipelines.py#L141 with this message:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
TypeError: 'ProductList' object does not support item assignment
VMRuiz commented 1 year ago

This is a limitation on the Item definition as it don't support adding new fields.

In theory, using slots=False should work according to the attr dosc, however it didn't work for me.

Alternatively, adding the field _validation to the class should remove the error, however, you will get _validation automatically populated to None on each item which is not the best solution either.

On my experience I had to just convert from Attrs class to dict before spidermon validaton.

proway2 commented 1 year ago

@VMRuiz I had a hope that with this PR we wouldn't need this conversion-to-dict pipeline anymore.

VMRuiz commented 1 year ago

That class is just an interface to simplify read and write access to the different types of Item classes. It doesn't add any additional functionality.

I don't know if we could overcharge it to support additional fields not defined on the original item class.

VMRuiz commented 1 year ago

Maybe we can start by capturing the exception and raising a warning instead.

cc. @Gallaecio @rennerocha

Gallaecio commented 1 year ago

Alternatively, adding the field _validation to the class should remove the error

I think this is the way to go. You either disable the creation of this _validation field, if Spidermon allows for that, or you add the field to your item class.

VMRuiz commented 1 year ago

@Gallaecio Shall we close this ticket as wont fix or similar?

Gallaecio commented 1 year ago

Well, I think there might be an action here even if we go with https://github.com/scrapinghub/spidermon/issues/379#issuecomment-1385349452.

VMRuiz commented 1 year ago

This can be turn off with SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS - https://spidermon.readthedocs.io/en/latest/item-validation.html#spidermon-validation-add-errors-to-items

We may expand the docs to explicitly state that if enabled, it will only work on items that have defined the _validation field or that support adding undefined fields, like dicts .

Edit: Additionally, SPIDERMON_VALIDATION_ERRORS_FIELD can be use to replace the _validation field with any other field name.

rennerocha commented 10 months ago

This problem was added by https://github.com/scrapinghub/spidermon/pull/358 and it is a backward incompatible change that we didn't inform our users that it was introduced. So we need to do something about it. Probably one new release that indicates that _validation field will only work if we are dealing with dict item.

rennerocha commented 10 months ago

We already have this situation in Scrapy related to files and files_urls field for the media pipelines (https://docs.scrapy.org/en/latest/topics/media-pipeline.html#usage-example). So adding the information that we need to specify _validation should be enough. However, I believe that we should put as a warning, not just in the middle of the text.