scrapinghub / spidermon

Scrapy Extension for monitoring spiders execution.
https://spidermon.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Unable to validate date and date-time with jsonschema #420

Open rennerocha opened 9 months ago

rennerocha commented 9 months ago

After https://github.com/scrapinghub/spidermon/pull/358, validating date fields with jsonschema no longer works as it did before. Until that change, Spidermon serialized date fields into strings (https://github.com/scrapinghub/spidermon/pull/358/files#diff-7937ac85a30630fe837b9c133f4459ee590680bb5dfce72775db6005f2b45f51L142) before passing the item to the jsonschema validators. Now that the serialization is gone, the date and date-time format checkers (https://python-jsonschema.readthedocs.io/en/stable/validate/#validating-formats) do not work as expected when the item contains a datetime.date or a datetime.datetime instance.

Given the code:

import datetime
from itemadapter import ItemAdapter
from jsonschema import FormatChecker
from jsonschema.validators import validator_for
from spidermon.contrib.scrapy.pipelines import ItemValidationPipeline

format_checker = FormatChecker()

schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "date": {
            "description": "Date of the gazette",
            "type": "string",
            "format": "date"
        }
    },
    "required": [
        "date",
    ]
}

validator_cls = validator_for(schema)
validator = validator_cls(schema=schema, format_checker=format_checker)
original_data = {
    'date': datetime.date.today()
}

Validating with spidermon 1.20.0

>>> item_adapter = ItemAdapter(original_data)
>>> item_dict = item_adapter.asdict()
>>> errors = validator.iter_errors(item_dict)
>>> [error for error in errors]
[<ValidationError: "datetime.date(2023, 9, 19) is not of type 'string'">]

With spidermon 1.17.0

>>> data = ItemValidationPipeline._convert_item_to_dict(_, original_data)
>>> errors = validator.iter_errors(data)
>>> [error for error in errors]
[]

Validating with spidermon 1.20.0

>>> errors = validator.iter_errors(data)
>>> [error for error in errors]
[<ValidationError: "datetime.date(2023, 9, 19) is not of type 'string'">]

rennerocha commented 9 months ago

This change has the potential to break applications that rely on Spidermon to understand date and datetime values and validate them with jsonschema.

To make it work, the user needs to manually serialize the date and datetime values in the items. But I am trying to figure out whether there is some solution that could be implemented on the Spidermon side, to avoid this manipulation.
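
For reference, the manual serialization could look roughly like the sketch below. It builds a serialized copy of the item data for validation and leaves the original values untouched. The helper name serialize_dates is made up for illustration and is not part of Spidermon or jsonschema. One caveat: a naive datetime serializes without a UTC offset, so the resulting string may still be rejected by a strict date-time checker.

```python
import datetime


def serialize_dates(value):
    """Return a copy of value with date/datetime objects as ISO 8601 strings.

    Hypothetical helper, not part of Spidermon or jsonschema.
    """
    # datetime.datetime is a subclass of datetime.date, so one check covers both.
    if isinstance(value, datetime.date):
        return value.isoformat()
    if isinstance(value, dict):
        return {key: serialize_dates(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [serialize_dates(item) for item in value]
    return value


original_data = {"date": datetime.date(2023, 9, 19)}
validation_data = serialize_dates(original_data)
print(validation_data)  # {'date': '2023-09-19'}
```

The serialized copy (validation_data here) is what would be handed to validator.iter_errors, while original_data keeps its datetime.date value.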

cc @VMRuiz @Gallaecio

VMRuiz commented 2 months ago

Hey, sorry for getting back to you late on this. I'm not entirely sure if we should change anything here. If you want your field to be a string with a date format, you could scrape it that way or set up an item pipeline to automatically convert datetime objects into strings if that's easier for you.
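
A minimal sketch of such an item pipeline, assuming the items are plain dicts or scrapy.Item objects (both behave like mutable mappings); dataclass or attrs items would need to be wrapped with itemadapter.ItemAdapter instead. The class name DateToStringPipeline is hypothetical. Unlike validating a serialized copy, this changes the item itself, so the exported data will contain strings rather than date objects.

```python
import datetime


class DateToStringPipeline:
    """Hypothetical pipeline: convert date/datetime fields to ISO 8601 strings."""

    def process_item(self, item, spider):
        # Iterate over a snapshot so fields can be reassigned safely.
        for field, value in list(item.items()):
            # datetime.datetime is a subclass of datetime.date.
            if isinstance(value, datetime.date):
                item[field] = value.isoformat()
        return item


pipeline = DateToStringPipeline()
converted = pipeline.process_item(
    {"date": datetime.date(2023, 9, 19), "title": "example"}, spider=None
)
print(converted)  # {'date': '2023-09-19', 'title': 'example'}
```

In ITEM_PIPELINES it would need a lower order number than spidermon.contrib.scrapy.pipelines.ItemValidationPipeline so that the conversion runs before validation.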

I don't think Spidermon should make that decision for you by default. But I'm open to the idea of adding it as an opt-in feature where you can configure auto-casting methods for your fields. It could come in handy, especially when you want to validate with Jsonschema but still keep the original data types, like for binary RPC calls.

What do you think @Gallaecio @curita ?