scrapinghub / spidermon

Scrapy Extension for monitoring spiders execution.
https://spidermon.readthedocs.io
BSD 3-Clause "New" or "Revised" License
533 stars 97 forks source link

Option to make validation errors ERRORs in job #50

Closed andrewbaxter closed 2 years ago

andrewbaxter commented 6 years ago

At the moment validation errors can be added to the items in the _validation key, but there doesn't seem to be any way to make it more obvious that some errors are incorrect.

rennerocha commented 5 years ago

What do you mean by more obvious? Do you have some scenario and an idea on how to improve this?

andrewbaxter commented 5 years ago

Thanks, forgot about these tickets! Basically right now even if there are dozens of validation errors, the job will finish with "0 errors" in the scrapy cloud page. In practice this makes it easy to think a job was okay when it actually had lots of data problems.

raphapassini commented 5 years ago

@andrewbaxter can you test this again with the branch master? I'm pretty sure that the errors are being reported as log.ERRORS and thus going to show up in Scrapy Cloud.

Gallaecio commented 5 years ago

@andrewbaxter @raphapassini @rennerocha I also believe that is the case with the latest version of spidermon. Does that mean #50 here and #47 can be closed?

andrewbaxter commented 5 years ago

Yep, this looks good to me! Thanks!

andrewbaxter commented 5 years ago

I guess I'll go ahead and close it.

andrewbaxter commented 5 years ago

Sorry @Gallaecio :( I was looking at the wrong logs...

Here's what my logs look like when I get a validation error currently:

2019-04-05 20:25:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://scrapinghub.com/>
{'v': 'hi', '_validation': defaultdict(<class 'list'>, {'v': ['Invalid int']})}
2019-04-05 20:25:45 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-05 20:25:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 876,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 18973,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 2,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 5, 11, 25, 45, 200465),
 'item_scraped_count': 1,
 'log_count/DEBUG': 5,
 'log_count/INFO': 9,
 'memusage/max': 64413696,
 'memusage/startup': 64413696,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'spidermon/validation/fields': 1,
 'spidermon/validation/fields/errors': 1,
 'spidermon/validation/fields/errors/invalid_int': 1,
 'spidermon/validation/fields/errors/invalid_int/v': 1,
 'spidermon/validation/items': 1,
 'spidermon/validation/items/errors': 1,
 'spidermon/validation/validators': 1,
 'spidermon/validation/validators/item/jsonschema': True,
 'start_time': datetime.datetime(2019, 4, 5, 11, 25, 42, 717363)}
2019-04-05 20:25:45 [scrapy.core.engine] INFO: Spider closed (finished)

My settings are:

SPIDERMON_ENABLED = True
EXTENSIONS = {
    'spidermon.contrib.scrapy.extensions.Spidermon': 500,
}
ITEM_PIPELINES = {
    'spidermon.contrib.scrapy.pipelines.ItemValidationPipeline': 800,
}
SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS = True
SPIDERMON_VALIDATION_SCHEMAS = [
    'xyz/schema.json',
]

I also tried without ADD_ERRORS_TO_ITEMS, but in that case I get neither an ERROR nor a warning... in fact, it's not clear what the validation does here other than increase a job stat.

rosheen33 commented 5 years ago

@andrewbaxter What kind of error you want here ? 'spidermon/validation/items/errors': 1, This field already describes how many items have validation errors

rosheen33 commented 5 years ago

@andrewbaxter If you want to get a warning on items, you can add the following setting SPIDERMON_VALIDATION_DROP_ITEMS_WITH_ERRORS = True https://spidermon.readthedocs.io/en/latest/item-validation.html#spidermon-validation-drop-items-with-errors

andrewbaxter commented 5 years ago

I think this was more about how the validation errors are logged/interact with the logging system.

Ideally, I'd like to see

'log_count/ERROR': 1,

and also have the errors appear in the logs with log level ERROR.

That way, say, in Scrapy Cloud, if you look at a job in the dashboard it will show "1 error" with a link, and when you click on it it will take you to the logs page with the ERROR filter already applied.

Right now even if there's lots of validation issues the job appears as finished with 0 errors.

rosheen33 commented 5 years ago

@andrewbaxter Maybe we can convert the Warning into an error on the above setting ? @rennerocha @raphapassini @Gallaecio What are your thoughts on this one ?

rennerocha commented 5 years ago

In my opinion, log/ERROR stat is not the right place to include the validation failures. It should be reserved to errors like code exceptions that affects the spider execution (not the data it was returned). I can see scenarios where is acceptable to have some number of validation failures and including them as errors may confuse the user.

When a monitor fails, it includes an error line in the logs, so I'd create a monitor that fails if spidermon/validation/items/errors is not equal to zero. When it fails, you'll see one error in Scrapy Cloud, and then you can check it.

Enabling SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS can help you to identify the items with problems (just look for the ones that have a _validation field filled).

rosheen33 commented 5 years ago

@rennerocha I agree That is exactly what we do in the maintenance projects [Attach a monitor to see how many items are not conforming to the validators] It is better to have a single error rather then a huge number of errors.

So should we close this issue ?

andrewbaxter commented 5 years ago

I think the severity of incorrect items depends on the project, and in projects where data quality needs to be high it's as significant an issue as any other error.

How about making whether it's an error or warning an option?

rosheen33 commented 5 years ago

@rennerocha @raphapassini Do we need a change here ?

rennerocha commented 2 years ago

Validation failures should not be logged as errors and added to log_count/ERROR stat. As mentioned before we have spidermon/validation/items/errors stat and SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS setting to add the validation error information into the items scraped so we can use monitors to raise errors when some threshold is not satisfied.

If we want a monitor to check the stats and then raise an error in the logs if it fails, we have the built-in monitor ItemValidationMonitor that can be easily included into the project. For more sophisticated validations, a custom monitor should be created.