scrapinghub / spidermon

Scrapy extension for monitoring spider execution.
https://spidermon.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Per-field required percentage checking #151

Closed: wRAR closed this issue 4 years ago

wRAR commented 5 years ago

Right now Spidermon field percentage validation works in this manner:

The result is not very useful, as it's not clear what threshold to set: the missing counts for all fields across all items are combined into one number.
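To make the problem concrete, here is a rough illustration; the stat keys follow Spidermon's `spidermon/validation/...` naming convention, but the numbers and the combined check are made up for the example:

```python
# Hypothetical stats for a run of 100 items where "name" is missing
# in 5 items and "price" is missing in 30:
stats = {
    "spidermon/validation/items": 100,
    "spidermon/validation/fields/errors/missing_required_field": 35,
    "spidermon/validation/fields/errors/missing_required_field/name": 5,
    "spidermon/validation/fields/errors/missing_required_field/price": 30,
}

# The existing percentage check only sees the combined count, so it
# compares 35 / 100 = 35% against a single threshold, even though
# "name" and "price" clearly call for different limits.
combined = stats["spidermon/validation/fields/errors/missing_required_field"]
percent = combined / stats["spidermon/validation/items"]  # 0.35
```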

What we actually need is a percentage check for each required field or, better, separate percentages for all fields. Here is how I see it:

Some implementations I've seen:

The main question here is where to count the fields and where to store the counts. ItemValidationPipeline seems to be the ideal place for the counting code, as it already processes all items. The data can be stored in the stats, though I'm not sure what the policy is on adding multiple new keys (one per field) to the stats. If this is a problem, the counting can be disabled unless some setting enables it; see the sketch below.
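A minimal sketch of what that counting could look like, assuming a standalone Scrapy pipeline and a hypothetical `fields_coverage/` stats prefix (neither is existing Spidermon code):

```python
class FieldCoveragePipeline:
    """Count, per field, how many scraped items have that field set."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy injects the crawler, which carries the stats collector.
        return cls(crawler.stats)

    def process_item(self, item, spider):
        self.stats.inc_value("fields_coverage/items")
        for field, value in dict(item).items():
            if value is not None:
                # One counter per populated field.
                self.stats.inc_value(f"fields_coverage/{field}")
        return item
```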

The thresholds can be taken from settings and spider attributes; it should be possible to pass either a single number or a dict of field-to-number mappings. I don't think this is a problem.
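For example, either of these setting shapes would work; the setting name is illustrative, not an existing Spidermon setting:

```python
# settings.py: one threshold applied to every field...
SPIDERMON_FIELD_COVERAGE_THRESHOLD = 0.9

# ...or a per-field mapping:
SPIDERMON_FIELD_COVERAGE_THRESHOLD = {
    "name": 1.0,   # must be present in every item
    "price": 0.7,  # present in at least 70% of items
}
```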

To keep compatibility, the old methods should be kept and new ones created, though I'm not sure the old ones are useful as is.

We may also want to check percentages for subfields. This is possible with the current code, as jsonschema emits errors for them too, but because those errors are mangled when they are converted to stats keys, some guesswork is needed to unmangle them. My current proposal doesn't take this additional feature into account; suggestions welcome.
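For reference, jsonschema does report the path to a failing subfield on each error, so the information exists before it is flattened into a stats key. A minimal demonstration:

```python
from jsonschema import Draft7Validator

schema = {
    "type": "object",
    "required": ["offer"],
    "properties": {
        "offer": {"type": "object", "required": ["price"]},
    },
}

for error in Draft7Validator(schema).iter_errors({"offer": {}}):
    # absolute_path is a deque of keys leading to the failing subfield.
    print(list(error.absolute_path), "-", error.message)
    # ['offer'] - 'price' is a required property
```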

muzaffaryousaf commented 5 years ago

@wRAR Can we change the existing methods and mechanism in such a way that, if additional/optional fields are provided, we do the validation in the suggested way, rather than having duplicated code with minor differences?

ejulio commented 5 years ago

The idea looks good. My thoughts are:

manycoding commented 5 years ago

Assuming by schema you mean JSON Schema, I would avoid having anything extra there apart from the JSON Schema standard. The reason is that we already have several libraries with custom keywords, and that already creates some misunderstandings.

wRAR commented 5 years ago

@ejulio missing_required_field stats are provided only for fields marked as required in the schema, and I don't like the idea of marking all or most fields as required, especially ones which we know are present in only 20% of items. In my opinion, only truly required fields should be marked as such and expected to be present in all items; the current code allows for some tolerance, but that's another matter.

Adding a new pipeline has its benefits (no new code runs unless you enable it); I just didn't like the idea of having to enable two Spidermon pipelines, but maybe it's not that bad.

Per-item-type settings are an important feature, and I'm sure they can be supported somehow, even if it requires a complicated settings structure. We can still provide a simplified structure for projects with a single item type.

@manycoding agreed.

ejulio commented 5 years ago

@wRAR, makes sense. I thought it was related to required items, not optional ones. Sorry!

wRAR commented 5 years ago

Yup. To clarify: in an established project with periodic recrawls, you usually know the expected percentages for many optional fields and may want to be notified when one of them drops. I also think that making sure all fields have a non-zero percentage can be useful, as a zero percentage most likely means the website has changed, unless the data is genuinely rare.

wRAR commented 5 years ago

We've decided to try this approach:

This should already be very useful. If it works, I'll see whether anything else useful can be done.

wRAR commented 5 years ago

The first stage is merged. The next stage is to use the new code to implement the idea of per-field thresholds; most of that is probably out of scope for the mixin and will live in user-created monitors based on it (see the sketch below). I think we can also create a monitor that tries to take the thresholds from the settings.
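A rough sketch of what such a user-created monitor could look like, assuming the hypothetical `fields_coverage/` stats from earlier in this thread; the class name and stat keys are illustrative, not Spidermon API:

```python
from spidermon import Monitor, monitors


@monitors.name("Per-field coverage thresholds")
class PerFieldThresholdMonitor(Monitor):
    # In a settings-driven version these would come from the crawler
    # settings; hardcoded here to keep the sketch short.
    thresholds = {"name": 1.0, "price": 0.7}

    @monitors.name("Field coverage is above the threshold")
    def test_field_coverage(self):
        total = getattr(self.data.stats, "fields_coverage/items", 0)
        for field, threshold in self.thresholds.items():
            count = getattr(self.data.stats, f"fields_coverage/{field}", 0)
            coverage = count / total if total else 0
            self.assertGreaterEqual(
                coverage,
                threshold,
                msg=f"'{field}' coverage is {coverage:.0%}, "
                    f"expected at least {threshold:.0%}",
            )
```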

rennerocha commented 4 years ago

https://github.com/scrapinghub/spidermon/pull/262 provided stats for the number of items returned and for coverage by field, without relying on schemas. These stats can be compared between spider executions, and custom monitors can be used to detect a drop in the coverage of a certain field.

https://github.com/scrapinghub/spidermon/pull/263 added a new built-in monitor and an easy way to set coverage thresholds by field.
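If I read the merged PRs correctly, usage looks roughly like this (double-check the setting and class names against the Spidermon docs):

```python
# settings.py: per-field coverage stats (#262) plus the built-in
# coverage monitor with per-field thresholds (#263).
SPIDERMON_ENABLED = True
SPIDERMON_ADD_FIELD_COVERAGE = True

# For dict items the rule keys take the form "dict/<field name>".
SPIDERMON_FIELD_COVERAGE_RULES = {
    "dict/name": 1.0,   # "name" must be present in every item
    "dict/price": 0.7,  # "price" in at least 70% of items
}
```

The `FieldCoverageMonitor` from `spidermon.contrib.scrapy.monitors` then has to be added to the project's monitor suite like any other built-in monitor.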

I believe that setting thresholds should not be tied to any specific JSON Schema, so in my opinion these PRs can be used to solve the problems mentioned in this issue.