Closed: wRAR closed this issue 4 years ago.
@wRAR Can we change the existing methods and mechanism in such a way that, if additional/optional fields are provided, we do the validation in the suggested way rather than having duplicated code with minor differences?
The idea looks good. My thoughts are:

- `ItemValidationPipeline`? It stores missing fields as `spidermon/validation/fields/errors/missing_required_field/<fieldname>`
- `custom_settings`: dicts or a number (default for all fields)
- `ErrorCountMonitor` can raise a notification. Ideally, if there is some amount of "expected errors", that monitor shouldn't fail if enabled in `ItemValidationPipeline`.

Assuming by schema you mean JSON Schema, I would avoid having anything extra there apart from the JSON Schema standard. The reason is that we already have several libraries with custom keywords, and it already creates some misunderstandings.
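The `custom_settings` idea above — a single number as a default for all fields, or a dict of per-field values — could be normalized with a small helper. This is a hypothetical sketch, not Spidermon API; the function name and setting shapes are assumptions for illustration.

```python
# Hypothetical helper: accept a threshold setting that is either a single
# number (one default applied to every field) or a dict of per-field values.
# Not part of Spidermon; names are illustrative only.

def get_field_threshold(setting, field, default=0.0):
    """Return the expected-coverage threshold for one field."""
    if isinstance(setting, dict):
        # Per-field form: fall back to `default` for unlisted fields.
        return setting.get(field, default)
    if isinstance(setting, (int, float)):
        # Flat form: one number for all fields.
        return setting
    return default
```

For example, `get_field_threshold(0.8, "title")` gives `0.8` for every field, while `get_field_threshold({"title": 0.9}, "price", default=0.5)` falls back to the default for fields not listed.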
@ejulio `missing_required_field` stats are provided only for fields marked as required in the schema, and I don't like the idea of marking all or most fields as required, especially ones which we know are present on only 20% of items. In my opinion only truly required fields should be marked as such and should be expected to be present in all items; the current code allows for some tolerance, but that's another matter.
Adding a new pipeline has its benefits (no new code will run unless you enable it); I just didn't like the idea of enabling two Spidermon pipelines, but maybe that's not bad.
Per-item-type settings are an important feature, and I'm sure they can be supported somehow, even if it requires complicated settings. We can still offer a simplified setting structure for projects with a single item type.
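One way the "complicated settings, with a simplified form for single-item-type projects" idea could look, as a hedged sketch (these setting shapes are assumptions, not an existing Spidermon format):

```python
# Hypothetical resolution of per-item-type threshold settings.
# Full form nests thresholds under item type names; the simplified flat
# form is for projects with one item type. Not actual Spidermon code.

def resolve_thresholds(setting, item_type=None):
    """Return the {field: threshold} dict for one item type.

    Full form:       {"ProductItem": {"title": 0.9}, "ReviewItem": {"body": 0.8}}
    Simplified form: {"title": 0.9}   (single item type, no nesting)
    """
    if not setting:
        return {}
    if all(isinstance(v, dict) for v in setting.values()):
        # Nested form: every value is a per-field dict keyed by item type.
        return setting.get(item_type, {})
    # Flat form: thresholds given directly for the only item type.
    return setting
```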
@manycoding agreed.
@wRAR, makes sense. I thought it was related to required items, not optional ones. Sorry!
Yup. To clarify: in an established project with periodic recrawls you usually know the expected percentages for many optional fields and may want to be notified when one drops. I also think that making sure all fields have a non-zero percentage can be useful, as a zero percentage most likely means the website has changed, unless the data is very rare.
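The "no field should have zero percentage" check above could be sketched like this, assuming per-field coverage fractions have already been computed from job stats (the function and stat layout are illustrative, not Spidermon API):

```python
# Sketch: flag fields that never appeared in any item, since 0% coverage
# usually means the website changed. `rare_fields` lets known-rare data
# be excluded. Hypothetical helper, not part of Spidermon.

def fields_with_zero_coverage(coverage, rare_fields=()):
    """Return sorted field names whose coverage fraction is 0."""
    return sorted(
        field
        for field, pct in coverage.items()
        if pct == 0 and field not in rare_fields
    )
```

A monitor could then fail (or just notify) when this list is non-empty.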
We've decided to try this approach:
This should already be very useful. If this works I'll see if there is anything else useful that can be done.
The first stage is merged. The next stage should use the new code to implement the idea of per-field thresholds; most of it is probably out of scope of the mixin and will reside in user-created monitors based on it. I think we can also create a monitor that takes the thresholds from the settings.
https://github.com/scrapinghub/spidermon/pull/262 provided stats for the number of items returned and coverage by field, without relying on schemas. These stats can be compared between spider executions, and custom monitors can be used to detect a drop in coverage of a certain field.
https://github.com/scrapinghub/spidermon/pull/263 added a new built-in monitor and an easy way to set coverage thresholds by field.
I believe that setting thresholds should not be tied to any specific JSON Schema, so in my opinion these PRs can be used to solve the problems mentioned in this issue.
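Based on Spidermon's documented field-coverage feature (which these PRs introduced), enabling it looks roughly like the settings fragment below; check the current Spidermon documentation for the exact setting names before relying on them.

```python
# Scrapy project settings enabling Spidermon field-coverage checks,
# per the feature these PRs added. Field names are example data.

# Have the item pipeline record per-field coverage stats.
SPIDERMON_ADD_FIELD_COVERAGE = True

# Minimum expected coverage per field. Keys use "<item type>/<field>"
# paths ("dict" for plain dict items); values are fractions of items.
SPIDERMON_FIELD_COVERAGE_RULES = {
    "dict/title": 1.0,   # every item must have a title
    "dict/price": 0.7,   # price expected on at least 70% of items
}
```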
Right now Spidermon field percentage validation works in this manner:
The result is not very useful, as it's not clear what threshold to set, because all missing fields across all items are combined into one number.
What we actually need is a percentage check for each required field or, better, separate percentages for all fields. How I see this:

- `ValidationInfo` calculates percentages for each field by dividing counts by the item count
- `ValidationMonitorMixin` gets per-field, per-spider and per-project thresholds and compares the percentage for each field with the relevant threshold

Some implementations I've seen:
The main question here is where to count the fields and where to store them.
`ItemValidationPipeline` seems to be the ideal place for the counting code, as it already processes all items. The data can be stored in the stats, though I'm not sure what the policy is on adding multiple new keys (one per field) to the stats. If this is a problem, the code can be disabled unless some setting is set.

The thresholds can be taken from settings and spider attributes; it should be possible to pass a single number or a dict of field: number mappings. I don't think this is a problem.
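A minimal sketch of the counting idea, assuming a dict-like stand-in for Scrapy's stats collector: a pipeline counts field occurrences into one stats key per field, and a helper divides each count by the item count, as `ValidationInfo` would. The class name and key prefix here are illustrative, not the actual `ItemValidationPipeline`.

```python
# Hypothetical field-counting pipeline; not actual Spidermon code.

class FieldCountPipeline:
    STAT_PREFIX = "spidermon/validation/fields/coverage/"

    def __init__(self, stats):
        # `stats` is a plain dict standing in for Scrapy's stats collector.
        self.stats = stats

    def process_item(self, item, spider=None):
        self.stats["item_count"] = self.stats.get("item_count", 0) + 1
        for field, value in item.items():
            if value is not None:
                # One stats key per field, incremented when the field is set.
                key = self.STAT_PREFIX + field
                self.stats[key] = self.stats.get(key, 0) + 1
        return item


def field_percentages(stats):
    """Divide each per-field count by the item count."""
    total = stats.get("item_count", 0)
    prefix = FieldCountPipeline.STAT_PREFIX
    return {
        key[len(prefix):]: count / total
        for key, count in stats.items()
        if key.startswith(prefix) and total
    }
```

A monitor mixin could then compare each entry of `field_percentages(stats)` against the relevant per-field threshold.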
To keep compatibility, the old methods should be kept and new ones created, though I'm not sure whether the old ones are useful as is.
We may also want to check percentages for subfields. That's possible with the current code, as jsonschema emits errors for them too, though because of the mangling that happens when converting errors to stats keys, some guesswork is needed to unmangle them. My current proposal doesn't take this additional feature into account; suggestions welcome.