scrapinghub / spidermon

Scrapy extension for monitoring spider execution.
https://spidermon.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Per-field required percentage checking #151

Closed: wRAR closed this issue 4 years ago

wRAR commented 5 years ago

Right now Spidermon field percentage validation works in this manner:

The result is not very useful, as it's not clear what threshold to set: the missing counts for all fields across all items are combined into one number.
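To make the problem concrete, here is a rough illustration; the stat keys follow Spidermon's `spidermon/validation/...` naming convention, but the numbers and the combined check are made up for the example:

```python
# Hypothetical stats for a run of 100 items where "name" is missing
# in 5 items and "price" is missing in 30:
stats = {
    "spidermon/validation/items": 100,
    "spidermon/validation/fields/errors/missing_required_field": 35,
    "spidermon/validation/fields/errors/missing_required_field/name": 5,
    "spidermon/validation/fields/errors/missing_required_field/price": 30,
}

# The existing percentage check only sees the combined count, so it
# compares 35 / 100 = 35% against a single threshold, even though
# "name" and "price" clearly call for different limits.
combined = stats["spidermon/validation/fields/errors/missing_required_field"]
percent = combined / stats["spidermon/validation/items"]  # 0.35
```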

What we actually need is a percentage check for each required field or, better, separate percentages for all fields. Here is how I see it:

Some implementations I've seen:

The main question here is where to count the fields and where to store the counts. ItemValidationPipeline seems to be the ideal place for the counting code, as it already processes all items. The data can be stored in the stats, though I'm not sure what the policy is on adding multiple new keys (one per field) to the stats. If this is a problem, the counting can be disabled unless some setting enables it; see the sketch below.
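A minimal sketch of what that counting could look like, assuming a standalone Scrapy pipeline and a hypothetical `fields_coverage/` stats prefix (neither is existing Spidermon code):

```python
class FieldCoveragePipeline:
    """Count, per field, how many scraped items have that field set."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy injects the crawler, which carries the stats collector.
        return cls(crawler.stats)

    def process_item(self, item, spider):
        self.stats.inc_value("fields_coverage/items")
        for field, value in dict(item).items():
            if value is not None:
                # One counter per populated field.
                self.stats.inc_value(f"fields_coverage/{field}")
        return item
```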

The thresholds can be taken from settings and spider attributes; it should be possible to pass either a single number or a dict of field-to-number mappings. I don't think this is a problem.
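For example, either of these setting shapes would work; the setting name is illustrative, not an existing Spidermon setting:

```python
# settings.py: one threshold applied to every field...
SPIDERMON_FIELD_COVERAGE_THRESHOLD = 0.9

# ...or a per-field mapping:
SPIDERMON_FIELD_COVERAGE_THRESHOLD = {
    "name": 1.0,   # must be present in every item
    "price": 0.7,  # present in at least 70% of items
}
```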

To keep compatibility, the old methods should be kept and new ones created, though I'm not sure the old ones are useful as is.

We may also want to check percentages for subfields. This is possible with the current code, as jsonschema emits errors for them too, but because those errors are mangled when they are converted to stats keys, some guesswork is needed to unmangle them. My current proposal doesn't take this additional feature into account; suggestions welcome.
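For reference, jsonschema does report the path to a failing subfield on each error, so the information exists before it is flattened into a stats key. A minimal demonstration:

```python
from jsonschema import Draft7Validator

schema = {
    "type": "object",
    "required": ["offer"],
    "properties": {
        "offer": {"type": "object", "required": ["price"]},
    },
}

for error in Draft7Validator(schema).iter_errors({"offer": {}}):
    # absolute_path is a deque of keys leading to the failing subfield.
    print(list(error.absolute_path), "-", error.message)
    # ['offer'] - 'price' is a required property
```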

muzaffaryousaf commented 5 years ago

@wRAR Can we change the existing methods and mechanism in such a way that, if additional/optional fields are provided, we do the validation in the suggested way, rather than having duplicated code with minor differences?

ejulio commented 5 years ago

The idea looks good. My thoughts are:

manycoding commented 5 years ago

Assuming by schema you mean JSON Schema, I would avoid having anything extra there apart from the JSON Schema standard. The reason is that we already have several libraries with custom keywords, and that already creates some misunderstandings.

wRAR commented 5 years ago

@ejulio missing_required_field stats are provided only for fields marked as required in the schema, and I don't like the idea of marking all or most fields as required, especially ones which we know are present in only 20% of items. In my opinion, only truly required fields should be marked as such and expected to be present in all items; the current code allows for some tolerance, but that's another matter.

Adding a new pipeline has its benefits (no new code runs unless you enable it); I just didn't like the idea of having to enable two Spidermon pipelines, but maybe it's not that bad.

Per-item-type settings are an important feature, and I'm sure they can be supported somehow, even if it requires a complicated settings structure. We can still provide a simplified structure for projects with a single item type.

@manycoding agreed.

ejulio commented 5 years ago

@wRAR, makes sense. I thought it was related to required items, not optional ones. Sorry!

wRAR commented 5 years ago

Yup. To clarify: in an established project with periodic recrawls, you usually know the expected percentages for many optional fields and may want to be notified when one of them drops. I also think that making sure all fields have a non-zero percentage can be useful, as a zero percentage most likely means the website has changed, unless the data is genuinely rare.

wRAR commented 5 years ago

We've decided to try this approach:

This should already be very useful. If it works, I'll see whether anything else useful can be done.

wRAR commented 5 years ago

The first stage is merged. The next stage is to use the new code to implement the idea of per-field thresholds; most of that is probably out of scope for the mixin and will live in user-created monitors based on it (see the sketch below). I think we can also create a monitor that tries to take the thresholds from the settings.
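A rough sketch of what such a user-created monitor could look like, assuming the hypothetical `fields_coverage/` stats from earlier in this thread; the class name and stat keys are illustrative, not Spidermon API:

```python
from spidermon import Monitor, monitors


@monitors.name("Per-field coverage thresholds")
class PerFieldThresholdMonitor(Monitor):
    # In a settings-driven version these would come from the crawler
    # settings; hardcoded here to keep the sketch short.
    thresholds = {"name": 1.0, "price": 0.7}

    @monitors.name("Field coverage is above the threshold")
    def test_field_coverage(self):
        total = getattr(self.data.stats, "fields_coverage/items", 0)
        for field, threshold in self.thresholds.items():
            count = getattr(self.data.stats, f"fields_coverage/{field}", 0)
            coverage = count / total if total else 0
            self.assertGreaterEqual(
                coverage,
                threshold,
                msg=f"'{field}' coverage is {coverage:.0%}, "
                    f"expected at least {threshold:.0%}",
            )
```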

rennerocha commented 4 years ago

https://github.com/scrapinghub/spidermon/pull/262 provided stats for the number of items returned and for coverage by field, without relying on schemas. These stats can be compared between spider executions, and custom monitors can be used to detect a drop in the coverage of a certain field.

https://github.com/scrapinghub/spidermon/pull/263 added a new built-in monitor and an easy way to set coverage thresholds by field.
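If I read the merged PRs correctly, usage looks roughly like this (double-check the setting and class names against the Spidermon docs):

```python
# settings.py: per-field coverage stats (#262) plus the built-in
# coverage monitor with per-field thresholds (#263).
SPIDERMON_ENABLED = True
SPIDERMON_ADD_FIELD_COVERAGE = True

# For dict items the rule keys take the form "dict/<field name>".
SPIDERMON_FIELD_COVERAGE_RULES = {
    "dict/name": 1.0,   # "name" must be present in every item
    "dict/price": 0.7,  # "price" in at least 70% of items
}
```

The `FieldCoverageMonitor` from `spidermon.contrib.scrapy.monitors` then has to be added to the project's monitor suite like any other built-in monitor.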

I believe that setting thresholds should not be tied to any specific JSON Schema, so in my opinion these PRs can be used to solve the problems mentioned in this issue.