scrapinghub / spidermon

Scrapy Extension for monitoring spiders execution.
https://spidermon.readthedocs.io
BSD 3-Clause "New" or "Revised" License
526 stars 94 forks source link

idea: Percentage thresholds for UnwantedHTTPCodesMonitor #376

Closed curita closed 1 year ago

curita commented 1 year ago

Background

At Zyte, we have internal custom monitors that check the percentage of response failures and the percentage of ignored responses ([private] reference).

Setting percentage thresholds instead of fixed max counts has been beneficial in projects where the number of initial requests varies. @gutsytechster suggested we could also use percentage thresholds for the unwanted HTTP status codes ([private] thread).

Current Status

At the moment, UnwantedHTTPCodesMonitor thresholds can be either set via a general SPIDERMON_UNWANTED_HTTP_CODES_MAX_COUNT or a SPIDERMON_UNWANTED_HTTP_CODES dictionary, where the keys are the HTTP codes and the values are the max allowed responses with those status codes.

Suggestion

We could add an alternative SPIDERMON_UNWANTED_HTTP_CODES_THRESHOLDS setting (or continue overloading SPIDERMON_UNWANTED_HTTP_CODES, but that might be confusing) for when we want to set percentage thresholds for unwanted response codes instead of fixed max counts. The option to set max counts should still exist.

Gallaecio commented 1 year ago

Sounds good to me, although I think overloading SPIDERMON_UNWANTED_HTTP_CODES for this may be better, so that in the same setting you can configure percentage values for some codes and absolute values for other codes, instead of combining different settings.

VMRuiz commented 1 year ago

@Gallaecio , Do you have any example of how to configure both fixed and percent based values using the same setting?

Gallaecio commented 1 year ago

Currently we have:

SPIDERMON_UNWANTED_HTTP_CODES = {
    400: 100,
    500: 0,
}

I am thinking that we could continue to support that, but also something like {"value": 5, "unit": "%"} instead of an int, e.g.

SPIDERMON_UNWANTED_HTTP_CODES = {
    400: {"value": 2.5, "unit": "%"},
    500: 0,
}

I can think of alternatives. For example, assuming it is OK not to support 0% (since it is the same as 0) or 100% (since it is the same as not having the status code in the setting in the first place), we could actually make it so float values mean percentages, since they do not make sense otherwise, e.g.

SPIDERMON_UNWANTED_HTTP_CODES = {
    400: 2.5,
    500: 0,
}

However, while this could work and is shorter, I think the dictionary approach is more intuitive, i.e. easier to understand for those unfamiliar with the setting.

We could also deprecate the int approach in favor of a dict approach for units as well. No strong opinion either way on that one.

And at some point we might want to support defining things like “max(5%, 100)“ or ”min(5%, 100)” in the same setting. But I guess we will cross that bridge when we get to it.

VMRuiz commented 1 year ago

Maybe we can add some syntax sugar wrappers ?

SPIDERMON_UNWANTED_HTTP_CODES = {
    400: MAX_PERCENT_REQUEST(10), 
    500: MAX_TOTAL_REQUEST(10),
}

MAX_NUM_REQUEST(0), MAX_PERCENT_REQUEST(100) , MAX_PERCENT_REQUEST(0) would mean to ignore.

Current int values would interpreted MAX_NUM_REQUEST(n) to be backward compatible.

Anyway, I think this would require to refactor many others monitors to make this functionality compatible across all of them. Otherwise, it can get very confusing and frustracting to have some monitor supporting it and others that doesn't without any particular logic behind it.

Gallaecio commented 1 year ago

I would avoid the syntax sugar since it does not add much value, and I think it is better to keep the syntax as similar as possible between Python and non-Python contexts (e.g. in the command line or in Scrapy Cloud you would not be able to use this syntax sugar).

We could go with deprecating the int approach, and go for something like this if you find it more readable:

SPIDERMON_UNWANTED_HTTP_CODES = {
    400: {"max_requests": 10, "unit": "%"}, 
    500: {"max_requests": 10, "unit": "unit"},
}

"unit": "unit" could be confusing, but I cannot see a clear alternative ("unit": "count"? "unit": "requests"?).

VMRuiz commented 1 year ago

it is better to keep the syntax as similar as possible between Python and non-Python contexts (e.g. in the command line or in Scrapy Cloud

I find it difficult to use either of these in command line or Scrapy Cloud anyway. Also, I don't think this variable will change so much between scrapy jobs that it needs to be pass it as argument. However, I'm not strongly against using dicts either. So let see if others have any preference on this.

"unit": "unit" could be confusing, but I cannot see a clear alternative ("unit": "count"? "unit": "requests"?).

requests is more clear, but also more verbose.