where number_of_jobs's value is defined by the SPIDERMON_JOBS_COMPARISON setting.
Issue
Supposedly, we should get the SPIDERMON_JOBS_COMPARISON number of previous jobs to compute the average, as stated in the docs, but we might get more because of how those jobs are iterated.
Debugging
The first _jobs = client.spider.jobs.list(..., count=number_of_jobs) call will get number_of_jobs at most, as it is expected.
The function then keeps iterating, skipping 1000 jobs ahead (start+= 1000) to fetch #number_of_jobs more.
That 1000 number might come from the fact that client.spider.jobs.list() doesn't allow the return of more than 1000 jobs, so the intention might have been to paginate the results to get everything. But because we only want #number_of_jobs (which should be less than 1000, otherwise the function breaks), then this causes us to fetch more jobs than requested instead.
It does so until there are no more jobs available to list
For example, if some spider has 3500 jobs that meet the requested criteria and SPIDERMON_JOBS_COMPARISON is set to 10, then this will happen:
Resulting in 40 previous jobs, when only ten were requested.
Expected Behaviour
Only SPIDERMON_JOBS_COMPARISON number of jobs should be fetched as previous jobs, and those should be the latest ones that meet the requested criteria.
Background
The method to get the previous jobs used in ZyteJobsComparisonMonitor can be seen here:
https://github.com/scrapinghub/spidermon/blob/a7e195f951a4b9a4837943dff0b2ae727b76f237/spidermon/contrib/scrapy/monitors/monitors.py#L557-L579
where
number_of_jobs
's value is defined by theSPIDERMON_JOBS_COMPARISON
setting.Issue
Supposedly, we should get the
SPIDERMON_JOBS_COMPARISON
number of previous jobs to compute the average, as stated in the docs, but we might get more because of how those jobs are iterated.Debugging
_jobs = client.spider.jobs.list(..., count=number_of_jobs)
call will getnumber_of_jobs
at most, as it is expected.#number_of_jobs
more.client.spider.jobs.list()
doesn't allow the return of more than 1000 jobs, so the intention might have been to paginate the results to get everything. But because we only want#number_of_jobs
(which should be less than 1000, otherwise the function breaks), then this causes us to fetch more jobs than requested instead.For example, if some spider has 3500 jobs that meet the requested criteria and SPIDERMON_JOBS_COMPARISON is set to 10, then this will happen:
Round #0
: len(_jobs) == 10 (client.spider.jobs.list(start=0, count=10))Round #1
: len(_jobs) == 20 (client.spider.jobs.list(start=1000, count=10))Round #2
: len(_jobs) == 30 (client.spider.jobs.list(start=2000, count=10))Round #3
: len(_jobs) == 40 (client.spider.jobs.list(start=3000, count=10))Round #5
: stop (client.spider.jobs.list(start=4000, count=10))Resulting in 40 previous jobs, when only ten were requested.
Expected Behaviour
Only SPIDERMON_JOBS_COMPARISON number of jobs should be fetched as previous jobs, and those should be the latest ones that meet the requested criteria.