scrapinghub / shub-workflow


Avoid duplicate scheduling of scripts #7

Closed Gallaecio closed 4 years ago

Gallaecio commented 4 years ago

When the first API request to schedule a script fails due to a request timeout, but the script job is nevertheless successfully created in Scrapy Cloud, the current implementation retries the scheduling. Each retry gets an error response saying the job is a duplicate of an existing one, until the previously-scheduled job finishes; at that point a retry succeeds and a duplicate job is created.

I noticed that the counterpart implementation for spiders has a check in place to avoid this issue: if the API reports that the job is a duplicate of a running job, it stops retrying. There is no reason not to take the same approach for scripts.
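
For reference, here is a rough sketch of the retry behaviour being proposed, written against the python-scrapinghub client directly; the function name, signature, and exception handling are illustrative assumptions, not the actual shub-workflow code:

```python
import logging
import time

from scrapinghub.client.exceptions import DuplicateJobError, ScrapinghubAPIError

logger = logging.getLogger(__name__)


def schedule_once(project, spider_or_script, retries=3, **run_kwargs):
    """Schedule a job, treating a 'duplicate job' response as success.

    If the first request times out after the job was actually created,
    later retries are rejected as duplicates; in that case we stop
    retrying instead of waiting for the original job to finish and then
    scheduling a second copy of it.
    """
    for attempt in range(1, retries + 1):
        try:
            job = project.jobs.run(spider_or_script, **run_kwargs)
            return job.key
        except DuplicateJobError:
            # The earlier (timed-out) request did go through: the job is
            # already running or pending, so there is nothing left to do.
            logger.warning("%r is already scheduled; not retrying.", spider_or_script)
            return None
        except ScrapinghubAPIError:
            logger.warning("Scheduling %r failed (attempt %d/%d).",
                           spider_or_script, attempt, retries)
            time.sleep(2 ** attempt)
    raise RuntimeError("Could not schedule %r after %d attempts"
                       % (spider_or_script, retries))
```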

So I’ve extracted the body of schedule_spider into a new _schedule_job helper, removed the now-unnecessary ‘spider’ references from its log messages, and reimplemented schedule_script on top of _schedule_job.
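
The shape of that refactor might look roughly like the following; this is a simplified sketch under assumed names and signatures, not the actual patch:

```python
import logging

from scrapinghub.client.exceptions import DuplicateJobError

logger = logging.getLogger(__name__)


class SchedulingMixin:
    """Illustrative mixin: one shared _schedule_job helper so the
    duplicate-job check applies to spiders and scripts alike."""

    project = None  # assumed to be a python-scrapinghub Project instance

    def _schedule_job(self, job_cmd, **kwargs):
        try:
            job = self.project.jobs.run(job_cmd, **kwargs)
        except DuplicateJobError:
            logger.warning("Job %r is already running; skipped scheduling.", job_cmd)
            return None
        logger.info("Scheduled job %s", job.key)
        return job.key

    def schedule_spider(self, spider, **kwargs):
        return self._schedule_job(spider, **kwargs)

    def schedule_script(self, script, **kwargs):
        # Scrapy Cloud scripts are scheduled the same way, typically
        # addressed as "py:<script name>".
        return self._schedule_job(script, **kwargs)
```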

These changes include one API change: the removal of shub_workflow.utils.schedule_script_in_dash (which was also exposed as shub_workflow.script.schedule_script_in_dash).

CC: @hermit-crab (who discovered and diagnosed the original issue)

kalessin commented 4 years ago

@Gallaecio sorry for the delay on this. It is merged now, and I have released version 1.6.5 of shub-workflow on PyPI.