scrapinghub / shub-workflow


issue(graphmanager): Issue when waiting on already scheduled jobs #53

Open curita opened 1 year ago

curita commented 1 year ago

Situation

  1. A task was due to be run.
  2. The request to schedule it failed and was retried a couple of times (apparently because of a ReadTimeout error).
  3. Despite the timeouts, the job was actually scheduled on the server side.
  4. Eventually, one of the retried scheduling requests succeeded in getting a response.
  5. The response was that the job is already scheduled.
  6. Because of this, the returned job_id is None (see BaseScript._schedule_job() and Task.run()).
  7. This None job_id is then tracked among the running jobs (see GraphManager.run_pending_jobs()).
  8. GraphManager waits for it to finish (see GraphManager.check_running_jobs()).
  9. Since None is not an actual job key, it never finishes, so the GraphManager gets stuck and never moves on to the next task. A sketch of this flow follows below.
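
A minimal sketch of the failure mode described above. The class and method names mirror shub_workflow, but the bodies are simplified stand-ins written for illustration, not the actual implementation:

```python
# Toy reproduction of the stuck-manager flow; every body here is a
# hypothetical stand-in for the real shub_workflow code.

class ToyGraphManager:
    def __init__(self):
        self.running_jobs = []

    def run_job(self, task):
        # In the real code, BaseScript._schedule_job() returns None when
        # JobQ answers "already scheduled" instead of returning a job key.
        return None

    def run_pending_jobs(self, tasks):
        for task in tasks:
            job_id = self.run_job(task)
            # The bug: None is appended and tracked like a real job key.
            self.running_jobs.append(job_id)

    def is_finished(self, job_id):
        # A real key like "265529/71/10349" would eventually be reported
        # finished by JobQ; None never will be.
        return job_id is not None and self._platform_says_finished(job_id)

    def _platform_says_finished(self, job_id):
        return True  # placeholder for the JobQ metadata lookup

    def check_running_jobs(self):
        self.running_jobs = [j for j in self.running_jobs if not self.is_finished(j)]
        return self.running_jobs


mgr = ToyGraphManager()
mgr.run_pending_jobs(["consumers"])
print(mgr.check_running_jobs())  # [None] forever: "Job consumers (None) still running"
```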

Logs

Affected jobs:

2023-10-16 05:51:07.032000 INFO [shub_workflow.graph] Job discovery (265529/71/10349) still running
2023-10-16 05:53:07.597000 INFO [shub_workflow.graph] Job "graph_manager/discovery" (265529/71/10349) finished
2023-10-16 05:53:07.598000 INFO [shub_workflow.graph.task] Will start task "graph_manager/consumers"
2023-10-16 05:54:07.726000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 05:54:07.727000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 05:56:07.918000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 05:56:07.918000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 05:58:08.104000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 05:58:08.104000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 06:00:08.300000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 06:00:08.301000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 06:02:08.490000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 06:02:08.490000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 06:04:08.671000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 06:04:08.671000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 06:06:08.858000 ERROR [shub_workflow.script] Failed to schedule job with arguments {'spider': 'py:hcfmanager.py', 'add_tag': ['FLOW_ID=869a601a-24f7-4f3b-8ce4-2a883c2eaac6', 'PARENT_NAME=graph_manager', 'TASK_ID=consumers'], 'units': None, 'cmd_args': 'imdb.com_fresh imdb.com_fresh slot_ --loop-mode=60 "--frontera-settings-json={\\"BACKEND\\": \\"hcf_backend.HCFBackend\\", \\"HCF_PROJECT_ID\\": \\"265529\\", \\"HCF_AUTH\\": \\"153822e638164f449e637000b46194fb\\", \\"HCF_PRODUCER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_PRODUCER_SLOT_PREFIX\\": \\"slot_\\", \\"HCF_PRODUCER_NUMBER_OF_SLOTS\\": 11, \\"HCF_CONSUMER_FRONTIER\\": \\"imdb.com_fresh\\", \\"HCF_CONSUMER_MAX_BATCHES\\": 150, \\"MAX_NEXT_REQUESTS\\": 15000, \\"HCF_CONSUMER_DELETE_BATCHES_ON_STOP\\": true}" --max-running-jobs=11 "--job-settings={\\"SCHEDULER\\": \\"scrapy_frontera.scheduler.FronteraScheduler\\", \\"SPIDER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerSpiderMiddleware\\": 0}, \\"DOWNLOADER_MIDDLEWARES\\": {\\"netflix.scheduler.FronteraSchedulerDownloaderMiddleware\\": 0}, \\"FRONTERA_SCHEDULER_REQUEST_CALLBACKS_TO_FRONTIER\\": [\\"crawl_title\\", \\"crawl_person\\"], \\"FRONTERA_SCHEDULER_CALLBACK_SLOT_PREFIX_MAP\\": {\\"crawl_title\\": \\"slot_crawl_title_/11\\", \\"crawl_person\\": \\"slot_crawl_person_/11\\"}, \\"FRONTERA_SCHEDULER_SKIP_START_REQUESTS\\": true, \\"URLACCUMULATOR_SCHEDULE_URLS_ENABLED\\": false, \\"SPIDERMON_SPIDER_CLOSE_MONITORS\\": [\\"netflix.monitors.WorkflowConsumersCloseMonitorSuite\\"], \\"SPIDERMON_PERIODIC_MONITORS\\": {\\"netflix.monitors.WorkflowConsumersPeriodicMonitorSuite\\": 10800}}"', 'meta': None}
2023-10-16 06:06:08.858000 ERROR [shub_workflow.utils] Retrying shub_workflow.script.BaseScript._schedule_job in 60.0 seconds as it raised ReadTimeout: HTTPSConnectionPool(host='app.scrapinghub.com', port=443): Read timed out. (read timeout=60).
2023-10-16 06:08:10.015000 ERROR [shub_workflow.script] py:hcfmanager.py already scheduled
2023-10-16 06:10:10.333000 INFO [shub_workflow.graph] Job consumers (None) still running
2023-10-16 06:12:10.659000 INFO [shub_workflow.graph] Job consumers (None) still running
2023-10-16 06:14:10.986000 INFO [shub_workflow.graph] Job consumers (None) still running

Proposal

I'm not sure what the best way to address this is. On one hand, this is a transient issue in ScrapyCloud/JobQ, so the root cause is external to shub_workflow. On the other hand, this API has been unstable for quite a while, so it could be worth handling on our side, at least to keep jobs from getting stuck.

I think that if self.run_job() returns None in GraphManager, we should probably not enter the waiting loop for that task, because we won't be tracking its associated job ID either way (a sketch of this guard follows below).
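
A minimal sketch of that guard, assuming run_pending_jobs() is shaped roughly like a scheduling loop (the surrounding structure here is an assumption, not the real code):

```python
import logging

logger = logging.getLogger(__name__)

def run_pending_jobs(self, tasks):
    for task in tasks:
        job_id = self.run_job(task)
        if job_id is None:
            # Hypothetical guard: don't track a missing job key, so
            # check_running_jobs() can't end up waiting on it forever.
            logger.warning("No job key returned for task %s; not tracking it.", task)
            continue
        self.running_jobs.append(job_id)
```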

I also wonder whether the job ID of the already scheduled job could be retrieved, so GraphManager can recover from this situation without having to be restarted manually.
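
Since scheduled jobs carry FLOW_ID, PARENT_NAME, and TASK_ID tags (visible in the logs above), one possible recovery is to look up the already scheduled job by those tags through python-scrapinghub. A rough, untested sketch; find_already_scheduled_job is a hypothetical helper, and I'm assuming jobs.iter() returns the job's tags when requested via meta:

```python
from scrapinghub import ScrapinghubClient

def find_already_scheduled_job(apikey, project_id, flow_id, task_id):
    # Hypothetical helper: recover the key of the job that JobQ reported
    # as "already scheduled", using the workflow tags added on scheduling.
    project = ScrapinghubClient(apikey).get_project(project_id)
    for state in ("pending", "running"):
        for job in project.jobs.iter(state=state,
                                     has_tag=f"FLOW_ID={flow_id}",
                                     meta=["tags"]):
            if f"TASK_ID={task_id}" in job.get("tags", []):
                return job["key"]  # e.g. "265529/71/10350"
    return None
```

If that works, GraphManager could substitute the recovered key for the None job_id and keep tracking the task normally.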