scrapinghub / hcf-backend

Crawl Frontier HCF backend
BSD 3-Clause "New" or "Revised" License

HCFCrawlManager should only consider jobs that are part of the workflow #25

Closed curita closed 1 year ago

curita commented 1 year ago

Background

HCFCrawlManager's main workflow loop checks running or pending jobs of the same spider to determine which slots are available.

    def workflow_loop(self):
        available_slots = self.print_frontier_status()

        running_jobs = 0
        states = "running", "pending"
        for state in states:
            for job in self.get_project().jobs.list(
                spider=self.args.spider, state=state, meta="spider_args"
            ):
                frontera_settings_json = json.loads(
                    job["spider_args"].get("frontera_settings_json", "{}")
                )
                if "HCF_CONSUMER_SLOT" in frontera_settings_json:
                    slot = frontera_settings_json["HCF_CONSUMER_SLOT"]
                    if slot in available_slots:
                        available_slots.discard(slot)
                        running_jobs += 1

        ...
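
For reference, the metadata the loop inspects looks roughly like this for a consumer job scheduled by HCFCrawlManager (the values here are made up for illustration):

    # Illustrative only: spider_args as stored in the job metadata.
    # frontera_settings_json is a JSON string; the loop parses it and reads HCF_CONSUMER_SLOT.
    spider_args = {
        "frontera_settings_json": '{"HCF_CONSUMER_FRONTIER": "test-frontier", "HCF_CONSUMER_SLOT": "0"}'
    }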

Issue

That loop doesn't consider whether those jobs belong to the same workflow as the root script. Jobs of the same spider that run outside of HCFCrawlManager, or in another HCFCrawlManager instance (using a different frontier, for example), are also counted. The first case is problematic because those jobs might not have spider_args, so the job["spider_args"] access raises a KeyError. The second case is problematic because slots can be removed from the available list when they share names, even though they belong to another frontier.
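
A minimal sketch of the kind of guard that would sidestep both cases (not necessarily the right fix; self.hcf_frontier is a hypothetical attribute holding this manager's frontier name):

    # Sketch only: skip jobs without spider_args and jobs that consume a
    # different frontier than this manager (self.hcf_frontier is hypothetical).
    for state in ("running", "pending"):
        for job in self.get_project().jobs.list(
            spider=self.args.spider, state=state, meta="spider_args"
        ):
            spider_args = job.get("spider_args")
            if not spider_args:
                continue  # scheduled outside any crawl manager
            frontera_settings = json.loads(spider_args.get("frontera_settings_json", "{}"))
            if frontera_settings.get("HCF_CONSUMER_FRONTIER") != self.hcf_frontier:
                continue  # belongs to another frontier/workflow
            slot = frontera_settings.get("HCF_CONSUMER_SLOT")
            if slot in available_slots:
                available_slots.discard(slot)
                running_jobs += 1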

Replicate

I ran a script similar to the MyArticlesGraphManager described in https://github.com/scrapinghub/shub-workflow/wiki/Graph-Managers-with-HCF, and while it was in the consumers/scrapers stage, I ran a regular job of the same spider outside the script. py:hcf_crawlmanager.py crashed because it picked up that job, which didn't have spider_args, and it couldn't recover from the error.

kalessin commented 1 year ago

Fixed in this commit: https://github.com/scrapinghub/hcf-backend/commit/739c9b5ae5424d932aa9e497111bb09037a1c342

I released a new version (0.5.2.3).

Thanks for reporting!

curita commented 1 year ago

@kalessin Thanks for the quick turnaround! I think there might be a lingering issue; I'm getting this new error with the latest hcf-backend version:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/site-packages/sh_scrapy/crawl.py", line 148, in _run_usercode
        _run(args, settings)
      File "/usr/local/lib/python3.10/site-packages/sh_scrapy/crawl.py", line 105, in _run
        _run_pkgscript(args)
      File "/usr/local/lib/python3.10/site-packages/sh_scrapy/crawl.py", line 128, in _run_pkgscript
        d.run_script(scriptname, {'__name__': '__main__'})
      File "/usr/local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 1472, in run_script
        exec(code, namespace, namespace)
      File "/tmp/unpacked-eggs/__main__.egg/EGG-INFO/scripts/hcfmanager.py", line 16, in <module>
        manager.run()
      File "/app/python/lib/python3.10/site-packages/shub_workflow/script.py", line 527, in run
        for loop_result in self._run_loops():
      File "/app/python/lib/python3.10/site-packages/shub_workflow/script.py", line 553, in _run_loops
        yield self.workflow_loop()
      File "/app/python/lib/python3.10/site-packages/hcf_backend/utils/crawlmanager.py", line 54, in workflow_loop
        for job in self.get_owned_jobs(
      File "/app/python/lib/python3.10/site-packages/shub_workflow/base.py", line 107, in get_owned_jobs
        meta.append("tags")
    AttributeError: 'str' object has no attribute 'append'
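
From the traceback it looks like meta reaches get_owned_jobs as a plain string (as in the original snippet above), and meta.append("tags") then fails because strings have no append. Something like this hypothetical call shape, with meta as a list, would avoid it:

    # Sketch: meta must be a mutable list here, since get_owned_jobs appends "tags" to it.
    for job in self.get_owned_jobs(
        spider=self.args.spider, state=state, meta=["spider_args"]
    ):
        ...
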
kalessin commented 1 year ago

Fixed and released a new version.

https://github.com/scrapinghub/hcf-backend/commit/d8643cc0cd711e8b7071f5c428998df53d1bbbef