Fixed in this commit: https://github.com/scrapinghub/hcf-backend/commit/739c9b5ae5424d932aa9e497111bb09037a1c342
I released a new version (0.5.2.3).
Thanks for reporting!
@kalessin Thanks for the quick turnaround! I think there might be some lingering issue; I'm getting this new error now with the latest hcf-backend version:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/sh_scrapy/crawl.py", line 148, in _run_usercode
    _run(args, settings)
  File "/usr/local/lib/python3.10/site-packages/sh_scrapy/crawl.py", line 105, in _run
    _run_pkgscript(args)
  File "/usr/local/lib/python3.10/site-packages/sh_scrapy/crawl.py", line 128, in _run_pkgscript
    d.run_script(scriptname, {'__name__': '__main__'})
  File "/usr/local/lib/python3.10/site-packages/pkg_resources/__init__.py", line 1472, in run_script
    exec(code, namespace, namespace)
  File "/tmp/unpacked-eggs/__main__.egg/EGG-INFO/scripts/hcfmanager.py", line 16, in <module>
    manager.run()
  File "/app/python/lib/python3.10/site-packages/shub_workflow/script.py", line 527, in run
    for loop_result in self._run_loops():
  File "/app/python/lib/python3.10/site-packages/shub_workflow/script.py", line 553, in _run_loops
    yield self.workflow_loop()
  File "/app/python/lib/python3.10/site-packages/hcf_backend/utils/crawlmanager.py", line 54, in workflow_loop
    for job in self.get_owned_jobs(
  File "/app/python/lib/python3.10/site-packages/shub_workflow/base.py", line 107, in get_owned_jobs
    meta.append("tags")
AttributeError: 'str' object has no attribute 'append'
```
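For context, here is a minimal reduction of that failure mode. The signature below is hypothetical and much simpler than shub_workflow's actual `get_owned_jobs`; it only shows why `meta.append("tags")` blows up when `meta` arrives as a bare string instead of a list:

```python
from typing import List, Optional, Union

# Hypothetical reduction of the crash above; the real get_owned_jobs in
# shub_workflow.base takes different parameters and does much more.
def get_owned_jobs(meta: Optional[Union[str, List[str]]] = None) -> List[str]:
    if meta is None:
        meta = []
    elif isinstance(meta, str):
        # Without this normalization, a caller passing meta="spider_args"
        # hits: AttributeError: 'str' object has no attribute 'append'
        meta = [meta]
    meta.append("tags")
    return meta

print(get_owned_jobs("spider_args"))  # ['spider_args', 'tags'] rather than a crash
```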
Fixed and released a new version:
https://github.com/scrapinghub/hcf-backend/commit/d8643cc0cd711e8b7071f5c428998df53d1bbbef
Background
HCFCrawlManager's main workflow loop checks running or pending jobs of the same spider to determine which slots are available.
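In rough pseudocode, that check works like the sketch below (illustrative only; the function name, the `slot` key, and the job dicts are assumptions, not the actual hcf_backend.utils.crawlmanager code):

```python
from typing import Dict, Iterable, Set

# Illustrative sketch of the availability check described above; the real
# HCFCrawlManager.workflow_loop differs in names and structure.
def available_slots(all_slots: Iterable[str], owned_jobs: Iterable[Dict]) -> Set[str]:
    slots = set(all_slots)
    for job in owned_jobs:
        # Assumption: each running/pending job of the same spider records
        # its frontier slot under a "slot"-like key in its spider arguments.
        slots.discard(job["spider_args"].get("slot"))
    return slots

jobs = [{"spider_args": {"slot": "myfrontier/0"}}]
print(available_slots({"myfrontier/0", "myfrontier/1"}, jobs))  # {'myfrontier/1'}
```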
Issue
That loop doesn't consider whether those jobs belong to the same workflow as the root script. Jobs of the same spider that run outside of HCFCrawlManager, or in another instance of HCFCrawlManager (using a different frontier, for example), are also counted. The first case is problematic because those jobs might not have `spider_args`, so the call `job["spider_args"]` will throw a KeyError. The second case is problematic because we might remove slots from the available list just because they share names, even though they belong to another frontier. A defensive variant is sketched below.
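A guarded variant of the earlier sketch would skip both kinds of foreign jobs (the key names `frontier` and `slot` are hypothetical; this is not the actual fix):

```python
from typing import Dict, Iterable, Set

# Hypothetical defensive version of the availability check; the key names
# ("frontier", "slot") are assumptions about the job metadata.
def available_slots(all_slots: Iterable[str], jobs: Iterable[Dict], frontier: str) -> Set[str]:
    slots = set(all_slots)
    for job in jobs:
        spider_args = job.get("spider_args")
        if spider_args is None:
            continue  # job launched outside the crawl manager: no spider_args
        if spider_args.get("frontier") != frontier:
            continue  # job belongs to a different frontier/workflow instance
        slots.discard(spider_args.get("slot"))
    return slots
```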
Replicate
I ran a script similar to the MyArticlesGraphManager described in https://github.com/scrapinghub/shub-workflow/wiki/Graph-Managers-with-HCF, and while it was in the consumers/scrapers stage, I ran a regular job of the same spider outside the script. hcf_crawlmanager.py crashed because it considered that job, which didn't have a spider_args argument, and couldn't recover from it.
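For reference, one way to launch such a standalone job is through the python-scrapinghub client (the API key, project ID, and spider name below are placeholders):

```python
from scrapinghub import ScrapinghubClient

# Placeholders: substitute your own API key, project ID, and spider name.
client = ScrapinghubClient("APIKEY")
project = client.get_project(12345)

# Schedule a plain job of the same spider, outside the HCFCrawlManager
# workflow, i.e. with no spider_args set by the manager.
job = project.jobs.run("myspider")
print(job.key)
```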