openzim / ifixit

iFixit to ZIM scraper
GNU General Public License v3.0

ifixit2zim.exceptions.CategoryHomePageContentError: No footer stats found with selector 'div.footer-stats div' #88

Closed. kelson42 closed this issue 1 year ago

kelson42 commented 1 year ago

From https://farm.openzim.org/pipeline/f02f2d795a43e1404ca24a36/debug

[MainThread::2022-12-22 10:03:09,648] INFO:testing S3 Optimization Cache credentials
[MainThread::2022-12-22 10:03:12,642] INFO:Starting scraper with:
  language: Korean (ko.ifixit.com)
  output_dir: /output
  build_dir: /output/ifixit_ko_81mf9txz

  using cache: s3.us-west-1.wasabisys.com with bucket: org-kiwix-ifixit
[MainThread::2022-12-22 10:03:12,643] INFO:Fetching website metadata
[MainThread::2022-12-22 10:03:13,016] ERROR:FAILED. An error occurred: No footer stats found with selector 'div.footer-stats div'
[MainThread::2022-12-22 10:03:13,016] ERROR:No footer stats found with selector 'div.footer-stats div'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/ifixit2zim-0.2.3-py3.8.egg/ifixit2zim/entrypoint.py", line 266, in main
    sys.exit(scraper.run())
  File "/usr/local/lib/python3.8/site-packages/ifixit2zim-0.2.3-py3.8.egg/ifixit2zim/scraper.py", line 167, in run
    Global.metadata = self.scraper_homepage.get_online_metadata()
  File "/usr/local/lib/python3.8/site-packages/ifixit2zim-0.2.3-py3.8.egg/ifixit2zim/scraper_homepage.py", line 518, in get_online_metadata
    "footer_stats": self._extract_footer_stats_from_page(soup),
  File "/usr/local/lib/python3.8/site-packages/ifixit2zim-0.2.3-py3.8.egg/ifixit2zim/scraper_homepage.py", line 431, in _extract_footer_stats_from_page
    raise CategoryHomePageContentError(
ifixit2zim.exceptions.CategoryHomePageContentError: No footer stats found with selector 'div.footer-stats div'

This seems to be a new crash scenario. Maybe a change in the upstream HTML?

benoit74 commented 1 year ago

Yep, this is probably pretty easy to fix; the scraper just needs to be adapted to the upstream HTML changes.

I can have a look into it pretty soon, probably this afternoon.

kelson42 commented 1 year ago

@benoit74 Thanks!

benoit74 commented 1 year ago

This was indeed due to upstream HTML changes.
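
For readers hitting a similar crash, here is a minimal sketch of one way to make this kind of extraction more tolerant of upstream markup changes: try several candidate container classes in order and only raise once all of them fail. This is not the actual ifixit2zim fix. `FooterStatsError`, `extract_footer_stats` and the `footer-statistics` class name are hypothetical, and the sketch uses the standard library `html.parser` and only looks at `div` elements nested directly inside the container, which is a simplification of the `div.footer-stats div` selector from the traceback.

```python
from html.parser import HTMLParser


class FooterStatsError(Exception):
    """Hypothetical stand-in for the scraper's CategoryHomePageContentError."""


class _ChildDivCollector(HTMLParser):
    """Collect the text of each <div> nested directly in a container <div>."""

    def __init__(self, container_class):
        super().__init__()
        self.container_class = container_class
        self.depth = 0               # number of currently open <div> elements
        self.container_depth = None  # depth at which the container was opened
        self.stats = []
        self._buffer = None          # text pieces of the child <div> being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        self.depth += 1
        classes = (dict(attrs).get("class") or "").split()
        if self.container_depth is None:
            if self.container_class in classes:
                self.container_depth = self.depth
        elif self.depth == self.container_depth + 1:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "div":
            return
        if self.container_depth is not None:
            if self.depth == self.container_depth + 1 and self._buffer is not None:
                # A direct child <div> just closed: normalise its whitespace.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.stats.append(text)
                self._buffer = None
            elif self.depth == self.container_depth:
                self.container_depth = None
        self.depth -= 1


def extract_footer_stats(html, candidate_classes=("footer-stats", "footer-statistics")):
    """Return footer stats using the first candidate class that matches.

    "footer-statistics" is an invented example of a renamed upstream class.
    """
    for css_class in candidate_classes:
        collector = _ChildDivCollector(css_class)
        collector.feed(html)
        collector.close()
        if collector.stats:
            return collector.stats
    raise FooterStatsError(
        f"No footer stats found with container classes {candidate_classes!r}"
    )
```

Keeping the candidates in a tuple means a future upstream rename only requires adding one entry, while the error message still reports everything that was tried.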