rhgarcia / tropescraper

A tropes scraper
GNU Lesser General Public License v3.0

"Document is empty" error #19

Open JJ opened 3 years ago

JJ commented 3 years ago
INFO:tropescraper.adaptors.file_cache:Cache hit for https://tvtropes.org/pmwiki/pmwiki.php/Main/SuperpoweredRobotMeterMaids
Traceback (most recent call last):
  File "/home/jmerelo/.pyenv/versions/3.9.0/lib/python3.9/site-packages/tropescraper/use_cases/scrape_tropes_use_case.py", line 203, in extract_all_tropes_in_page_recursively
    self.extract_all_tropes_in_page_recursively(page, subtrope, recursivity_level)
  File "/home/jmerelo/.pyenv/versions/3.9.0/lib/python3.9/site-packages/tropescraper/use_cases/scrape_tropes_use_case.py", line 203, in extract_all_tropes_in_page_recursively
    self.extract_all_tropes_in_page_recursively(page, subtrope, recursivity_level)
  File "/home/jmerelo/.pyenv/versions/3.9.0/lib/python3.9/site-packages/tropescraper/use_cases/scrape_tropes_use_case.py", line 203, in extract_all_tropes_in_page_recursively
    self.extract_all_tropes_in_page_recursively(page, subtrope, recursivity_level)
  [Previous line repeated 992 more times]
  File "/home/jmerelo/.pyenv/versions/3.9.0/lib/python3.9/site-packages/tropescraper/use_cases/scrape_tropes_use_case.py", line 157, in extract_all_tropes_in_page_recursively
    links, films = self.parser.get_all_trope_links_and_paginations(page, trope_name)
  File "/home/jmerelo/.pyenv/versions/3.9.0/lib/python3.9/site-packages/tropescraper/use_cases/parsers/tvtropes_parser.py", line 69, in get_all_trope_links_and_paginations
    links = self._get_links_from_page(page, self.MAIN_RESOURCE, only_article=True, remove_link=False,
  File "/home/jmerelo/.pyenv/versions/3.9.0/lib/python3.9/site-packages/tropescraper/use_cases/parsers/tvtropes_parser.py", line 78, in _get_links_from_page
    tree = html.fromstring(page)
  File "/home/jmerelo/.pyenv/versions/3.9.0/lib/python3.9/site-packages/lxml/html/__init__.py", line 875, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/home/jmerelo/.pyenv/versions/3.9.0/lib/python3.9/site-packages/lxml/html/__init__.py", line 763, in document_fromstring
    raise etree.ParserError(
lxml.etree.ParserError: Document is empty

It apparently can't recover from this one, and I don't know which document is empty. Is there any way to find out where it's failing?
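
One possible way to narrow this down, as a rough sketch rather than anything taken from tropescraper itself: wrap the parse call so that an empty body is reported together with the URL it came from. The names `parse_page`, `EmptyPageError`, and the `empty_page_probe` logger are hypothetical, and the parser is passed in as a callable (for example lxml's `html.fromstring`) so the helper does not assume how the scraper is wired. The last "Cache hit" line logged before the traceback is also a plausible first candidate to inspect.

```python
import logging

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("empty_page_probe")


class EmptyPageError(ValueError):
    """Raised when a page body is empty, naming the URL it came from."""


def parse_page(page, url, parse):
    """Parse `page` with `parse`, attaching `url` to any failure.

    `parse` is whatever callable does the real parsing (for example
    lxml's `html.fromstring`); it is injected so this helper has no
    dependency on lxml or on tropescraper internals.
    """
    # Catch the empty or whitespace-only case before the parser sees it,
    # so the error message can name the offending URL.
    if page is None or not page.strip():
        raise EmptyPageError(f"Empty document for {url}")
    try:
        return parse(page)
    except Exception as exc:
        # Log the URL for any other parse failure, then re-raise unchanged.
        logger.error("Could not parse %s: %s", url, exc)
        raise
```

Called as `parse_page(page, url, html.fromstring)` at the point where the page is parsed, this would turn the anonymous "Document is empty" into an error that names the page, assuming the URL is available at that point in the call chain.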