simplecto / sitemap_grabber

A Python library to recursively crawl every sitemap.xml for a website. Also handles robots.txt and other well-known files.

Our XML parsing crashed #7

Open heysamtexas opened 2 hours ago

heysamtexas commented 2 hours ago

    Traceback (most recent call last):
      File "/lib/python3.12/xml/etree/ElementTree.py", line 1706, in feed
        self.parser.Parse(data, False)
    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 5808, column 55

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File ".venv/lib/python3.12/site-packages/sitemap_grabber/sitemap_grabber.py", line 114, in _process_sitemap_content
        root = fromstring(content)
               ^^^^^^^^^^^^^^^^^^^

    During handling of the above exception, another exception occurred:

    xml.parsers.expat.ExpatError: not well-formed (invalid token): line 120, column 78

    During handling of the above exception, another exception occurred:
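
A minimal, stdlib-only sketch of how the crash could be contained (only `_process_sitemap_content` and `fromstring` come from the traceback above; the wrapper name and fallback behavior are hypothetical):

    from xml.etree.ElementTree import ParseError, fromstring

    def parse_sitemap_safely(content: bytes):
        """Return the parsed root element, or None if the XML is malformed."""
        try:
            # fromstring() wraps expat errors in ElementTree.ParseError
            return fromstring(content)
        except ParseError:
            # Skip the broken sitemap instead of crashing the whole crawl
            return None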

heysamtexas commented 2 hours ago

Consider using BeautifulSoup and parsing it as HTML?

    from bs4 import BeautifulSoup
    import requests

    url = "http://example.com/sitemap.xml"
    response = requests.get(url)

    # The 'xml' feature is backed by lxml (which must be installed) and is
    # more tolerant of malformed markup than the stdlib expat parser
    soup = BeautifulSoup(response.content, 'xml')

    # Extract URLs from the sitemap
    urls = [loc.text for loc in soup.find_all('loc')]

    for url in urls:
        print(url)

Or lxml with recover mode?

    from lxml import etree
    import requests

    url = "http://example.com/sitemap.xml"
    response = requests.get(url)

    # Use a custom parser with recover mode, so invalid tokens are
    # repaired or dropped instead of raising XMLSyntaxError
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(response.content, parser)

    # Extract URLs from the sitemap. Sitemaps usually declare the default
    # namespace http://www.sitemaps.org/schemas/sitemap/0.9, so match on
    # local-name() rather than the bare tag name
    urls = root.xpath('//*[local-name()="loc"]/text()')

    for url in urls:
        print(url)
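
Worth noting: both suggestions pull in lxml as a dependency (BeautifulSoup's 'xml' feature requires it), so if keeping the library stdlib-only matters, catching the ParseError as sketched above and skipping the bad sitemap may be the lighter fix.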