Open heysamtexas opened 2 hours ago
Consider using BeautifulSoup and parsing it as HTML?
from bs4 import BeautifulSoup
import requests
url = "http://example.com/sitemap.xml"
response = requests.get(url)
# The 'xml' feature requires lxml to be installed
soup = BeautifulSoup(response.content, 'xml')
# Extract URLs from the sitemap
urls = [loc.text for loc in soup.find_all('loc')]
for url in urls:
    print(url)
Or lxml with recover mode?
from lxml import etree
import requests
url = "http://example.com/sitemap.xml"
response = requests.get(url)
# Use a custom parser with recover mode
parser = etree.XMLParser(recover=True)
root = etree.fromstring(response.content, parser)
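# Extract URLs from the sitemap.
# local-name() matches <loc> even when the sitemap declares a default
# namespace; a plain '//loc' would return nothing in that case.
urls = root.xpath("//*[local-name()='loc']/text()")
for url in urls:
    print(url)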
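Either suggestion could be wired in as a fallback, so well-formed sitemaps still go through the strict parser and only broken ones take the lenient path. Below is a minimal sketch of that shape, not the library's actual code. The helper name `extract_locs` is hypothetical, and the lenient step is a plain stdlib regex scan (swapped in here only to keep the sketch dependency-free); the BeautifulSoup or `etree.XMLParser(recover=True)` call above would slot into the same `except` branch.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

# Matches <loc>...</loc>, with an optional namespace prefix such as <ns:loc>.
LOC_RE = re.compile(rb"<(?:\w+:)?loc\s*>(.*?)</(?:\w+:)?loc\s*>", re.DOTALL)


def extract_locs(content: bytes) -> list[str]:
    """Hypothetical helper: strict parse first, lenient scan on failure."""
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        # Lenient fallback: scan the raw bytes so that one bad token
        # does not discard every URL in the file.
        return [
            unescape(raw.decode("utf-8", "replace").strip())
            for raw in LOC_RE.findall(content)
        ]
    # "{*}loc" matches <loc> in any (or no) namespace (Python 3.8+).
    return [el.text.strip() for el in root.findall(".//{*}loc") if el.text]
```

Calling `extract_locs(response.content)` would then replace the bare `fromstring(content)` call. Note that both branches also pick up `<loc>` elements from sitemap extensions that reuse that tag name.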
Traceback (most recent call last):
  File "/lib/python3.12/xml/etree/ElementTree.py", line 1706, in feed
    self.parser.Parse(data, False)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 5808, column 55
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File ".venv/lib/python3.12/site-packages/sitemap_grabber/sitemap_grabber.py", line 114, in _process_sitemap_content
    root = fromstring(content)
           ^^^^^^^^^^^^^^^^^^^
During handling of the above exception, another exception occurred:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 120, column 78
During handling of the above exception, another exception occurred:
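The line and column numbers in these errors point at the offending token, so it may help to print the surrounding text before deciding which lenient parser to use. A rough sketch, assuming the raw sitemap bytes are at hand; `describe_parse_error` and its `context` parameter are hypothetical names, not part of sitemap_grabber:

```python
import xml.etree.ElementTree as ET
from typing import Optional


def describe_parse_error(content: bytes, context: int = 30) -> Optional[str]:
    """Return the text around the first XML syntax error, or None if well-formed."""
    try:
        ET.fromstring(content)
    except ET.ParseError as exc:
        # ParseError.position is (line, column): line is 1-based, column 0-based.
        line_no, col = exc.position
        # Split on "\n" only, which is how expat counts lines.
        line = content.split(b"\n")[line_no - 1]
        snippet = line[max(0, col - context): col + context]
        return f"line {line_no}, column {col}: {snippet.decode('utf-8', 'replace')}"
    return None
```

A bare `&` inside a `<loc>` URL is a common cause of "invalid token", and a snippet like this makes that easy to confirm.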