simplecto / sitemap_grabber

A python library to recursively crawl every sitemap.xml and every url in those sitemaps.
MIT License

sitemap.xml validation needs to be more loose (Example inside) #6

Open undernewmanagement opened 3 weeks ago

undernewmanagement commented 3 weeks ago

Have a look at this sitemap: https://blockworks.co/price-sitemap-index.xml

We currently validate on the first line of the sitemap. The check is in the sitemap_grabber library: https://github.com/simplecto/sitemap_grabber/blob/029817ecadc1d9408e222e971e69eb5c99f7d9c7/sitemap_grabber/sitemap_grabber.py#L69

Its output from https://blockworks.co/price-sitemap-index.xml is below:

Image
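
One possible direction for a looser check, as a rough sketch only: instead of inspecting the first line, parse the document and look at the root element. The function name `looks_like_sitemap` and the constant `SITEMAP_ROOT_TAGS` are hypothetical and not part of sitemap_grabber. The sketch assumes the response body is already available as `str` or `bytes`, and that a sitemap is anything whose root element is `urlset` or `sitemapindex`, with or without a namespace.

```python
import xml.etree.ElementTree as ET
from typing import Union

# Hypothetical constant: root element names accepted as a sitemap.
SITEMAP_ROOT_TAGS = {"urlset", "sitemapindex"}


def looks_like_sitemap(content: Union[str, bytes]) -> bool:
    """Return True if the content parses as XML with a sitemap root element.

    Hypothetical helper, not the library's current behaviour. It tolerates
    a byte-order mark, leading whitespace, a missing XML declaration, and
    comments or processing instructions before the root element.
    """
    if isinstance(content, bytes):
        # utf-8-sig drops a leading BOM if one is present.
        content = content.decode("utf-8-sig", errors="replace")

    # Strip a stray BOM and leading whitespace; the parser rejects
    # anything that appears before an XML declaration.
    text = content.lstrip("\ufeff \t\r\n")
    if not text:
        return False

    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False

    # Namespaced tags look like "{namespace}urlset"; keep the local name.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name.lower() in SITEMAP_ROOT_TAGS
```

This trades a cheap string comparison for a full parse, so it accepts more real-world files but costs more per document. For untrusted input, a hardened XML parser may be a better fit than the standard library one.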