pa11y / pa11y-ci

Pa11y CI is a CI-centric accessibility test runner, built using Pa11y
https://pa11y.org
GNU Lesser General Public License v3.0
520 stars 64 forks

Support for large sitemaps when sitemap.xml is using sitemap index files #194

Closed (jamesmacwhite closed this issue 11 months ago)

jamesmacwhite commented 1 year ago

A larger website might serve a sitemap index as its sitemap.xml, linking to a separate sitemap file for each section instead of listing URLs directly.

The structure for my example is roughly this:

https://developers.google.com/search/docs/crawling-indexing/sitemaps/large-sitemaps

Each section contains the URL data. It would appear pa11y-ci cannot parse the sitemap.xml as it is expecting URLs immediately. If I provide one of the section.xml paths, it works.

It would be good if pa11y-ci could parse a sitemap.xml that is a sitemap index and work through each of the sitemaps it links to.
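To illustrate the behaviour being requested, here is a minimal sketch of unrolling a sitemap index into a flat list of page URLs. It is not pa11y-ci's actual code: the example.com URLs, the in-memory `documents` map (standing in for HTTP fetches), and the helper names `extractLocs` and `collectUrls` are all hypothetical, and the regex-based extraction is a simplification that a real XML parser would replace.

```javascript
// Hypothetical in-memory "site": maps each sitemap URL to its XML body.
// A real runner would fetch these over HTTP instead.
const documents = {
  'https://example.com/sitemap.xml': `
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap><loc>https://example.com/sitemap-courses.xml</loc></sitemap>
      <sitemap><loc>https://example.com/sitemap-news.xml</loc></sitemap>
    </sitemapindex>`,
  'https://example.com/sitemap-courses.xml': `
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://example.com/courses</loc></url>
      <url><loc>https://example.com/courses/a-levels</loc></url>
    </urlset>`,
  'https://example.com/sitemap-news.xml': `
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://example.com/news</loc></url>
    </urlset>`
};

// Return the first <loc> value found inside each <parentTag>...</parentTag>
// element. Assumes unprefixed, attribute-free tags (a simplification).
function extractLocs(xml, parentTag) {
  const blockPattern = new RegExp(
    `<${parentTag}>([\\s\\S]*?)</${parentTag}>`, 'g'
  );
  const locs = [];
  for (const [, inner] of xml.matchAll(blockPattern)) {
    const loc = inner.match(/<loc>\s*([^<]+?)\s*<\/loc>/);
    if (loc) {
      locs.push(loc[1]);
    }
  }
  return locs;
}

// Collect page URLs from a sitemap, following <sitemap> entries recursively.
// The `seen` set guards against sitemaps that reference each other.
function collectUrls(sitemapUrl, getXml, seen = new Set()) {
  if (seen.has(sitemapUrl)) {
    return [];
  }
  seen.add(sitemapUrl);
  const xml = getXml(sitemapUrl) || '';
  const pageUrls = extractLocs(xml, 'url');
  const childSitemaps = extractLocs(xml, 'sitemap');
  return pageUrls.concat(
    childSitemaps.flatMap(child => collectUrls(child, getXml, seen))
  );
}

const urls = collectUrls(
  'https://example.com/sitemap.xml',
  url => documents[url]
);
console.log(urls);
```

Starting from the root sitemap.xml, the sketch finds no `<url>` entries, so it follows each `<sitemap>` entry and gathers the URLs from the section files. Tracking visited sitemaps means a nested or self-referencing index would not cause an endless loop.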

danyalaytekin commented 12 months ago

Hi @jamesmacwhite. I've read up briefly about this, probably by following a Google Search road you've long since walked so please bear with me.

It appears that a sitemap index file shouldn't list sitemap index files, only sitemaps. The URL you provided also suggests that multiple sitemap indexes should each be submitted to the Search Console.

Nested sitemap indexes would also fail Google's validation, apparently. Is yours passing there? As a comment pointed out there, I realise it does complicate things a bit that Google's own sitemap index illegally (?) contains at least one other sitemap index.

Overall, is this the scenario you were asking to be supported?

jamesmacwhite commented 12 months ago

Hi.

It has been a while since I posted, but what I was referring to was the fact that the root sitemap.xml didn't contain URLs, only links to other sitemap.xml files that in turn list the URLs. The live example of the site in question probably explains the setup most clearly:

https://www.nottinghamcollege.ac.uk/sitemaps-1-sitemap.xml

The main sitemap.xml (which redirects to the above) doesn't list URLs directly. It links to a separate sitemap.xml file per section, and each of those in turn provides the URLs for that section.

Hope that makes sense.

danyalaytekin commented 11 months ago

Hi @jamesmacwhite, ah I see now. Thanks for clarifying, and I realise it's been a while - sorry for this delay in following up.

I ran pa11y-ci just now against both sitemap.xml and the more direct sitemaps-1-sitemap.xml. It appeared to unroll the sitemap index to 1842 URLs in both cases, although I didn't complete the whole run:

$ pa11y-ci --sitemap https://www.nottinghamcollege.ac.uk/sitemap.xml

Running Pa11y on 1842 URLs:
 > https://www.nottinghamcollege.ac.uk/apply - 2 errors
 > https://www.nottinghamcollege.ac.uk/employers - 2 errors
 > https://www.nottinghamcollege.ac.uk/employers/apprenticeships - 2 errors

 ...

We do also have a test for the scenario: https://github.com/pa11y/pa11y-ci/blob/e7b7c17b4ec5fa5d3b52b539f15b520af470c0b2/test/integration/cli-sitemap.test.js#L91

Could you have been using a version of pa11y-ci older than 2.4?

jamesmacwhite commented 11 months ago

Thanks for this. I could have been, but glad it works!