raviqqe / muffet

Fast website link checker in Go
MIT License
2.5k stars 97 forks source link

Allow application/xml and text/xml when parsing sitemaps #325

Closed nwidger closed 1 year ago

nwidger commented 1 year ago

Update (*sitemapPageParser).Parse to support both application/xml and text/xml as valid MIME types for sitemaps, previously only application/xml was considered valid. However, popular HTTP servers such as Caddy (https://caddyserver.com) return a Content-Type: text/xml; charset=utf-8 response header when serving a sitemap.xml sitemap file. Currently, it is not possible to use /sitemap.xml as the root page with muffet when serving a site with an HTTP server such as Caddy due to muffet rejecting the Content-Type header and thus refusing to parse the site's sitemap file.

The official website for the sitemap protocol (https://www.sitemaps.org) does not explictly state any specific Content-Type that should be used when serving sitemap files. However, the IANA's Media Types registry (https://www.iana.org/assignments/media-types/media-types.xhtml) lists both application/xml and text/xml as being valid media types for xml. Additionally, the abstract for RFC 7303 ("XML Media Types", https://www.rfc-editor.org/rfc/rfc7303.html), the reference listed in the IANA registry for both application/xml and text/xml media types, explicitly states that text/xml is an alias for application/xml:

This specification standardizes three media types -- application/xml,
application/xml-external-parsed-entity, and application/xml-dtd --
for use in exchanging network entities that are related to the
Extensible Markup Language (XML) while defining text/xml and text/
xml-external-parsed-entity as aliases for the respective application/
types.  This specification also standardizes the '+xml' suffix for
naming media types outside of these five types when those media types
represent XML MIME entities.

RFC 7303 Section 9.2 ("text/xml Registration") also states:

The registration information for text/xml is in all respects the same
as that given for application/xml above (Section 9.1), except that
the "Type name" is "text".

Hopefuly, this is enough to warrant allowing text/xml in addition to application/xml. A new TestSitemapPageParserParsePageMimeTypeAlias test has been added to ensure that (*sitemapPageParser).Parse doesn't return an error when its typ argument is text/xml.

nwidger commented 1 year ago

@raviqqe Sorry for not catching this earlier! We tested the changes you made for #316 running muffet against a preview version of our site served with hugo server, which is how we check for broken links in our CI pipeline. hugo server returns a Content-Type: application/xml; charset=utf-8 header when serving a sitemap.xml (I believe this is due to this check here in hugo's source code), thus in CI the new v2.9.1 release of muffet works great. However, the production site is served using Caddy, which as mentioned above returns a Content-Type: text/xml; charset=utf-8 header when serving a sitemap.xml file. That means that if we try to run muffet v2.9.1 against our production site, muffet returns with an root page has invalid content type error.

codecov[bot] commented 1 year ago

Codecov Report

Merging #325 (82e52af) into main (4536090) will not change coverage. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #325   +/-   ##
=======================================
  Coverage   87.34%   87.34%           
=======================================
  Files          30       30           
  Lines         861      861           
=======================================
  Hits          752      752           
  Misses         88       88           
  Partials       21       21           
Impacted Files Coverage Δ
sitemap_page_parser.go 100.00% <100.00%> (ø)

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

raviqqe commented 1 year ago

Thanks for your contribution!

nwidger commented 1 year ago

@raviqqe Thank you for the quick merge & release! :+1: