Crawler ignoring `base` tag

p3lim commented 3 years ago

When crawling a website and finding RSS links, the <base> element should be taken into account, if present.

Example: https://www.mmo-champion.com/ redirects to the /content/ path, and the <link rel="alternate" ...> is a relative link, but the site also has <base href="https://www.mmo-champion.com/" /> which should be taken into account. Without this, attempting to add the feed will fail.
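To make the failure concrete, here is a minimal sketch in Go using only the standard `net/url` package. It is not Miniflux's code: the `resolve` helper and the feed path `feed.xml` are hypothetical stand-ins for illustration, since the site's actual `href` value is not quoted above.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve is a hypothetical helper: it resolves href against base
// following the RFC 3986 rules implemented by net/url.
func resolve(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	h, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(h).String(), nil
}

func main() {
	const href = "feed.xml" // assumed relative feed link, not the site's real value

	// Resolved against the page URL reached after the redirect.
	fromPage, _ := resolve("https://www.mmo-champion.com/content/", href)
	// Resolved against the URL declared in <base href="...">.
	fromBase, _ := resolve("https://www.mmo-champion.com/", href)

	fmt.Println(fromPage) // https://www.mmo-champion.com/content/feed.xml
	fmt.Println(fromBase) // https://www.mmo-champion.com/feed.xml
}
```

The two results differ, so a crawler that only knows the page URL ends up requesting a feed at the wrong path.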

p3lim commented 3 years ago

They have since added a / prefix, so the link no longer depends on the page path and the site can't be used for testing. The issue should still be considered for other sites that have the same structure.

martinetd commented 3 years ago

If you need another example, megatokyo.com has the same issue for its comics: https://megatokyo.com/strip/1586 has a relative link to strips/1586.png, but there's a <base href="https://megatokyo.com/"> (feed URL: https://megatokyo.com/rss/megatokyo.xml).

Looking at the code a bit, it's not trivial to fix. sanitizer.Sanitize only takes a URL, so we could pass it the resolved base instead, but the processor doesn't have any straightforward way of getting the base URL back from the scraper. It might be worth sanitizing once in the scraper and re-sanitizing at the end for safety? (FWIW, https://github.com/PuerkitoBio/gocrawl/pull/46/files#diff-a693fca73f07436af23c207f04d5a5b7L362 gives an example of respecting the base URL with goquery.)
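As a rough sketch of what "the resolved base" could mean, the helper below computes an effective base URL once, which could then be handed to whatever does the sanitizing. It uses only the standard library; `effectiveBase` and `resolveInDocument` are hypothetical names, not functions from Miniflux or goquery, and extracting the `href` from the HTML is assumed to happen elsewhere.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// effectiveBase returns the URL that relative links in a document should be
// resolved against. baseHref is the value of the document's <base href>, or
// "" when there is none. A relative baseHref is itself resolved against the
// page URL; an empty or unparsable one falls back to the page URL.
func effectiveBase(pageURL, baseHref string) (*url.URL, error) {
	page, err := url.Parse(pageURL)
	if err != nil {
		return nil, err
	}
	baseHref = strings.TrimSpace(baseHref)
	if baseHref == "" {
		return page, nil
	}
	base, err := url.Parse(baseHref)
	if err != nil {
		return page, nil // ignore a malformed <base href>
	}
	return page.ResolveReference(base), nil
}

// resolveInDocument resolves link the way a browser would for a page at
// pageURL that declares baseHref.
func resolveInDocument(pageURL, baseHref, link string) (string, error) {
	base, err := effectiveBase(pageURL, baseHref)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(link)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	page := "https://megatokyo.com/strip/1586"

	withBase, _ := resolveInDocument(page, "https://megatokyo.com/", "strips/1586.png")
	withoutBase, _ := resolveInDocument(page, "", "strips/1586.png")

	fmt.Println(withBase)    // https://megatokyo.com/strips/1586.png
	fmt.Println(withoutBase) // https://megatokyo.com/strip/strips/1586.png
}
```

Computing the base in one place keeps the fallback rules out of the sanitizer, which would only ever see an absolute URL.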

miniflux/v2, issue #757