pelican-plugins / sitemap

Generates a site map for Pelican-powered sites

Invalid URLs generated #32

Open mafr opened 6 months ago

mafr commented 6 months ago

When changing ARTICLE_URL or PAGE_URL to {slug} in Pelican's configuration, the generated sitemap still contains URLs with an .html suffix.

Some background: When hosting a static site on Cloudflare Pages, the platform automatically adds redirects from /some-page.html to /some-page. Even though the site technically works with {slug}.html, Google's Search Console rejects all the URLs in the sitemap that return redirects.

justinmayer commented 6 months ago

Hi Matthias. I can't reproduce this. While I normally use a trailing slash in these cases (e.g., {slug}/), even when I set PAGE_URL to {slug}, there are no .html suffixes appended in the generated sitemap.

I suggest using the pelican-quickstart command to generate a new site with completely stock settings, theme, etc. Then add the plugin to that site and see if you can reproduce the issue you are seeing. If you can't reproduce it, then you know something else is awry in your other site configuration/environment.

mafr commented 6 months ago

Hi Justin, thank you so much for your response! I was able to reproduce this in a minimal setup: https://github.com/mafr/sitemap-bug

This is right out of pelican-quickstart; I only changed ARTICLE_URL and added one article. The generated sitemap contains .html suffixes, e.g. <loc>https://blog.mafr.de/first-article.html</loc>, while the rendered index.html contains a link without the suffix.

justinmayer commented 6 months ago

Thank you for taking the time to post your test site in a visible location. That allowed me to see where your configuration differs from mine.

Try adding the following line to the end of your configuration file:

ARTICLE_SAVE_AS = '{slug}/index.html'

The *_URL settings dictate how links are generated, whereas the *_SAVE_AS settings control the location and file names of the generated files. The Sitemap plugin relies on the latter when determining what should appear within the <loc> … </loc> tags.
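To make the distinction concrete, here is a minimal sketch. It assumes ARTICLE_URL is set to '{slug}' as in the test site and that ARTICLE_SAVE_AS is still a '{slug}.html' pattern (consistent with the sitemap output reported above). The slug value and variable names are illustrative, and the string formatting only mimics what Pelican does with these settings.

```python
# Assumed values: ARTICLE_URL as changed in the test site; ARTICLE_SAVE_AS
# still a '{slug}.html' pattern (an assumption, consistent with the sitemap).
ARTICLE_URL = "{slug}"
ARTICLE_SAVE_AS = "{slug}.html"

slug = "first-article"  # example slug, for illustration only

# *_URL decides how links to the article are written in rendered pages.
link_in_pages = ARTICLE_URL.format(slug=slug)

# *_SAVE_AS decides where the generated file lands in the output directory.
file_on_disk = ARTICLE_SAVE_AS.format(slug=slug)

print(link_in_pages)  # first-article
print(file_on_disk)   # first-article.html
```

Because the two values differ, a sitemap derived from the saved file path will not match the links in the rendered pages.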

mafr commented 6 months ago

Much appreciated! The winning combination is these settings:

ARTICLE_URL = '{slug}/'
ARTICLE_SAVE_AS = '{slug}/index.html'

That means in order to get a working sitemap on Cloudflare Pages, I have to override all the *_URL and *_SAVE_AS settings and change my relative links. Some files are theme-dependent and can't always be influenced (authors.html etc.), but I can work around that by creating my own theme.

Maybe it would be better if the plugin used the *_URL settings -- after all, the sitemap URLs are links. That would simplify things for users and prevent us from breaking the sitemap without noticing.

justinmayer commented 6 months ago

I believe the aforementioned behavior was introduced as part of the refactoring endeavor in #3, in which the pathname2url function is used to infer the URL from the file path. As for why, I imagine that's answered in the first bullet point in that PR:

Act on every content_written signal to avoid guessing what pages to cover.

… but perhaps @kernc may be able to shed more light on the subject?
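As a rough illustration of what inferring the URL from the file path means, the sketch below maps an output-relative path to a URL with the standard library's pathname2url. The helper name to_url_sketch and its parameters are hypothetical; this is not the plugin's actual code, only the general idea.

```python
from urllib.request import pathname2url


def to_url_sketch(siteurl, relative_path):
    """Hypothetical helper: build a <loc> value from an output-relative path."""
    # pathname2url converts a local path into the path part of a URL
    # (percent-quoting special characters as needed).
    url_path = pathname2url(relative_path).lstrip("/")
    return siteurl.rstrip("/") + "/" + url_path


print(to_url_sketch("https://blog.mafr.de", "first-article.html"))
```

Since only the saved path is consulted, the .html suffix in the file name carries straight over into the URL, whatever ARTICLE_URL says.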

mrugges commented 3 months ago

I was able to fix the article URLs in the sitemap by using:

ARTICLE_URL = '{slug}/'
ARTICLE_SAVE_AS = '{slug}/index.html'

However, this does not work for tags, categories, or authors; those still show up with .html in the sitemap and are reported as broken in Google Search Console. Any advice would be appreciated!

kernc commented 3 months ago

@mrugges Tags, categories, authors have separate *_URL and *_SAVE_AS!
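A sketch of what that could look like in pelicanconf.py, assuming the same directory-style pattern that worked for articles. The tag/, category/, and author/ prefixes are illustrative choices, and the fragment has not been tested against the sites discussed here.

```python
# pelicanconf.py fragment (sketch): one *_URL / *_SAVE_AS pair per content type.
# The path prefixes are assumptions; adjust them to match the site's layout.
TAG_URL = "tag/{slug}/"
TAG_SAVE_AS = "tag/{slug}/index.html"

CATEGORY_URL = "category/{slug}/"
CATEGORY_SAVE_AS = "category/{slug}/index.html"

AUTHOR_URL = "author/{slug}/"
AUTHOR_SAVE_AS = "author/{slug}/index.html"
```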


in which the pathname2url function is used to infer the URL from the file path

That's correct. The to_url() function can easily be amended, but to what?

if not ARTICLE_URL.endswith('/') and path.endswith('.html'):
    path = path.removesuffix('.html')

???
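For anyone who wants to experiment with that idea, here is a self-contained version of the same heuristic. The function name strip_html_suffix and its parameters are hypothetical, and this is only a sketch of the proposal above, not a change that exists in the plugin.

```python
def strip_html_suffix(path, url_setting):
    """Hypothetical helper: drop '.html' when the URL setting has no trailing slash."""
    if not url_setting.endswith("/") and path.endswith(".html"):
        return path.removesuffix(".html")  # str.removesuffix needs Python 3.9+
    return path


# Suffix-less URL setting: the '.html' is dropped.
print(strip_html_suffix("first-article.html", "{slug}"))
# Directory-style URL setting: the path is left untouched.
print(strip_html_suffix("first-article/index.html", "{slug}/"))
```

One caveat: a URL setting such as '{slug}.html' does not end with a slash either, so this check would strip the suffix there too, even though the links really do end in .html. That ambiguity is probably part of why the right amendment is not obvious.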