Open mafr opened 6 months ago
Hi Matthias. I can't reproduce this. While I normally use a trailing slash in these cases (e.g., {slug}/), even when I set PAGE_URL to {slug}, there are no .html suffixes appended in the generated sitemap.
I suggest using the pelican-quickstart command to generate a new site with completely stock settings, theme, etc. Then add the plugin to that site and see if you can reproduce the issue you are seeing. If you can't reproduce it, then you know something else is awry in your other site configuration/environment.
Hi Justin, thank you so much for your response! I was able to reproduce this in a minimal setup: https://github.com/mafr/sitemap-bug
This is right out of pelican-quickstart, I only changed ARTICLE_URL and added one article. The generated sitemap contains .html suffixes, e.g. <loc>https://blog.mafr.de/first-article.html</loc>, while the rendered index.html contains a link without the suffix.
Thank you for taking the time to post your test site in a visible location. That allowed me to see where your configuration differs from mine.
Try adding the following line to the end of your configuration file:
ARTICLE_SAVE_AS = '{slug}/index.html'
The *_URL settings dictate how links are generated, whereas the *_SAVE_AS settings control the location and file names of the generated files. The Sitemap plugin relies on the latter when determining what should appear within the <loc> … </loc> tags.
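To make the distinction concrete, here is a minimal sketch of how the two settings diverge for one article. It uses plain str.format as a stand-in for Pelican's own expansion, and it assumes the stock ARTICLE_SAVE_AS value is '{slug}.html'; the slug is taken from the example above.

```python
# Illustration only: str.format stands in for Pelican's own template expansion.
slug = "first-article"

ARTICLE_URL = "{slug}"           # controls the links written into templates
ARTICLE_SAVE_AS = "{slug}.html"  # assumed stock default; controls the output file

link_in_index = ARTICLE_URL.format(slug=slug)     # what index.html links to
written_file = ARTICLE_SAVE_AS.format(slug=slug)  # what ends up on disk

print(link_in_index)
print(written_file)
```

Since the plugin derives each <loc> entry from the written file, only changing ARTICLE_URL leaves the suffix in the sitemap.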
Much appreciated! The winning combination is these settings:
ARTICLE_URL = '{slug}/'
ARTICLE_SAVE_AS = '{slug}/index.html'
That means in order to get a working sitemap on Cloudflare Pages, I have to override all the *_URL and *_SAVE_AS settings and change my relative links. Some files are theme-dependent and can't always be influenced (authors.html etc.), but I can work around that by creating my own theme.
Maybe it would be better if the plugin used the *_URL settings -- after all, the sitemap URLs are links. That would simplify things for users and prevent us from breaking the sitemap without noticing.
I believe the aforementioned behavior was introduced as part of the refactoring endeavor in #3, in which the pathname2url function is used to infer the URL from the file path. As for why, I imagine that's answered in the first bullet point in that PR:
Act on every content_written signal to avoid guessing what pages to cover.
… but perhaps @kernc may be able to shed more light on the subject?
I was able to fix the article URLs in the sitemap by using:
ARTICLE_URL = '{slug}/'
ARTICLE_SAVE_AS = '{slug}/index.html'
However, this does not work for tags, categories, and authors. Those still show up with an .html suffix in the sitemap and are reported as broken in Google Search Console. Any advice would be appreciated!
@mrugges Tags, categories, authors have separate *_URL and *_SAVE_AS!
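For illustration, a configuration fragment along these lines would apply the same pattern to those page types. The setting names are Pelican's per-tag, per-category and per-author settings; the tag/, category/ and author/ prefixes are an assumption here and should match whatever layout your site already uses.

```python
# pelicanconf.py (fragment): a sketch, not a drop-in configuration.
# The directory prefixes below are assumptions; adjust them to your site.
TAG_URL = "tag/{slug}/"
TAG_SAVE_AS = "tag/{slug}/index.html"

CATEGORY_URL = "category/{slug}/"
CATEGORY_SAVE_AS = "category/{slug}/index.html"

AUTHOR_URL = "author/{slug}/"
AUTHOR_SAVE_AS = "author/{slug}/index.html"
```

Each *_SAVE_AS value is its *_URL counterpart plus index.html, so the link and the file on disk stay in step.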
in which the pathname2url function is used to infer the URL from the file path
That's correct. The to_url() can easily be amended, but to what?
if not ARTICLE_URL.endswith('/') and path.endswith('.html'):
    path = path.removesuffix('.html')
???
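One possible shape for such an amendment, purely as a sketch: the helper name strip_html_suffix and its url_setting parameter are hypothetical and not part of the plugin, and the rule for index.html paths is an assumption about the desired output rather than a description of what the plugin does today.

```python
def strip_html_suffix(path: str, url_setting: str) -> str:
    """Hypothetical helper: map an output file path to a sitemap URL path.

    `path` is the file path relative to the output directory and
    `url_setting` is the matching *_URL value, e.g. '{slug}' or '{slug}/'.
    """
    # Directory-style pages: 'foo/index.html' -> 'foo/' (assumed desired form).
    if path == "index.html":
        return ""
    if path.endswith("/index.html"):
        return path[: -len("index.html")]
    # Extension-less URL setting such as '{slug}': 'foo.html' -> 'foo'.
    if not url_setting.endswith(("/", ".html")) and path.endswith(".html"):
        return path[: -len(".html")]
    # Otherwise leave the path alone, e.g. when the URL setting keeps '.html'.
    return path
```

This still guesses from the file name, which is the open question here: a site that mixes suffixed and suffix-less URL settings would need the setting for each page type passed in.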
When changing ARTICLE_URL or PAGE_URL to {slug} in Pelican's configuration, the generated sitemap still contains URLs with an .html suffix.
Some background: When hosting a static site on Cloudflare Pages, the platform automatically adds redirects from /some-page.html to /some-page. Even though the site technically works with {slug}.html, Google's Search Console rejects all the URLs in the sitemap that return redirects.
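A quick way to catch this before Search Console does is to scan the generated sitemap for <loc> entries that still end in .html. The following is a minimal sketch using only the standard library; the function name html_suffixed_locs is made up for this example, and it assumes the sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (assumed for the generated output).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def html_suffixed_locs(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL in the sitemap that ends in '.html'."""
    root = ET.fromstring(sitemap_xml)
    locs = [
        el.text.strip()
        for el in root.iterfind(".//sm:loc", SITEMAP_NS)
        if el.text
    ]
    return [url for url in locs if url.endswith(".html")]
```

Running it over the contents of the generated sitemap file and failing the build when the list is non-empty would prevent breaking the sitemap without noticing.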