statiqdev / Statiq.Framework

A flexible and extensible static content generation framework for .NET.
https://statiq.dev/framework
MIT License

Better support for trailing slash in links and custom link generation #218

Closed daveaglick closed 2 years ago

daveaglick commented 2 years ago

Discussed in https://github.com/statiqdev/Discussions/discussions/107

Originally posted by **zivkan** November 27, 2021

I decided on using `index.md` for my blog posts (and naming the folder the title of my blog post), so any post-specific content (images, etc.) can stay in the same directory as the blog post. For example, I have URLs such as:

https://domain.test/blog/my-first-post/ (this is index.html)
https://domain.test/blog/my-first-post/image.png

However, the document believes its canonical URL does not contain the trailing slash. Therefore, the canonical URL in the HTML header, the URL in the ATOM and RSS feeds, and links from the homepage all point to the "no slash" URL, which the web server then redirects to the slash.

How do I configure statiq to generate URLs for documents that retain the trailing `/` when the document is otherwise ending with `index.html`?

edit: I just noticed on Google Webmaster Tools that I have a whole bunch of excluded pages of type "Alternative page with proper canonical tag". It appears that my pages are still getting indexed, but when I see errors or excluded pages, it makes me a bit worried, so I really would like to fix this.
daveaglick commented 2 years ago

Interesting question! I think canonically a trailing slash indicates the directory (i.e. the index.html file, which is how you're using it). In general, I usually see static hosts "hide" the index page so that URLs appear to be rooted at the parent folder path. This can actually cause some problems of its own when file names and folder names collide. But it sounds like your issue is different.

> How do I configure statiq to generate URLs for document that retain the trailing / when the document is otherwise ending with index.html?

This is probably the heart of the question. Statiq contains a setting LinkHideIndexPages that toggles the hiding of the entire index.html file (going up to the parent folder without a slash), but there's no built-in way to both hide the index page and keep the trailing slash.

I see two options (or both) going forward:

I don't think either one will be too complicated, assuming I'm understanding the requirement correctly. Let me see what I can do.

zivkan commented 2 years ago

In case it helps, here's the "high level" (customer requirement) of why I'm trying to use statiq in the way that I am:

I was trying to decide where to keep assets (images, page-specific css/js, etc.) in the filesystem. So, rather than having:

input\blog\post1.md
input\blog\post2.md
input\images\post1-screenshot.png

I prefer to use

input\blog\post1\index.md
input\blog\post1\screenshot.png
input\blog\post2\index.md

This way "everything" related to a single post is in a single directory.

Now when linking images, unless I use the absolute URL, the image URLs are relative. Therefore, it really matters whether the web server ends up using a trailing slash or not, as a relative URL that works in one scenario doesn't work in the other. For example, <img src="screenshot.png" /> isn't going to work if the web server returns /index.html, rather than an HTTP redirect, when the browser URL doesn't contain the trailing slash.
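To make the relative URL problem concrete, here is a small illustration in Python (not Statiq code) of how a relative `src` resolves against the two forms of the page URL. The page URLs reuse the example domain from the original post; the `resolve_image` helper is a hypothetical name introduced only for this sketch. `urljoin` follows the same RFC 3986 resolution rules a browser applies.

```python
from urllib.parse import urljoin

# Example page URLs from the original post, with and without the trailing slash.
WITH_SLASH = "https://domain.test/blog/my-first-post/"
WITHOUT_SLASH = "https://domain.test/blog/my-first-post"


def resolve_image(page_url: str, src: str = "screenshot.png") -> str:
    """Resolve a relative src against the page URL (RFC 3986 rules)."""
    return urljoin(page_url, src)


# With the slash, the last segment is treated as a directory and is kept.
print(resolve_image(WITH_SLASH))
# -> https://domain.test/blog/my-first-post/screenshot.png

# Without the slash, the last segment is treated as a file and is replaced.
print(resolve_image(WITHOUT_SLASH))
# -> https://domain.test/blog/screenshot.png
```

The second result points one level too high, which is why the image only loads when the page is served from the URL with the trailing slash.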

Finally, as mentioned in the edit of my original post, web crawlers don't appear to be super happy with web pages where the canonical URL redirects to another URL. Fortunately, it doesn't seem to cause problems, but it does spam the warning list, making it harder to find other warnings.

FWIW, I tried reconfiguring my web server to stop redirecting to a trailing slash, but that made relative image URLs less nice, so I decided I like the trailing slash.

daveaglick commented 2 years ago

Thanks for the additional details. This is turning out to be a little more challenging than I thought at first.

At issue is that the LinkGenerator takes a NormalizedPath, and NormalizedPath is designed to strip off trailing slashes before we even get to the link generation. This mostly makes sense given that a "path" wouldn't normally have a trailing slash, but URLs are a special thing. It's also not entirely relevant for preserving/adding a trailing slash when hiding known files like index.html, but is important for a holistic approach to trailing slashes in the link generator. I think this has revealed a bigger problem in that Statiq just doesn't like trailing slashes at all in the link generator. That's an issue since trailing slashes are definitely a thing in URLs and should be supported.
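The effect described here can be seen with any file-system style path type. The sketch below uses Python's `PurePosixPath` as a stand-in for that kind of normalization; it is an analogy only and says nothing about how NormalizedPath is implemented. The `normalize` helper is a hypothetical name used for this illustration.

```python
from pathlib import PurePosixPath


def normalize(path: str) -> str:
    """Normalize a path as a file-system style path type would."""
    return str(PurePosixPath(path))


directory_style = normalize("/blog/my-first-post/")
file_style = normalize("/blog/my-first-post")

# Both inputs collapse to the same value, so code that only receives the
# normalized path cannot tell whether the caller wanted a trailing slash.
print(directory_style)                 # -> /blog/my-first-post
print(directory_style == file_style)   # -> True
```

Once the slash is gone at this stage, no later step can restore it without extra information, which is the argument for passing the link generator a plain string instead.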

Not quite sure how I'm going to resolve it yet, though it'll probably be fairly low-level. I might end up having the link generator take a string for the path instead of a NormalizedPath - that probably makes the most sense given that a "path" in a URL is not the same as a file system style path that NormalizedPath is intended to represent. Just thinking out loud at this point...

daveaglick commented 2 years ago

This is now done and will ship with the next release. It's exposed as a global setting via Keys.LinkHiddenPageTrailingSlash and through GetLink() as a parameter and LinkGeneratorShortcode as a shortcode argument.
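For readers who want to see how the two settings might interact, here is a minimal sketch in Python. It is not Statiq's implementation: the `get_link` function, its parameter names (chosen to echo LinkHideIndexPages and LinkHiddenPageTrailingSlash), and the list of index file names are all assumptions made for illustration.

```python
# Assumed set of file names treated as index pages (hypothetical).
INDEX_FILE_NAMES = ("index.html", "index.htm")


def get_link(
    path: str,
    hide_index_pages: bool = True,
    hidden_page_trailing_slash: bool = False,
) -> str:
    """Build a root-relative link from an output path (illustrative only)."""
    segments = [s for s in path.split("/") if s]
    if hide_index_pages and segments and segments[-1] in INDEX_FILE_NAMES:
        # Drop the index file so the link points at its parent folder.
        segments.pop()
        link = "/" + "/".join(segments)
        # Optionally mark the link as a directory with a trailing slash.
        if hidden_page_trailing_slash and not link.endswith("/"):
            link += "/"
        return link
    return "/" + "/".join(segments)


print(get_link("blog/my-first-post/index.html"))
# -> /blog/my-first-post
print(get_link("blog/my-first-post/index.html", hidden_page_trailing_slash=True))
# -> /blog/my-first-post/
print(get_link("blog/my-first-post/index.html", hide_index_pages=False))
# -> /blog/my-first-post/index.html
```

In this sketch the trailing slash option only takes effect when an index page was actually hidden, so links to ordinary files are left unchanged.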