statiqdev / Statiq.Web

Statiq Web is a flexible static site generator written in .NET.
https://statiq.dev/web
Other
1.64k stars 234 forks source link

Prevent Data documents being included by default in Sitemaps #969

Closed Turnerj closed 2 years ago

Turnerj commented 2 years ago

I know this might seem a little extreme and maybe I'm missing something but it doesn't make sense to me why data files would be in a sitemap.

The core thing here is that sitemaps help search engines and crawlers to find web pages. While you might access the data via JS, search engines don't really care about. For example, the statiq.dev site itself exposes some YAML files via the sitemap file and they aren't even accessible. (eg. https://www.statiq.dev/news/topics/release.yml )

I first discovered this by accident when a tool I use reported that there were YAML files linked to on my site. I thought I broke something but the default behaviour seemed to include the files. While I disabled them individually in my case, it only dawned on me now that it didn't make sense to include them in the first place.


I could have it that it stays but the defaults are false just in case someone really wants it in their sitemap but I really don't see the point. Crawlers aren't exactly wanting to stumble across a YAML file.

daveaglick commented 2 years ago

I like what you're thinking here and totally agree it makes sense to exclude data files from the sitemap. But on the other hand, I think I still want to preserve the option to include them if wanted (who knows, might be a use case where that's helpful). Since the pipeline code here is checking WebKeys.IncludeInSitemap I wonder if a better approach that preserves the ability to toggle would be to set IncludeInSitemap to false by default for data documents. Then if someone wanted to they could toggle it back on per-document, per-folder (using folder metadata), or globally (using settings).

The trick here would be to remove the true default value in the filter, something like:

new ConcatDocuments(nameof(Data))
{
    new FilterDocuments(
        Config.FromDocument(WebKeys.ShouldOutput, true).CombineWith(
        Config.FromDocument(WebKeys.IncludeInSitemap)))
}

That way, for data documents if IncludeInSitemap is undefined (which it is by default), the value will be false and the CombineWith call will result in false and the document won't pass the filter.

Thoughts?

Turnerj commented 2 years ago

I like what you're thinking here and totally agree it makes sense to exclude data files from the sitemap.

Wooo!

But on the other hand, I think I still want to preserve the option to include them if wanted (who knows, might be a use case where that's helpful). Since the pipeline code here is checking WebKeys.IncludeInSitemap I wonder if a better approach that preserves the ability to toggle would be to set IncludeInSitemap to false by default for data documents. Then if someone wanted to they could toggle it back on per-document, per-folder (using folder metadata), or globally (using settings).

I thought that might be the case - I've updated the PR to default IncludeInSitemap to false.

With your specific example, I couldn't write that due to some signature conflicts with the code (even when I specified the generic parameter) but switching true for false makes the code compile.

My main reason for going for a removal first was primarily to reduce the maintenance burden as it seemed like a feature that few (possibly any) would want.

Anyway, I've reverted the original change, pushed a new commit and will update the PR title to reflect the change.

daveaglick commented 2 years ago

Perfect - this is a great quality of life improvement, thanks!