statiqdev / Statiq.Framework

A flexible and extensible static content generation framework for .NET.
https://statiq.dev/framework
MIT License

Front Matter Extraction "Wrong" #264

Open NikoMix opened 1 year ago

NikoMix commented 1 year ago

This is partly a bug, partly a discussion about what the intended behavior should be.

The issue: I have a Markdown file with a "----" somewhere in the middle of the document that never gets terminated. It's "just" a logical separation of content, not at all affiliated with YAML syntax. However, the ExtractFrontMatter module kicks in and starts doing its thing, handing over to Statiq.Yaml, which crashes at ParseYaml with "Did not find expected < document end >.". I understand this "issue" is within the YamlDotNet library, which shouldn't be called in the first place because the content is Markdown, not YAML.

Technically there should not be any restriction, and so no content should be omitted from the Markdown just because "---" appears somewhere in the middle of it. Jekyll's description says "[...] The front matter must be the first thing [...]", so before proposing a fix I'd like to understand what the intended behavior should be, as this would be a potentially breaking change.

In the unit test project, however, I find many tests that expect front matter to be found in the middle of the document, which modifies the output. Hence the confusion.

daveaglick commented 1 year ago

This question gets really tricky and it's one of the areas I've probably spent the most time going back and forth on. Because there are no standards regarding front matter delimiting, all we're left with is the conventions that other generators follow. I've generally found two patterns for generators that accept YAML front matter. They both include a trailing ---, but the preceding first-line --- is a little less agreed on: some generators like Jekyll require it and others don't. I erred on the side of compatibility, so Statiq supports both styles, though as you noted that means a single-line --- elsewhere will mean "everything above this is front matter," even if that's not the intent. The Statiq case is even trickier because Statiq supports any front matter format in theory, and front matter in any kind of file, so the delimiter style has to be pluggable (e.g., JSON front matter in a C# script is going to require a totally different kind of delimiting).

All this is to say you've found a known edge case when a front matter delimiter is used further down in a file. This is indeed different than Jekyll, but intentionally so (while Statiq aims for some measure of compatibility to make porting easier, it's not a "Jekyll-compatible" generator and that's not a goal of the project).

The easiest way to handle this situation, if you know you're always going to be using Jekyll-style front matter delimiters that include a first-line ---, is to modify the FrontMatterRegexes setting so that a first-line delimiter is required.

The default FrontMatterRegexes setting includes this regex: \A(?:^\r*-+[^\S\n]*$\r?\n)?(.*?)(?:^\r*-+[^\S\n]*$(\r?\n)?) which you can see matches both with and without the initial ---:

image

image

So if you want to only match when a starting --- is present, you can adjust the regex to \A(?:^\r*-+[^\S\n]*$\r?\n)(.*?)(?:^\r*-+[^\S\n]*$(\r?\n)?):

image

image

image

The setting can be changed like this:

await Bootstrapper.Factory
    .CreateDefault(args)
    .AddSetting(
        WebKeys.FrontMatterRegexes,
        new[] { @"\A(?:^\r*-+[^\S\n]*$\r?\n)(.*?)(?:^\r*-+[^\S\n]*$(\r?\n)?)" })
    // ...
    .RunAsync();
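
As a rough way to check this locally, the following console sketch runs both patterns against the kind of file described in this issue. The class name FrontMatterRegexDemo, the TryExtract helper, and the sample text are made up for illustration, and the RegexOptions used are an assumption about how the patterns are meant to be evaluated, not something taken from the Statiq source.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FrontMatterRegexDemo
{
    // Default pattern quoted above: the leading delimiter line is optional.
    public const string OptionalLeading =
        @"\A(?:^\r*-+[^\S\n]*$\r?\n)?(.*?)(?:^\r*-+[^\S\n]*$(\r?\n)?)";

    // Adjusted pattern quoted above: the leading delimiter line is required.
    public const string RequiredLeading =
        @"\A(?:^\r*-+[^\S\n]*$\r?\n)(.*?)(?:^\r*-+[^\S\n]*$(\r?\n)?)";

    // Assumption: ^ and $ must act as line anchors (Multiline) and "." must
    // cross newlines (Singleline) for these patterns to behave as described.
    private const RegexOptions Options =
        RegexOptions.Multiline | RegexOptions.Singleline;

    // Hypothetical helper: reports whether the pattern matches and, if so,
    // returns capture group 1 (the text that would be treated as front matter).
    public static bool TryExtract(string pattern, string input, out string frontMatter)
    {
        Match match = Regex.Match(input, pattern, Options);
        frontMatter = match.Success ? match.Groups[1].Value : string.Empty;
        return match.Success;
    }

    public static void Main()
    {
        // Hypothetical sample: Markdown with no front matter and a "----"
        // separator in the middle of the document.
        string markdown = "# Heading\n\nIntro paragraph.\n\n----\n\nMore content.\n";

        foreach (string pattern in new[] { OptionalLeading, RequiredLeading })
        {
            bool matched = TryExtract(pattern, markdown, out string frontMatter);
            string label = pattern == OptionalLeading
                ? "optional leading delimiter"
                : "required leading delimiter";
            Console.WriteLine(
                $"{label}: matched={matched}, captured {frontMatter.Length} characters");
        }
    }
}
```

Under those assumptions, the default pattern should capture everything above the ---- line as front matter, while the stricter pattern should not match at all, leaving the Markdown untouched.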

Let me know if that resolves the issue for you.