Open nathanlesage opened 1 year ago
So first, for the code sandbox:
Frontmatter starts at the first line. You can test this on GH: adding blank lines before the YAML doesn’t work. Start with:
const sourceMarkdown = `---
…and it works.
It breaks when that said YAML frontmatter uses Jekyll-style delimiters (three dashes both on top and on the bottom of the frontmatter)
What doesn’t work? I believe it works. That’s the default. That’s what 1.6m people per month use the plugin for. And it’s well tested
Here’s a reduced test case:
import {fromMarkdown} from 'mdast-util-from-markdown'
import {frontmatter} from 'micromark-extension-frontmatter'
import {frontmatterFromMarkdown} from 'mdast-util-frontmatter'
const sourceMarkdown = `---
title: a
* b
`;
const tree = fromMarkdown(sourceMarkdown, {
extensions: [frontmatter()],
mdastExtensions: [frontmatterFromMarkdown()]
})
console.log(tree) // Contains a paragraph with `* b` as text, not a list.
The reason is that this looks a lot like frontmatter. So it continues on checking all those lines for the ending. When doing that, it prevents the lists from being checked. Afterwards, it can’t do those lists again.
Hi! This was marked as ready to be worked on! Note that while this is ready to be worked on, nothing is said about priority: it may take a while for this to be solved.
Is this something you can and want to work on?
Team: please use the area/*
(to describe the scope of the change), platform/*
(if this is related to a specific one), and semver/*
and type/*
labels to annotate this. If this is first-timers friendly, add good first issue
and if this could use help, add help wanted
.
Thank you for your swift reply! You are right that I accidentally added a newline there. However, when I remove that, the list is still not properly being detected. Here's a screenshot from the updated sandbox:
What doesn’t work? I believe it works. That’s the default. That’s what 1.6m people per month use the plugin for. And it’s well tested
I am sorry if I came across as ungrateful or anything — I am a big fan of your work and appreciate everything you do. I did some extensive checking this morning to find the bug, and ended up finding this oddity. I am, however, not sure what kind of interaction effect is (I believe that each plugin itself works fine, absolutely, but since we have this weird bug, I suspect some interaction – that's also why I'm not sure if this emanates from here, or from the remark
plugin that wraps this one…? I am thankful for any guidance in this regard.)
Regarding your reduced case, what is the intention in leaving out the ending fence of the frontmatter here? In other words, why did you not write the following?
---
title: a
---
* b
One additional thing I noted, but which I believe can be fixed when this bug is sorted out, is that if the list is not being properly detected, the parser "eats" the indentation spaces in front of the second-order list item. You may note that there is this worngly spelled word
in the indentation. The reason is that I'm using the syntax tree to run a spell checker, and I noted that the spaces in front of the list just get swallowed, shifting the offsets of the text nodes into a wrong position by that amount. I did not initially report that because I believe it is just a fallout effect of whatever this is.
I greatly appreciate any guidance in this regard. Thank you!
am sorry if I came across as ungrateful or anything
No need, no worries, you didn’t come across as such. See my later response for a reduced case. You are correct, there is a bug.
Regarding your reduced case, what is the intention in leaving out the ending fence of the frontmatter here? In other words, why did you not write the following?
Because that’s exactly what’s going on here: if an opening but no closing fence is found, lists don’t work. You had ---
and ...
in your first case. Which, when starting with ---
but without an ending of ...
, shows that exact bug. But your second case (start and end ---
) masked what was happening. Because some frontmatter was parsed.
For the second bug, whitespace being eaten, I think it’s unrelated to this, probably. It’s because whitespace in paragraphs is removed. E.g.:
asd
whatever
That positional info might be lost somewhere though. As the text node is asd\nwhatever
for that. But also this markdown would result in the exact same text node:
> * asd
> whatever
Maybe some positional info is lost somewhere
Ahhhh, I see what you mean, yes! Thank you for the clarification!
Hey @wooorm, just as an update: I have now figured out that the fault in the options also prevents the Gfm parser (remark-gfm
) from detecting footnotes (possibly other elements as well). It works fine if I just comment out the corresponding config option { type: 'yaml', fence: { open: '---', close: '...' } }
.
Yep, block quotes, lists, and footnote definitions! All “containers” that fail on this :( Will be maybe a bit less than 2 months before I can get to it unfortunately.
Hey, thanks for your quick reply! I can absolutely understand if you can't spend time on it right now. I would be happy to prepare a PR, since I think I found the cause, but it would be ideal if you could point me to a few resources so that I can make the fix work properly and stick to the unified conventions.
Here's what I found:
Because both frontmatters (The Pandoc style – --- / ...
– and the jekyll-style – --- / ---
) start with the same character code (45), the tokenizer layout looks like this:
{
flow: {
'45': [
tokenizer, // Parses --- / ...
tokenizer // Parses --- / ---
]
}
}
This makes the engine go through the document twice, once for the --- / ...
frontmatter and then for the --- / ---
frontmatter type. The first cannot find the ...
ending fence and hence parses the full document as a frontmatter value, thereby disturbing the other parsers.
There are two possible solutions:
If the engine allows one to abort parsing anything if the given frontmatter type is not detected, then we could add a break condition so that the parser that doesn't fit does not attempt to detect anything.
The other solution would be to merge the openings and closings, i.e., allow for an array of strings in the open/close options that configure the plugin. I've seen that the buffer
variable is being used to detect a frontmatter fence, I could rewrite the code so that it accepts an array and just tests for every possibility. Although, since the plugins themselves actually work perfectly fine, if I just add a single frontmatter definition, it might be much cleaner to add the aforementioned "break condition".
What are your thoughts? I can get a PR ready for review today so that it's fixed!
See my small reproduction above: https://github.com/micromark/micromark-extension-frontmatter/issues/5#issuecomment-1381972189. This issue exists for normal frontmatter. A single format. ---\n* a
.
I think it’s going to get super complex: the engine does not really support aborting. Because normal markdown doesn’t have that: once there’s a line that opens fenced code, anything after it is fenced code, for example.
And you can’t have a list item in those.
Same here. The YAML is open. You can’t have a list item in it.
But it never closed. And we don’t have list items.
So micromark
needs an architectural overhaul to solve.
Ouh … alright. This explains why I wasn't getting anywhere with my attempts at fixing in this regard … :/
Initial checklist
Affected packages and versions
micromark-extension-frontmatter v1.0.0
Link to runnable example
https://codesandbox.io/s/hopeful-matan-nybskz?file=%2Fsrc%2Findex.js
Steps to reproduce
I have a workload where I need to transform Markdown documents that can also include frontmatters to a syntax tree, using
parse()
on the remark processor.It breaks when that said YAML frontmatter uses Jekyll-style delimiters (three dashes both on top and on the bottom of the frontmatter). It works, however, when I use Pandoc-style delimiters (three dashes on top, three dots on the bottom). Usage of the plugin then disables the processor's ability to detect lists properly.
Here are the steps to reproduce. Above I have linked a sandbox that is already properly set up. In there, simply exchange the end-fence for the frontmatter between dashes and dots and observe the difference in produced syntax trees.
Processor setup
Note that while I am being verbose in explicitly stating the delimiters, exchanging the second definition to simply
yaml
does not change the effect.Test document
Use this document and run it through said pipeline. You will notice that, when you use three dashes to end the frontmatter, the list is not correctly detected, whereas, when you exchange that with three dots (Pandoc-style frontmatter), it will be correctly detected.
Expected behavior
Correct behavior with Pandoc-style frontmatter
If you exchange the three dashes with three dots in the test-document, it works as expected.
This is the correct Syntax Tree:
Actual behavior
Incorrect behavior with Jekyll-style frontmatter
If you just run the above test document (with three dashes), it produces a wrong syntax tree. Observe how it does not detect the list properly.
Runtime
Node v16, Other (please specify in steps to reproduce)
Package manager
yarn v1
OS
macOS
Build and bundle tools
Webpack