remarkjs / remark

markdown processor powered by plugins part of the @unifiedjs collective
https://remark.js.org
MIT License
7.66k stars 358 forks source link

[remark-parse] Ordered lists are not recognized if they both use leading zeroes and interrupt a block #1242

Closed benblank closed 1 year ago

benblank commented 1 year ago

Initial checklist

Affected packages and versions

remark-parse@11.0.0

Link to runnable example

No response

Steps to reproduce

  1. In a new folder, create a new Node module by running e.g. pnpm init.

  2. Run pnpm install remark-parse@11.0.0.

  3. Run pnpm install unified@11.0.3.

  4. Save the code below as repro.mjs file and run node repro.mjs. (I used Node v18.17.0.)

    This will generate a JSON file containing the parsed AST (sans position properties, so that they can be easily diffed) for each of the Markdown snippets it contains.

    repro.mjs ``` js import { writeFile } from "node:fs/promises"; import remarkParse from "remark-parse"; import { unified } from "unified"; const parser = unified().use(remarkParse).freeze(); const documents = { noLeadingZeroesFollowing: `The preceeding paragraph. 1. one 4. two `, leadingZeroesFollowing: `The preceeding paragraph. 01. one 02. two `, noLeadingZeroesInterrupting: `The preceeding paragraph. 1. one 2. two `, leadingZeroesInterrupting: `The preceeding paragraph. 01. one 02. two `, }; function stripPositions(node) { const { position, children, ...rest } = node; return { ...rest, children: children?.map(stripPositions) }; } await Promise.all( Object.entries(documents).map(([name, text]) => writeFile( name + ".json", JSON.stringify(stripPositions(parser.parse(text)), undefined, 2), ), ), ); ```
  5. Observe that the files noLeadingZeroesFollowing.json, leadingZeroesFollowing.json, and noLeadingZeroesInterrupting.json are identical and that their root nodes contain both a paragraph node and a list node. However, the root node in leadingZeroesInterrupting.json instead contains only a single paragraph node. Diffing it against any of the other files will produce output similar to the following.

    repro.diff ``` diff --- noLeadingZeroesFollowing.json 2023-10-13 16:11:26.261286672 -0700 +++ leadingZeroesInterrupting.json 2023-10-13 16:11:26.261286672 -0700 @@ -6,47 +6,7 @@ "children": [ { "type": "text", - "value": "The preceeding paragraph." - } - ] - }, - { - "type": "list", - "ordered": true, - "start": 1, - "spread": false, - "children": [ - { - "type": "listItem", - "spread": false, - "checked": null, - "children": [ - { - "type": "paragraph", - "children": [ - { - "type": "text", - "value": "one" - } - ] - } - ] - }, - { - "type": "listItem", - "spread": false, - "checked": null, - "children": [ - { - "type": "paragraph", - "children": [ - { - "type": "text", - "value": "two" - } - ] - } - ] + "value": "The preceeding paragraph.\n01. one\n02. two" } ] } ```

Expected behavior

Ordered lists should be parsed consistently, regardless of whether their list markers have leading zeroes or the list interrupts a block.

Actual behavior

Ordered lists are recognized as such if their list markers have leading zeroes or they interrupt a block. However, ordered lists are not recognized as such if their list markers have leading zeroes and they interrupt a block.

Runtime

Other (please specify in steps to reproduce)

Package manager

pnpm

OS

Linux

Build and bundle tools

Other (please specify in steps to reproduce)

benblank commented 1 year ago

Apologies for not providing a runnable example, but I spent more time trying (and failing) to get codesandbox to do something useful than I did on the rest of the report. 😅

ChristianMurphy commented 1 year ago

Thanks @benblank! Here is the repro in a sandbox https://stackblitz.com/edit/node-mneiet?file=index.js I'm seeing the same behavior you describe when running remark 15.0.1

Checking the four examples in CommonMark Dingus

  1. https://spec.commonmark.org/dingus/?text=The%20preceeding%20paragraph.%0A%0A1.%20one%0A4.%20two
  2. https://spec.commonmark.org/dingus/?text=The%20preceeding%20paragraph.%0A%0A01.%20one%0A02.%20two
  3. https://spec.commonmark.org/dingus/?text=The%20preceeding%20paragraph.%0A1.%20one%0A2.%20two
  4. https://spec.commonmark.org/dingus/?text=The%20preceeding%20paragraph.%0A01.%20one%0A02.%20two

It does indeed appear all four should produce a list


Tracing further. I suspect the issue is down one level in micromark, I'm able to replicate the issue without having the AST generated https://stackblitz.com/edit/node-1ygk3h?file=index.js

github-actions[bot] commented 1 year ago

Hi! This was marked as ready to be worked on! Note that while this is ready to be worked on, nothing is said about priority: it may take a while for this to be solved.

Is this something you can and want to work on?

Team: please use the area/* (to describe the scope of the change), platform/* (if this is related to a specific one), and semver/* and type/* labels to annotate this. If this is first-timers friendly, add good first issue and if this could use help, add help wanted.

benblank commented 1 year ago

I suspect the issue is down one level in micromark, I'm able to replicate the issue without having the AST generated

Ah! Dang. I'd traced it this far down from Prettier and thought I'd gotten to the bottom of it. 🙂

Thanks for all the helpful links!

wooorm commented 1 year ago

I do think the spec is unclear for this:

In order to solve of unwanted lists in paragraphs with hard-wrapped numerals, we allow only lists starting with `1` to interrupt paragraphs. Thus,~

(right above example 304). As in, I followed those words here.

I think that the current behavior is in line with the reasoning there. Natural language phrases might include 1., but 2. or 01. are more unlikely.

wooorm commented 1 year ago

If you care strongly about this, could you perhaps open an issue with commonmark/commonmark-spec to check what the idea is?

benblank commented 1 year ago

Actually, I missed that when I was reading through the spec. I'm not sure I 100% agree with the reasoning behind it, but those reasons do at least appear to be pretty clear.

I may indeed open up an issue with regards to the phasing, though; I feel the section you quoted would be improved by calling out that it's only referring to ordered lists and to the markers 1. and 1) (not the character 1), even if there are examples demonstrating both cases. The emphasis on the principle of uniformity also suggests that the exception applies to nested lists as well, but I don't see text or an example calling that out.

I also have to admit to being a bit surprised to see "interrupting, not starting with 1" called out as not being valid, simply because when I was checking BabelMark, a large number of the parsers (including nine of the twelve marked as specifically targeting CommonMark) considered it valid.

On the one hand, it's a shame to "disagree" with so many other implementations, but the spec is clear as to what the Right Thing is, and it isn't what I was trying to do. I'll go ahead and close the issue.

Thanks for taking the time to look into this!

github-actions[bot] commented 1 year ago

Hi! This was closed. Team: If this was fixed, please add phase/solved. Otherwise, please add one of the no/* labels.

github-actions[bot] commented 1 year ago

Hi team! Could you describe why this has been marked as wontfix?

Thanks, — bb

github-actions[bot] commented 1 year ago

Hi team! I don’t know what’s up as there’s no phase label. Please add one so I know where it’s at.

Thanks, — bb

wooorm commented 1 year ago

There’s a wide variety of parser that all do things differently. CM likes to be ambiguous on all the edge cases. This also comes as a given when it’s mostly a test suite of input/output examples, and not an explanation of an algorithm (such as HTML). I’d like a more formal spec. But I can see value in this too. Anyway, feel free to PR to the spec another example of the 01 case. Then I (and others) will go with the one that’s decided for that!