syntax-tree / mdast-util-to-hast

utility to transform mdast to hast
https://unifiedjs.com
MIT License
101 stars 42 forks source link

Improve handling of tight lists #23

Closed Hamms closed 5 years ago

Hamms commented 6 years ago

Context

In markdown, "tight" lists are supposed to render text without paragraph tags:

- item one
- item two

becomes:

<ul>
<li>item one</li>
<li>item two</li>
</ul>

In remark, this usually works, but it works in kind of a weird way. The original text is actually parsed to an MDAST tree that contains paragraph nodes:

root[1] (1:1-2:11, 0-21)
└─ list[2] (1:1-2:11, 0-21) [ordered=false][loose=false]
   ├─ listItem[1] (1:1-1:11, 0-10) [loose=false]
   │  └─ paragraph[1] (1:3-1:11, 2-10)
   │     └─ text: "item one" (1:3-1:11, 2-10)
   └─ listItem[1] (2:1-2:11, 11-21) [loose=false]
      └─ paragraph[1] (2:3-2:11, 13-21)
         └─ text: "item two" (2:3-2:11, 13-21)

And those tags are then removed by mdast-util-to-hast:

https://github.com/syntax-tree/mdast-util-to-hast/blob/1f8591e0b6163c437999e842774f66c9027cfc47/lib/handlers/list-item.js#L18-L24

root[1] (1:1-2:11, 0-21)
└─ element[5] (1:1-2:11, 0-21) [tagName="ul"]
   ├─ text: "\n"
   ├─ element[1] (1:1-1:11, 0-10) [tagName="li"]
   │  └─ text: "item one" (1:3-1:11, 2-10)
   ├─ text: "\n"
   ├─ element[1] (2:1-2:11, 11-21) [tagName="li"]
   │  └─ text: "item two" (2:3-2:11, 13-21)
   └─ text: "\n"

The Problem

This usually results in correct HTML output, even though it's a bit weird to have "paragraph" nodes that don't then become "p" tags. However! This only works if the tight list item has a single child, which means that in cases like

- item one
  - item two

we would expect to render to

<ul>
<li>item one
<ul>
<li>item two</li>
</ul>
</li>
</ul>

but instead get

<ul>
<li>
<p>item1</p>
<ul>
<li>item2</li>
</ul>
</li>
</ul>

Because the node in the intermediate MDAST doesn't have just a single child:

root[1] (1:1-2:13, 0-23)
└─ list[1] (1:1-2:13, 0-23) [ordered=false][loose=false]
   └─ listItem[2] (1:1-2:13, 0-23) [loose=false]
      ├─ paragraph[1] (1:3-1:11, 2-10)
      │  └─ text: "item one" (1:3-1:11, 2-10)
      └─ list[1] (2:3-2:13, 13-23) [ordered=false][loose=false]
         └─ listItem[1] (2:3-2:13, 13-23) [loose=false]
            └─ paragraph[1] (2:5-2:13, 15-23)
               └─ text: "item two" (2:5-2:13, 15-23)

The Solution

There are a couple of potential solutions here; one of them would be to avoid creating those unnecessary "paragraph" nodes in the first place, but another is toupgrade mdast-util-to-hast to be able to deal with this case as well. This latter solution is outlined in this PR.

I still need to fix the tests that are affected by this change and write new ones for this new functionality, but before I invest that time I want to get your thoughts on the validity of pursuing this fix as opposed to the other one.

Hamms commented 6 years ago

@wooorm could I get your thoughts on the relative merits of the fix proposed here vs the possible fix for the same problem at the layer of the parser?

wooorm commented 6 years ago

@Hamms Interesting! I’m just now rewriting the mdast spec, and am finding that tight / loose on lists and on items are pretty messy. They’re idea is that they’re useful for HTML, mostly, but as we’re not really using it here, that’s just weird!

I saw that commonmark does not render the <p> in this case, so I guess that should be removed.

I think that the loose / tight problem should be fixed in remark-parse. I feel that it makes sense to make the logic here simpler, so that we could only check the loose property here

These are just half thoughts, what do based on my ideas?

Hamms commented 6 years ago

Yeah, I wasn't sure how to feel about whether the problem should be fixed at the level of "paragraph" nodes or "p" tags; seems like it comes down to the question of whether "paragraph" nodes have some use beyond just being blindly converted into "p" tags.

Fixing the problem here in the converter is saying that yes they do, and the handler (in this case, mdast-util-to-hast) should also be smart enough to know what tight lists are and know when they should and should not generate "p" tags.

Fixing it in remark-parse would be saying that no they do not; they just exist to be turned into "p" tags and so the parser should be smart enough to know when to and to know inject them into the tree.

I actually originally fixed this at the parser the layer, mostly because I didn't feel that I had a good sense for the role of the paragraph node in the greater remark ecosystem, and because the existing implementation puts that responsibility on mdast-util-to-hast and not on remark-parse, so I just went with the current direction. But if you're down with the idea of me changing things on a deeper level, I do I think prefer the "stupid" paragraph nodes philosophy.

Hamms commented 6 years ago

Upon further investigation, it looks like this solution actually introduces a problem with loose lists, because in the case of

- item one

- item two

remark-parse actually does not recognize the second item as being loose:

root[1] (1:1-3:11, 0-22)
└─ list[2] (1:1-3:11, 0-22) [ordered=false][loose=true]
   ├─ listItem[1] (1:1-2:1, 0-11) [loose=true]
   │  └─ paragraph[1] (1:3-1:11, 2-10)
   │     └─ text: "item one" (1:3-1:11, 2-10)
   └─ listItem[1] (3:1-3:11, 12-22) [loose=false]
      └─ paragraph[1] (3:3-3:11, 14-22)
         └─ text: "item two" (3:3-3:11, 14-22)
Hamms commented 6 years ago

Given that any list item being loose should make the whole list behave as loose - which is a "going backwards" approach that seems pretty inconsistent with remark-parse's entire premise - it seems like this fix will require changes to both the parse and transform stages.

First, parse should be updated to consistently create "paragraph" tags while still bubbling the loose flag up as it currently does. Then, transform should be updated to purge paragraph tags in loose lists as it currently does, but to also do so for list items that contain more than one child.

wooorm commented 6 years ago

@Hamms Maybe, but it’s based on commonmark! I dunno, I’m not sure of a better approach?

Hamms commented 6 years ago

I think it's possible to keep the approach consistent, it just requires embracing the idea of paragraph nodes as being something more semantically significant than just placeholders for p tags

wooorm commented 6 years ago

I agree with that idea, it’s what we’re doing now, and it works for me. It’s similar to the idea of a paragraph in HTML (not just a <p> tag either), in most cases <p> can be omitted.

Hamms commented 6 years ago

@wooorm Please take another look!

I believe I have the ultimate functionality where we want it to be; all the various ways of combining looseness, tightness, and nesting work as I expect them to. However, I'm not sure how I feel about the actual approach; rendering the results then modifying them like I'm doing feels inconsistent with the way subtrees are handled everywhere else in the codebase, but I'm not sure of a better way to do it. I'd love to hear your thoughts!

Thanks!

Hamms commented 6 years ago

@wooorm Could you take a look at this when you get a chance?

wooorm commented 5 years ago

Hey, sorry for the slow response! I thought about it some more, see https://github.com/remarkjs/remark/pull/364 for my current thinking. I think we can use the code here as a base on those changes in remark quite easily.

Hamms commented 5 years ago

Very exciting! I'll take a look as soon as I can