ocaml / omd

extensible Markdown library and tool in "pure OCaml"
ISC License
156 stars 45 forks source link

automatically fill the id attributes of the headings #267

Closed tatchi closed 1 year ago

tatchi commented 2 years ago

Fixes: #251

This is my attempt at implementing #251

I didn't add an option to enable/disable that feature. Let me know if I should.

shonfeder commented 2 years ago

Ah, I just looked at the failures in CI. There are two issues:

  1. To maintain compatibility with OCaml 4.04.2, we'll either need to avoid using the Fun module and write our implementations, or pull in https://github.com/thierry-martinez/stdcompat (or an equivalent, if available). I'm quite happy with doing the latter.
  2. The failures on recent version of OCaml show why this actually does have to be configurable: it breaks all of our specification compliance tests :). I suggest we make this configurable by CLI, by default to having it enabled.

How does that sound?

tatchi commented 2 years ago
  1. To maintain compatibility with OCaml 4.04.2, we'll either need to avoid using the Fun module and write our implementations, or pull in https://github.com/thierry-martinez/stdcompat (or an equivalent, if available). I'm quite happy with doing the latter.

I just inlined the function as it was straightforward to do: https://github.com/ocaml/omd/pull/267/commits/0570385a3f8da16ed65929c7b4c983803a5e4d68

  1. The failures on recent version of OCaml show why this actually does have to be configurable: it breaks all of our specification compliance tests :). I suggest we make this configurable by CLI, by default to having it enabled.

I added an --auto-identifiers option (inspired by pandoc) in https://github.com/ocaml/omd/pull/267/commits/cb088d1606baf9a9956c7bcd07f2ad8aa1d6bb54. Since it's enabled by default it still break the tests. Should I promote them or disable that option when running the tests?

As for the identifiers, there's actually some rules described in the doc on how to derive them:

The default algorithm used to derive the identifier from the heading text is:

  • Remove all formatting, links, etc.
  • Remove all footnotes.
  • Remove all non-alphanumeric characters, except underscores, hyphens, and periods.
  • Replace all spaces and newlines with hyphens.
  • Convert all alphabetic characters to lowercase.
  • Remove everything up to the first letter (identifiers may not begin with a number or punctuation mark).
  • If nothing is left after this, use the identifier section.

I had to switch to an implementation that supports Unicode characters in https://github.com/ocaml/omd/pull/267/commits/eec7388c4dfa0c5b9b426cc09c58adc8498baa6e (using the Uucp and Uutf modules). It's not yet perfect as some of the test results are not correct. I'll continue to try to improve that :)

tatchi commented 2 years ago

I improved the algorithm slightly; it should be ok for a second review.

There are still some differences compared to the result I get on https://pandoc.org/try/, but pandoc can only generate id's when using markdown and not commonmark. So the outputted HTML content (from which the id is derived) might be different in some cases, hence why the result might differ.

The only thing missing I think is:

several headings have the same text; in this case, the first will get an identifier as described above; the second will get the same identifier with -1 appended; the third with -2; and so on.

So for instance in pandoc, the following

### hello
### hello

becomes:

<h3 id="hello">hello</h3>
<h3 id="hello-1">hello</h3>

I guess it means that we'll need to keep a state somewhere with all the ids we've already generated. Not sure how to best implement that and if that's something we want. Would love to have your input on that :)

We'll also have to decide how to handle "empty" id. In pandoc:

If nothing is left after this, use the identifier section.

So

### 33
### 22
### 

becomes

<h3 id="section">33</h3>
<h3 id="section-1">22</h3>
<h3 id="section-2"></h3>
tmattio commented 2 years ago

cc @koonwen, this would probably fix https://github.com/ocaml/ocaml.org/issues/523!

shonfeder commented 2 years ago

We'll also have to decide how to handle "empty" id. In pandoc:

I'm fine with following pandoc here.

shonfeder commented 1 year ago

Closing this in favor of #294. I opened up the new branch purely for the purposes of making merge conflict resolution easier.

Thanks so much, @tatchi!