ocaml / omd

extensible Markdown library and tool in "pure OCaml"
ISC License

unicode link label normalization (fix test 539) #277

Closed: tatchi closed this 2 years ago

tatchi commented 2 years ago

Input:

[ẞ]

[SS]: /url

This is the result in master:

File "tests/spec-539.html", line 1, characters 0-0:
diff --git a/_build/default/tests/spec-539.html b/_build/default/tests/spec-539.html.new
index c31102f..4e28a6f 100644
--- a/_build/default/tests/spec-539.html
+++ b/_build/default/tests/spec-539.html.new
@@ -1 +1 @@
-<p><a href="/url">ẞ</a></p>
+<p>[ẞ]</p>
make: *** [test] Error 1   

The issue is that the two labels are not matched, so the input is not recognized as a link. To match labels, we need to normalize them (strip leading/trailing whitespace, ...) and compare them case-insensitively. For Unicode this is a bit more complex, because it requires Unicode case folding rather than plain ASCII lowercasing. From the spec:

One label matches another just in case their normalized forms are equal. To normalize a label, strip off the opening and closing brackets, perform the Unicode case fold, strip leading and trailing spaces, tabs, and line endings, and collapse consecutive internal spaces, tabs, and line endings to a single space.

This PR adapts the normalize function so that it also works with Unicode labels. Fortunately, I could rely on existing libraries (uutf, uucp, and uunf), and I even found a piece of code in the doc that does almost what's needed.
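To illustrate the normalization steps quoted from the spec, here is a minimal stdlib-only sketch (it assumes OCaml 4.14 or later for `String.get_utf_8_uchar`). It is not the code from this PR: `demo_fold` is a hypothetical stand-in that covers only ASCII letters and the two sharp-s code points, whereas a real implementation would take the full case-folding table from a library such as uucp. The label is assumed to have its brackets already stripped.

```ocaml
(* Demo-only case fold (hypothetical): ASCII letters plus U+00DF and U+1E9E.
   A real implementation needs the full Unicode case-folding table. *)
let demo_fold (u : Uchar.t) : Uchar.t list =
  let c = Uchar.to_int u in
  if c >= 0x41 && c <= 0x5A then [ Uchar.of_int (c + 32) ]
  else if c = 0x1E9E || c = 0xDF then [ Uchar.of_char 's'; Uchar.of_char 's' ]
  else [ u ]

(* Spaces, tabs, and line endings, as listed in the spec text. *)
let is_space (u : Uchar.t) : bool =
  match Uchar.to_int u with
  | 0x20 | 0x09 | 0x0A | 0x0D -> true
  | _ -> false

(* Normalize a label whose brackets are already stripped: case-fold every
   code point, drop leading/trailing whitespace, and collapse internal
   whitespace runs to a single space. *)
let normalize ?(fold = demo_fold) (label : string) : string =
  let buf = Buffer.create (String.length label) in
  let pending_space = ref false in
  let rec loop i =
    if i < String.length label then begin
      let d = String.get_utf_8_uchar label i in
      (* Invalid byte sequences decode to U+FFFD. *)
      let u = Uchar.utf_decode_uchar d in
      if is_space u then begin
        (* Only remember whitespace once some content has been emitted. *)
        if Buffer.length buf > 0 then pending_space := true
      end else begin
        if !pending_space then Buffer.add_char buf ' ';
        pending_space := false;
        List.iter (Buffer.add_utf_8_uchar buf) (fold u)
      end;
      loop (i + Uchar.utf_decode_length d)
    end
  in
  loop 0;
  Buffer.contents buf

(* The two labels from the failing test normalize to the same key. *)
let () = assert (normalize "ẞ" = normalize "SS")
```

Passing the fold as an optional argument keeps the whitespace handling separate from the Unicode data, so the demo table can be swapped for a complete one without touching the rest.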

With the adapted normalize function, ẞ and SS are matched, and the result is now a link, as expected.

tatchi commented 2 years ago

Thanks for the review, @shonfeder. I added extra tests to cover the normalization of labels in https://github.com/ocaml/omd/pull/277/commits/49d3ca2a12583d0804052ab40406b916081a7ebc