mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.
MIT License

Leave callback not called for HTML block comment #202

Closed step- closed 5 months ago

step- commented 5 months ago

For the specific case of a one-line block comment (HTML block type 2) not followed by a blank line[1], MD4C fails to call the leave callback. Consequently, if another HTML comment follows the first comment, both end up being combined into a single block instead of staying as two separate blocks, each with its own callback.

[1] In my comments to issue https://github.com/mity/md4c/issues/200 I remarked that the spec doesn't require a blank line to close a type 2 HTML block.

DETAILS

I can't think of a simple way to demonstrate this issue using md2html alone, because its output cannot show that a callback is missing. Feeding md2html the following Markdown:

<!-- C1 -->
<!-- C2 -->

outputs the input text unchanged, so one would be inclined to think that everything is correct. However, what can't be seen is that no leave callback is made between lines C1 and C2, although there should be one. Instead, a single leave callback after line C2 closes both lines together, wrongly combining the two single-line blocks into one two-line block. Trace the code or add printf statements as needed to see the issue at work.
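To make the expected callback sequence concrete, here is a minimal, self-contained sketch of the CommonMark rule for type 2 HTML blocks: a block starts on a line beginning with `<!--` (after at most three spaces) and ends on the first line that contains `-->`. This is not MD4C code and does not call MD4C; the function name `trace_comment_blocks` and the one-letter event notation (E = enter block, T = text line, L = leave block) are assumptions made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Type 2 start condition: up to three spaces of indentation, then "<!--". */
static int opens_comment_block(const char *line, size_t len)
{
    size_t i = 0;
    while (i < len && i < 3 && line[i] == ' ')
        i++;
    return len - i >= 4 && memcmp(line + i, "<!--", 4) == 0;
}

/* Type 2 end condition: the line contains "-->" anywhere. */
static int closes_comment_block(const char *line, size_t len)
{
    for (size_t i = 0; i + 3 <= len; i++)
        if (memcmp(line + i, "-->", 3) == 0)
            return 1;
    return 0;
}

/* Appends one event character, always leaving room for the terminating NUL. */
static void emit(char *trace, size_t cap, size_t *n, char ev)
{
    if (*n + 1 < cap)
        trace[(*n)++] = ev;
}

/* Hypothetical helper: scans text line by line, writes the event sequence
 * for type 2 HTML blocks into trace, and returns the number of blocks.
 * Lines outside a comment block are ignored by this toy scanner. */
static int trace_comment_blocks(const char *text, char *trace, size_t cap)
{
    size_t n = 0;
    int in_block = 0, blocks = 0;

    assert(text != NULL && trace != NULL && cap > 0);

    for (const char *p = text; *p != '\0'; ) {
        const char *eol = strchr(p, '\n');
        size_t len = eol ? (size_t)(eol - p) : strlen(p);

        if (!in_block && opens_comment_block(p, len)) {
            in_block = 1;
            blocks++;
            emit(trace, cap, &n, 'E');
        }
        if (in_block) {
            emit(trace, cap, &n, 'T');
            /* The end condition may be met on the same line that opened the block. */
            if (closes_comment_block(p, len)) {
                in_block = 0;
                emit(trace, cap, &n, 'L');
            }
        }
        p += len;
        if (*p == '\n')
            p++;
    }
    if (in_block)
        emit(trace, cap, &n, 'L'); /* an unterminated block ends with the document */
    trace[n] = '\0';
    return blocks;
}
```

For the two-comment input above, this sketch reports two blocks with the trace `ETLETL`. In the same notation, the behavior described in this report would read roughly `ETTL`: one enter, both comment lines as text, one leave.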

mity commented 5 months ago

Thanks for the report. If I understand correctly, the issue is that the implementation sort of merges two HTML blocks together if they follow one after another, and calls the enter_block just once before the 1st one, and leave_block after the last one.

However, I wonder what your expectations really are and whether we need to fix this at all. Consider especially that a type 7 HTML block may contain not one but a whole set of HTML tags, and it may even be an invalid HTML chunk after the initial tag. All that matters is whether it satisfies the opening and closing conditions.

I.e. what enter_block(MD_BLOCK_HTML) tells the renderer is: whatever you get in the text_callback should be treated as raw HTML until you see leave_block(MD_BLOCK_HTML). However, it does not generally break the HTML into individual tags and does not even attempt any HTML validation of it beyond the Markdown specification, and it wouldn't (at least for type 7) even after "fixing" this.
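The contract described above can be sketched from the renderer's side. The following is a rough model only: `renderer`, `block_type`, `on_enter_block`, `on_leave_block` and `on_text` are local stand-ins invented for this example, not MD4C's types or callback signatures. It shows a renderer that passes text through verbatim while inside an HTML block and escapes it otherwise, and that can only count as many HTML blocks as it receives enter events for.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local stand-ins for illustration only; these are not MD4C's types. */
typedef enum { BLOCK_PARAGRAPH, BLOCK_HTML } block_type;

typedef struct {
    int in_raw_html;   /* nonzero between enter and leave of an HTML block */
    int html_blocks;   /* how many HTML blocks the renderer was told about */
    char out[256];     /* fixed-size output buffer; excess output is dropped */
    size_t len;
} renderer;

/* Appends up to n bytes of s, truncating if the buffer is full. */
static void append(renderer *r, const char *s, size_t n)
{
    assert(r != NULL && s != NULL);
    size_t room = sizeof r->out - 1 - r->len;
    if (n > room)
        n = room;
    memcpy(r->out + r->len, s, n);
    r->len += n;
    r->out[r->len] = '\0';
}

static void on_enter_block(renderer *r, block_type t)
{
    if (t == BLOCK_HTML) {
        r->in_raw_html = 1;
        r->html_blocks++;
    }
}

static void on_leave_block(renderer *r, block_type t)
{
    if (t == BLOCK_HTML)
        r->in_raw_html = 0;
}

/* Inside an HTML block the text is emitted as-is; elsewhere it is escaped. */
static void on_text(renderer *r, const char *s)
{
    for (; *s != '\0'; s++) {
        if (r->in_raw_html)
            append(r, s, 1);
        else if (*s == '<')
            append(r, "&lt;", 4);
        else if (*s == '>')
            append(r, "&gt;", 4);
        else if (*s == '&')
            append(r, "&amp;", 5);
        else
            append(r, s, 1);
    }
}
```

If both comment lines arrive between a single enter/leave pair, the rendered output is byte-for-byte what it would be for two separate blocks, but `html_blocks` ends up as 1. An application that only copies raw HTML through is unaffected, while one that acts per block sees a difference.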

step- commented 5 months ago

You understood correctly but enlarged the scope to all HTML block types. I'm only concerned about type 2 HTML blocks. Not interested in type 7. Indeed PR #203 specifically uses the value of html_block_type (2) to restrict the scope of my change.

step- commented 5 months ago

My expectation is that it gets fixed for type 2 HTML blocks because cmark renders every HTML block comment as a separate block, and the application I'm developing on top of MD4C relies on the same behavior.