miyuchina / mistletoe

A fast, extensible and spec-compliant Markdown parser in pure Python.
MIT License

heading token content is always the last heading #212

Open clehene opened 9 months ago

clehene commented 9 months ago

The content for block_token.Heading is always that of the last heading in the document, whichever Heading token you read it from. For example:

import mistletoe
from mistletoe import Document

markdown_text = """
## Heading A
Some text under heading 1.
## Heading B *italic*
Some text under heading 2.
## Heading C **bold**
Some text under heading 3.
"""

# Parse the Markdown text

doc = Document(markdown_text)

for token in doc.children:
    if isinstance(token, mistletoe.block_token.Heading):
        # heading_content = " ".join(str(child.content) for child in token.children if hasattr(child, 'content'))
        heading_content = token.content
        print(f'Level: {token.level}, Content: {heading_content}')

This will output:

Level: 2, Content: Heading C **bold**
Level: 2, Content: Heading C **bold**
Level: 2, Content: Heading C **bold**

pbodnar commented 9 months ago

Hi @clehene, thanks for the report. For elements like BlockCode or HtmlBlock, the content property/attribute makes sense (see #163). For others, like Heading, not so much.

As you are not the first to ask about this (see #99), I think the simplest thing to do now is to rename Heading's content class attribute to _content, making it clear that it is not part of the public API. That would be in line with, e.g., _open_info in the CodeFence token class.
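To illustrate why a class attribute produces the reported output, here is a minimal sketch in plain Python. FakeHeading and parse are hypothetical stand-ins written for this example, not mistletoe's actual code. The sketch only assumes that the parser stashes the matched text on the class rather than on each instance.

```python
class FakeHeading:
    """Hypothetical stand-in for a heading token; not mistletoe's real class."""

    content = ""  # class attribute: a single value shared by every instance

    def __init__(self, level):
        self.level = level  # instance attribute: one value per heading

    @classmethod
    def start(cls, line):
        if not line.startswith("#"):
            return False
        # Storing the match on the class overwrites the previous heading's text.
        cls.content = line.lstrip("#").strip()
        return True


def parse(lines):
    """Collect one FakeHeading per heading line (assumed, simplified parser loop)."""
    tokens = []
    for line in lines:
        if FakeHeading.start(line):
            level = len(line) - len(line.lstrip("#"))
            tokens.append(FakeHeading(level))
    return tokens


tokens = parse(["## Heading A", "## Heading B *italic*", "## Heading C **bold**"])
shared = [t.content for t in tokens]  # every lookup falls back to the class attribute
levels = [t.level for t in tokens]    # per-instance values stay correct
print(shared)
print(levels)
```

Because no instance ever sets its own content, each lookup resolves to the class attribute, which holds whatever was matched last. The level values are unaffected because they are copied onto each instance.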

I can also imagine introducing a getter property called text, or similar, that would recursively concatenate the text of the child tokens, but that would be for another feature request...
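Until such a property exists, a helper along these lines could serve as a workaround. This is a sketch only: plain_text is a hypothetical name, and the Text and Span classes are stand-ins so that the example runs without mistletoe installed. It assumes that container tokens expose a children attribute and that leaf text tokens expose a string content attribute.

```python
from dataclasses import dataclass, field


@dataclass
class Text:
    """Stand-in for a leaf text token (assumed to carry a 'content' string)."""
    content: str


@dataclass
class Span:
    """Stand-in for any token that holds child tokens."""
    children: list = field(default_factory=list)


def plain_text(token):
    """Recursively concatenate the text of a token and all of its descendants."""
    children = getattr(token, "children", None)
    if children is not None:
        # Check children first, so a stale 'content' on a container is never used.
        return "".join(plain_text(child) for child in children)
    return getattr(token, "content", "")


# Roughly the shape of "Heading C **bold**" after inline parsing (assumed structure).
heading = Span([Text("Heading C "), Span([Text("bold")])])
result = plain_text(heading)
print(result)
```

Checking children before content matters here: a container token such as Heading may still expose the shared class-level content, which is exactly the value this helper is meant to avoid.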