mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License
8.52k stars 864 forks

Code comments starting with "#" in <span> tags are appearing as h1 headers #470

Open humanismusic opened 1 month ago

humanismusic commented 1 month ago

Hi,

I'm testing turndown on a page that includes sample Python code with comments, for example:

<span class="comment"># here's a comment</span>.

There are no pre or code tags used in the HTML. Each time I run turndown, it results in this line being output as is and displaying like an H1:

here's a comment

Here is the rule I'm using to attempt to exclude these:

// handle potential code comments or lines starting with "#" in <span> tags
turndownService.addRule('spanWithHash', {
    filter: (node) => {
        const hasHash = node.nodeName === 'SPAN' && node.textContent.trim().startsWith('#');
        if (hasHash) {
            console.log("Found span filter #: ", node.textContent.trim()); // log for debugging
        }
        return hasHash;
    },
    replacement: (content, node) => {
        return content.trim().substring(1);
    }
});

I also tried escaping the # and similar approaches before deciding to try removing it altogether. Despite this, the text within these tags is still being treated as headers in the resulting markdown.

Could you help identify why this rule isn't working as intended?

martincizek commented 1 month ago

I'm testing turndown on a page that includes sample python code with comments, example:

<span class="comment"># here's a comment</span>.

There are no pre or code tags used in the HTML. Every time it results in h1 output like so:

here's a comment

I can't reproduce it. Could you please put together a complete input that reproduces the problem?
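
One possible explanation, offered as an assumption rather than a confirmed diagnosis: if the default escaping has already turned the leading "# " into "\# " by the time the custom replacement receives content, then content.trim().substring(1) removes the backslash instead of the hash, which would bring the header back. The sketch below does not call turndown at all. escapeLeadingHash is a hypothetical stand-in for that escaping step, and stripLeadingHash is an illustrative alternative, not part of the library.

```javascript
// Hypothetical stand-in for the heading escape step (an assumption,
// not turndown's actual code): prefix a leading "# " with a backslash.
function escapeLeadingHash(text) {
  return text.replace(/^(#{1,6}) /, '\\$1 ');
}

// The replacement logic from the rule in the report, unchanged.
function replacementFromReport(content) {
  return content.trim().substring(1);
}

// Illustrative alternative: drop an optional backslash, the run of
// hashes, and any whitespace that follows them.
function stripLeadingHash(content) {
  return content.trim().replace(/^\\?#+\s*/, '');
}

const escaped = escapeLeadingHash("# here's a comment");
const reported = replacementFromReport(escaped);
const stripped = stripLeadingHash(escaped);

console.log(JSON.stringify(escaped));  // "\\# here's a comment"
console.log(JSON.stringify(reported)); // "# here's a comment"
console.log(JSON.stringify(stripped)); // "here's a comment"
```

If the assumption holds, the rule itself is what re-creates the header, and a replacement along the lines of stripLeadingHash would avoid it. A complete HTML input is still needed to confirm this against the real library.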
