Closed otabekoff closed 2 months ago
When I explained my issue to ChatGPT, I got this:
In HTML-to-Markdown conversion using Turndown, headers that are nested inside anchor tags (<a>) result in incorrect Markdown output. This issue arises from the fact that Markdown does not support headers inside links, leading to inconsistencies in rendering when the content is converted.
When Turndown processes HTML containing headers nested within anchor tags, the resulting Markdown output includes the link text incorrectly formatted. For example, an HTML structure like:
<a href="/foo"><h3>bar</h3></a>
is converted to:
[
### bar
](/foo)
This results in an undesirable format when rendered, as GitHub and other Markdown processors display:
[bar](/foo)
This output is inconsistent with standard Markdown behavior and can be confusing for users.
To address this issue, a custom rule was proposed to handle links inside headers by stripping out the link elements and preserving only the text content. The revised rule ensures that headers are formatted correctly without including the link markup. The proposed solution involves selecting heading elements (<h1> to <h6>) for processing. The updated custom rule is added to the Turndown service as follows:
// Custom rule to remove links from headings
turndownService.addRule('removeLinksFromHeadings', {
  filter: function (node) {
    return node.nodeName.match(/^H[1-6]$/);
  },
  replacement: function (content, node) {
    const level = Number(node.nodeName.charAt(1));
    const prefix = '#'.repeat(level);
    let textContent = '';
    node.childNodes.forEach(child => {
      if (child.nodeType === Node.TEXT_NODE) {
        textContent += child.textContent;
      } else if (child.nodeType === Node.ELEMENT_NODE) {
        if (child.nodeName === 'A') {
          textContent += child.textContent;
        } else {
          textContent += new TurndownService().turndown(child.outerHTML);
        }
      }
    });
    return `\n\n${prefix} ${textContent.trim()}\n\n`;
  }
});
The custom rule effectively resolves the issue of incorrect Markdown output for headers inside anchor tags. By removing links and ensuring proper heading formatting, it improves the consistency and readability of the converted Markdown content.
This solution enhances Turndown's capability to handle complex HTML structures and provides a more reliable conversion to Markdown.
I don't think this is something Turndown needs to support. The following works perfectly fine:
<h1><a href="/foo">bar</a></h1>
# [bar](/foo)
So the issue only occurs when the <a> element is outside the heading element. You can preprocess such content to switch the order of the elements if you know your content is formatted like that.
By the way, thank you for clearly stating that ChatGPT helped you with generating the report.
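The preprocessing suggested above could be sketched roughly as follows. This is a minimal, regex-based illustration that only covers the simple case of an anchor wrapping exactly one heading; the helper name `moveAnchorInsideHeading` is hypothetical, and a DOM-based swap would be more robust for arbitrary markup.

```javascript
// Hypothetical helper (not part of Turndown): rewrites
//   <a ...><hN ...>text</hN></a>
// into
//   <hN ...><a ...>text</a></hN>
// so the link ends up inside the heading.
function moveAnchorInsideHeading(html) {
  // $1 = anchor attributes, $2 = heading tag name,
  // $3 = heading attributes, $4 = heading content
  const pattern =
    /<a(\s[^>]*)?>\s*<(h[1-6])(\s[^>]*)?>([\s\S]*?)<\/\2>\s*<\/a>/gi;
  return html.replace(pattern, '<$2$3><a$1>$4</a></$2>');
}

// Usage, assuming `turndownService` is an existing TurndownService instance:
// const markdown = turndownService.turndown(moveAnchorInsideHeading(html));
```

Markup that does not match the pattern, including headings that already contain their link, is left untouched.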
@pavelhoral But ChatGPT's answer didn't help me, because it doesn't get rid of the anchor elements. Can you help me with this problem?
I want to get something like this:
# Turndown
Convert HTML into Markdown with JavaScript.
## Project Updates
not something like this:
# Turndown
[](https://www.npmjs.com/package/turndown#turndown)
Convert HTML into Markdown with JavaScript.
## Project Updates
[](https://www.npmjs.com/package/turndown#project-updates)
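One possible workaround for the unwanted output above, not something proposed in this thread, is to post-process the Markdown string and drop lines that consist only of an empty-text link. The sketch below assumes the links look exactly like the `[](...)` lines shown and that their URLs contain no parentheses; `stripEmptyLinks` is a hypothetical name.

```javascript
// Hypothetical post-processing step: removes lines that contain nothing
// but an empty-text link such as [](https://example.com/#section),
// then collapses the blank lines left behind.
function stripEmptyLinks(markdown) {
  return markdown
    .replace(/^[ \t]*\[\]\([^)\n]*\)[ \t]*$\n?/gm, '')
    .replace(/\n{3,}/g, '\n\n')
    .trim();
}

// Usage, assuming `turndownService` is an existing TurndownService instance:
// const cleaned = stripEmptyLinks(turndownService.turndown(html));
```

Filtering the anchors out at the rule level is usually cleaner, since it does not depend on how the links happen to be serialized.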
Read the HTML specifications regarding block and inline elements. Then you know that block elements can't be inside inline elements. And A is inline, and H1 is block. So why do you have terrible invalid HTML?
Ok, but even if the h1 elements are not inside anchor tags, anchors may still sit next to the h1 elements. Let's say I'm implementing a "Paste as Markdown" feature using Turndown: when I copy pages created with static site generators like VitePress, the headings have anchors next to them. When I copy the whole page and paste it, those links are pasted as well. I don't want them, but the code I've provided above is not working.
@the-djmaze
those headings have anchors next to them
That is way old school and totally unnecessary.
Anchors should be like <h1 id="anchorname">
When I explained my issue to ChatGPT, I got this:
Report: Handling Headers Inside Anchor Tags in Turndown
Issue Summary
In HTML-to-Markdown conversion using Turndown, headers that are nested inside anchor tags (<a>)
ChatGPT just misunderstood your problem. There are no headers nested in anchor tags in your use case. The rest is garbage.
First of all, all the subsequent posters are right that headers nested in anchor tags would be an ugly mess. It would be fixable through DOM preprocessing, and I do even have code for exchanging nodes selected by XPath with their parent nodes. No need for it though.
What you need is just to identify the anchor elements and get rid of them. The following code should work for browser-side HTML generated by GitHub.
const TurndownService = require('turndown');
const turndownService = new TurndownService();
turndownService.addRule('anchor', {
  filter: function (node) {
    const aclass = node.getAttribute("class");
    return node.nodeName === 'A' && aclass === 'anchor';
  },
  replacement: function (content, node) {
    return content; // safer, but some anchors may contain something like '#'
    // return ''; // use only when sure there can't be any valuable content
  },
});
// copied from the actual GitHub Markdown rendering in browser developer tools
const html = `
<div class="markdown-heading" dir="auto">
<h1 tabindex="-1" class="heading-element" dir="auto">Turndown</h1>
<a id="user-content-turndown" class="anchor" aria-label="Permalink: Turndown" href="#turndown">
<svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16"
aria-hidden="true">
<path d="m7.775 3.275 1.25-1.25a3.5 3.5 0 1 1 4.95 4.95l-2.5 2.5a3.5 3.5 0 0 1-4.95 0 .751.751 0 0 1 .018-1.042.751.751 0 0 1 1.042-.018 1.998 1.998 0 0 0 2.83 0l2.5-2.5a2.002 2.002 0 0 0-2.83-2.83l-1.25 1.25a.751.751 0 0 1-1.042-.018.751.751 0 0 1-.018-1.042Zm-4.69 9.64a1.998 1.998 0 0 0 2.83 0l1.25-1.25a.751.751 0 0 1 1.042.018.751.751 0 0 1 .018 1.042l-1.25 1.25a3.5 3.5 0 1 1-4.95-4.95l2.5-2.5a3.5 3.5 0 0 1 4.95 0 .751.751 0 0 1-.018 1.042.751.751 0 0 1-1.042.018 1.998 1.998 0 0 0-2.83 0l-2.5 2.5a1.998 1.998 0 0 0 0 2.83Z"></path>
</svg>
</a>
</div>`;
console.log(turndownService.turndown(html)); // works as OP expected
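The rule above matches only when the class attribute is exactly `anchor`. A slightly more tolerant variant, sketched below, checks each class name separately and accepts a list of permalink classes. The `header-anchor` entry is an assumption about what VitePress-style generators emit and should be verified against the actual pasted HTML; `isPermalinkAnchor` and `permalinkAnchorRule` are hypothetical names.

```javascript
// Class names treated as heading permalinks. 'anchor' comes from the GitHub
// markup above; 'header-anchor' is an assumption for VitePress-like output.
const PERMALINK_CLASSES = ['anchor', 'header-anchor'];

// Returns true for <a> elements carrying at least one permalink class,
// even when other classes are present on the same element.
function isPermalinkAnchor(node, classes = PERMALINK_CLASSES) {
  if (node.nodeName !== 'A') return false;
  const classAttr = node.getAttribute('class') || '';
  const names = classAttr.split(/\s+/).filter(Boolean);
  return names.some((name) => classes.includes(name));
}

const permalinkAnchorRule = {
  // Wrapped in an arrow function so that Turndown's second argument
  // (its options object) is not mistaken for the class list.
  filter: (node) => isPermalinkAnchor(node),
  // Drops the anchor entirely; return `content` instead to keep its text.
  replacement: () => '',
};

// Usage, assuming `turndownService` is an existing TurndownService instance:
// turndownService.addRule('permalinkAnchor', permalinkAnchorRule);
```

Returning an empty string discards anything inside the anchor, so this fits only when the anchors are known to hold nothing but an icon or a `#` marker.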
Closing remark: Turndown should not be opinionated and the default processing should be as straightforward as possible. That's why these anchors are rendered by default - they are in the source. However, it's possible that we'll publish some DOM preprocessing mini-library and some rule cookbook in the future.
Let me know if this solves the problem.
Yes, that resolved my issue, so I'll go ahead and close it now. By the way, I really like your idea of a DOM preprocessing mini-library and a rule cookbook—it sounds exciting! I'd love to see it come to life someday.
Just curious, have you thought about setting up a GitHub Projects board? It could be a great way to organize and track your future plans. And if you do, be sure to add this feature to the list, linking this thread. If you ever decide to work on it and remember me, feel free to drop an update here; I'd love to check it out.
I'm having an issue with heading links showing up when converting HTML headings. I've just tested it with the Turndown README on GitHub and npmjs.com.
Generated content
Expected to get
I've implemented a PasteAsMarkdown.vue component page/view and my rules are not working.
Here is the full code:
```vue