mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License

Links are being generated for headings. #483

Closed otabekoff closed 2 months ago

otabekoff commented 3 months ago

I'm getting empty links generated for HTML headings. I've just tested this with the Turndown README as rendered on GitHub and on npmjs.com.

Generated content

# Turndown

[](https://www.npmjs.com/package/turndown#turndown)

Convert HTML into Markdown with JavaScript.

## Project Updates

[](https://www.npmjs.com/package/turndown#project-updates)

Expected to get

# Turndown

Convert HTML into Markdown with JavaScript.

## Project Updates

I've implemented a PasteAsMarkdown.vue component page/view and my rules are not working.

Here is the full code: ```vue ```

otabekoff commented 3 months ago

When I explain my issue to ChatGPT, I've got this:

Report: Handling Headers Inside Anchor Tags in Turndown

Issue Summary

In HTML-to-Markdown conversion using Turndown, headers that are nested inside anchor tags (<a>) result in incorrect Markdown output. This issue arises from the fact that Markdown does not support headers inside links, leading to inconsistencies in rendering when the content is converted.

Description

When Turndown processes HTML containing headers nested within anchor tags, the resulting Markdown output includes the link text incorrectly formatted. For example, an HTML structure like:

<a href="/foo"><h3>bar</h3></a>

is converted to:

[

### bar

](/foo)

This results in an undesirable format when rendered, as GitHub and other Markdown processors display:

[bar](/foo)

This output is inconsistent with standard Markdown behavior and can be confusing for users.

Proposed Solution

To address this issue, a custom rule was proposed to handle links inside headers by stripping out the link elements and preserving only the text content. The revised rule ensures that headers are formatted correctly without including the link markup. The proposed solution involves:

  1. Filtering Headings: Identifying HTML heading elements (<h1> to <h6>) for processing.
  2. Removing Links: Extracting and removing links and other inline elements within the heading while preserving text content.
  3. Formatting Output: Constructing the Markdown heading with the appropriate level and cleaned text content.

Implementation

The updated custom rule is added to the Turndown service as follows:

// Custom rule to remove links from headings
turndownService.addRule('removeLinksFromHeadings', {
  filter: function (node) {
    return node.nodeName.match(/^H[1-6]$/);
  },
  replacement: function (content, node) {
    const level = Number(node.nodeName.charAt(1));
    const prefix = '#'.repeat(level);

    let textContent = '';
    node.childNodes.forEach(child => {
      if (child.nodeType === Node.TEXT_NODE) {
        textContent += child.textContent;
      } else if (child.nodeType === Node.ELEMENT_NODE) {
        if (child.nodeName === 'A') {
          textContent += child.textContent;
        } else {
          textContent += new TurndownService().turndown(child.outerHTML);
        }
      }
    });

    return `\n\n${prefix} ${textContent.trim()}\n\n`;
  }
});

Key Points

Impact

The custom rule effectively resolves the issue of incorrect Markdown output for headers inside anchor tags. By removing links and ensuring proper heading formatting, it improves the consistency and readability of the converted Markdown content.

Next Steps

This solution enhances Turndown's capability to handle complex HTML structures and provides a more reliable conversion to Markdown.

pavelhoral commented 3 months ago

I don't think this is something Turndown needs to support. The following works perfectly fine:

<h1><a href="/foo">bar</a></h1>
# [bar](/foo)

So the issue arises only when the <a> element wraps the heading element. If you know your content is structured like that, you can preprocess it to swap the order of the elements.

By the way, thank you for clearly stating that ChatGPT helped you generate the report.

otabekoff commented 3 months ago

@pavelhoral But ChatGPT's answer didn't help me, because it doesn't get rid of the anchor elements. Can you help me with this problem?

I want to get something like this:

# Turndown

Convert HTML into Markdown with JavaScript.

## Project Updates

not something like this:

# Turndown

[](https://www.npmjs.com/package/turndown#turndown)

Convert HTML into Markdown with JavaScript.

## Project Updates

[](https://www.npmjs.com/package/turndown#project-updates)

the-djmaze commented 3 months ago

Read the HTML specifications regarding block and inline elements. Then you know that block elements can't be inside inline elements. And A is inline, and H1 is block. So why do you have terrible invalid HTML?

otabekoff commented 2 months ago

Read the HTML specifications regarding block and inline elements. Then you know that block elements can't be inside inline elements. And A is inline, and H1 is block. So why do you have terrible invalid HTML?

OK, but even if the h1 elements are not inside anchor tags, the anchors may still sit next to the h1 elements. Let's say I'm implementing a "Paste as Markdown" feature using Turndown. When I copy pages created with static site generators like VitePress, the headings have anchors next to them. When I copy a whole page and paste it, those links are pasted as well. I don't want them, but the code I provided above is not working.

@the-djmaze

the-djmaze commented 2 months ago

those headings have anchors next to them

That is way old school and totally unnecessary. Anchors should be like <h1 id="anchorname">

martincizek commented 2 months ago

When I explain my issue to ChatGPT, I've got this:

Report: Handling Headers Inside Anchor Tags in Turndown

Issue Summary

In HTML-to-Markdown conversion using Turndown, headers that are nested inside anchor tags (<a>)

ChatGPT just misunderstood your problem. There are no headers nested in anchor tags in your use case. The rest is garbage.

First of all, all the subsequent posters are right that headers nested in anchor tags would be an ugly mess. It would be fixable through DOM preprocessing, and I even have code for swapping nodes selected by XPath with their parent nodes. No need for it though.

What you need is just to identify the anchor elements and get rid of them. The following code should work for browser-side HTML generated by GitHub.

const TurndownService = require('turndown');
const turndownService = new TurndownService();

turndownService.addRule('anchor', {
    filter: function (node) {
        const aclass = node.getAttribute("class");
        return node.nodeName === 'A' && aclass === 'anchor';
    },
    replacement: function (content, node) {
        return content; // safer, but some anchors may contain something like '#'
        // return ''; // use only when sure there can't be any valuable content
    },
});

// copied from the actual GitHub Markdown rendering in browser developer tools
const html = `
<div class="markdown-heading" dir="auto">
    <h1 tabindex="-1" class="heading-element" dir="auto">Turndown</h1>
    <a id="user-content-turndown" class="anchor" aria-label="Permalink: Turndown" href="#turndown">
        <svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16"
            aria-hidden="true">
            <path d="m7.775 3.275 1.25-1.25a3.5 3.5 0 1 1 4.95 4.95l-2.5 2.5a3.5 3.5 0 0 1-4.95 0 .751.751 0 0 1 .018-1.042.751.751 0 0 1 1.042-.018 1.998 1.998 0 0 0 2.83 0l2.5-2.5a2.002 2.002 0 0 0-2.83-2.83l-1.25 1.25a.751.751 0 0 1-1.042-.018.751.751 0 0 1-.018-1.042Zm-4.69 9.64a1.998 1.998 0 0 0 2.83 0l1.25-1.25a.751.751 0 0 1 1.042.018.751.751 0 0 1 .018 1.042l-1.25 1.25a3.5 3.5 0 1 1-4.95-4.95l2.5-2.5a3.5 3.5 0 0 1 4.95 0 .751.751 0 0 1-.018 1.042.751.751 0 0 1-1.042.018 1.998 1.998 0 0 0-2.83 0l-2.5 2.5a1.998 1.998 0 0 0 0 2.83Z"></path>
        </svg>
    </a>
</div>`;

console.log(turndownService.turndown(html)); // works as OP expected
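
The rule above compares the class attribute to the exact string 'anchor', so an anchor carrying additional classes would not match. As a rough sketch only, the filter could be pulled out into a helper that checks individual class tokens. The name `isPermalinkAnchor` and the `PERMALINK_CLASSES` list are hypothetical, and 'header-anchor' is an assumption about what VitePress emits; check the actual markup of the pages being copied.

```javascript
// Class names treated as permalink anchors. 'anchor' comes from the GitHub
// markup above; 'header-anchor' is an assumption about VitePress output.
const PERMALINK_CLASSES = ['anchor', 'header-anchor'];

// Hypothetical helper: returns true for <a> elements whose class attribute
// contains at least one of the class names listed above.
function isPermalinkAnchor(node) {
    if (node.nodeName !== 'A') return false;
    const classAttr = node.getAttribute('class') || '';
    const tokens = classAttr.split(/\s+/).filter(Boolean);
    return tokens.some((token) => PERMALINK_CLASSES.includes(token));
}

// Usage with the turndownService instance created above:
// turndownService.addRule('permalinkAnchor', {
//     filter: isPermalinkAnchor,
//     replacement: (content) => content,
// });
```

Keeping the predicate separate from the rule makes it possible to check it against sample nodes without running a full conversion.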

Closing remark: Turndown should not be opinionated and the default processing should be as straightforward as possible. That's why these anchors are rendered by default - they are in the source. However, it's possible that we'll publish some DOM preprocessing mini-library and some rule cookbook in the future.

Let me know if this solves the problem.

otabekoff commented 2 months ago

Yes, that resolved my issue, so I'll go ahead and close it now. By the way, I really like your idea of a DOM preprocessing mini-library and a rule cookbook—it sounds exciting! I'd love to see it come to life someday.

Just curious, have you thought about setting up a GitHub Projects board? It could be a great way to organize and track your future plans. And if you do, be sure to add this feature to the list, linking this thread. If you ever decide to work on it and remember me, feel free to drop an update here. I'd love to check it out.