mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License
8.81k stars 880 forks source link

Anchor tag with image as grandchild produces invalid Markdown #332

Open SamLeatherdale opened 4 years ago

SamLeatherdale commented 4 years ago

Affected version: 6.0.0 I found a particular case where Turndown will produce incorrect Markdown given the following HTML:

<a href="http://google.com">
  <div>
    <img src="https://via.placeholder.com/150">
  </div>
</a>

It will produce:

[

![](https://via.placeholder.com/150)

](http://google.com) 

which renders like this: (in GitHub and elsewhere) [

](http://google.com)

This occurs whether I am using linkStyle: 'inlined' or 'referenced'.

I have added a working demo on CodeSandbox, including a hacky fix I will be using in my project. https://codesandbox.io/s/turndown-image-link-issue-k12z8?file=/src/App.js

martincizek commented 4 years ago

The thing is that div is a block element, which triggers a paragraph implicitly. Whereas paragraphs are a blocks in Markdown, images are inline and cannot contain a paragraph. So your HTML input represents something not expressible in Markdown.

Consider using span instead or avoid hyperlink-wrapped blocks in another way.

SamLeatherdale commented 4 years ago

This project doesn't appear to be designed to work with perfect HTML though, it seems to be capable of taking crappy HTML from WYSIWYG editors and the like and covert it back into nice, pure Markdown, and in fact has been doing so pretty well so far.

I have been using this in my fork of a project to convert WordPress format exports (originally from Squarespace) to markdown, because the exports I got from a Squarespace site for some reason generated a lot of these <a>s wrapping <div>s wrapping <img>s.

Seems worth trying to produce something valid rather than something that is categorically incorrect, which is what the current output is.

bashby44 commented 3 years ago

@SamLeatherdale did you ever find a workaround for this? I'm running in to the same issue

SamLeatherdale commented 3 years ago

@SamLeatherdale did you ever find a workaround for this? I'm running in to the same issue

It's been a while, but I believe I ran something like Cheerio over it first and stripped out the extra div.

martincizek commented 3 years ago

We did the same as @SamLeatherdale writes on a project a while ago - just with pure DOM. I guess the guys also switched to Cheerio recently. Sharing the relevant snippets:

Helper class:

class DomFilter {
  constructor(dom) {
    this.dom = dom;
    this.window = this.dom.window;
    this.document = this.window.document;
  }

  select(xpath, contextNode = null) {
    const ctx = contextNode || this.document;
    const arr = [];
    const { ORDERED_NODE_ITERATOR_TYPE } = this.window.XPathResult;
    const it = this.document.evaluate(xpath, ctx, null, ORDERED_NODE_ITERATOR_TYPE);
    let node;
    while ((node = it.iterateNext())) {
      arr.push(node);
    }
    return arr;
  }

  // [...]

  // eslint-disable-next-line class-methods-use-this
  unwrap(nodes) {
    nodes.forEach((node) => {
      const parent = node.parentNode;
      while (node.firstChild) {
        parent.insertBefore(node.firstChild, node);
      }
      parent.removeChild(node);
    });
  }

  // [...]
}

Then something like:

const dom = ...;
const f = new DomFilter(dom);

f.unwrap(f.select('//a/div'));

// now pass the modified dom to Turndown, no need to convert it to HTML string
martincizek commented 3 years ago

Ref this:

Seems worth trying to produce something valid rather than something that is categorically incorrect, which is what the current output is.

It's a little more complicated. We need to introduce Markdown contexts to do it both universally and efficiently. This isolated case has indeed a simpler solution, but I don't like the idea to provide case-by-case fixes for these, especially when there is no universally correct solution. With contexts, users would be able to choose what to do with block elements nested in inline elements. It's simple for a div without semantics. But for e.g. a table inside a link, something valid means either discarding user's data or just keeping it in HTML (which denies Turndown itself). So in this cases, users have to choose. Now they can only choose what to do with these cases using HTML preprocessing - unfortunately.

And for the record:

This project doesn't appear to be designed to work with perfect HTML though,

True.

it seems to be capable of taking crappy HTML from WYSIWYG editors and the like and covert it back into nice, pure Markdown, and in fact has been doing so pretty well so far.

It's a very welcome side effect and we try to achieve it. Still, the higher priority is not to lose data. You're right it's a topic to work on, as I suggested above.