mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License
8.52k stars, 864 forks

Line break in formatting not respected #433

Closed (vwkd closed 8 months ago)

vwkd commented 1 year ago

Turndown doesn't seem to respect line breaks inside formatting tags, as in <strong><br></strong>.

For example, this input:

<span>foo</span><strong><br></strong><span>bar</span>

generates:

foobar

I'd expect it to generate the same as the following input without formatting tags:

<span>foo</span><br><span>bar</span>

which generates:

foo
bar

taythebot commented 10 months ago

I added <br> parsing with a rule myself

// Custom rule for <br>: emit a bare newline
turndownService.addRule('br', {
    filter: 'br',
    replacement: () => '\n',
});

bensquire commented 8 months ago

Seeing the same issue, but a bit more complex:

<p>As a <strong>user<br /></strong>I want to</p>

Generates:

As a userI want to

When it should generate:

As a user I want to

We've worked around this by running the following before the Markdown conversion:

.replace(/<br\s*\/?>(<\/(strong|em|s|code)>)/gi, '$1<br />')
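A minimal sketch of the same string-level workaround as a reusable function, in case it helps. The name hoistTrailingBreaks and the tag list are assumptions made for illustration, not part of Turndown. It repeats the replacement until nothing changes, so a break nested inside several formatting tags is moved out one level per pass. It only recognises attribute-free <br>, <br/> and <br /> tags.

```javascript
// Hypothetical helper (not part of Turndown): move a <br> that sits directly
// before a closing inline formatting tag to just after that tag.
const INLINE_TAGS = 'strong|b|em|i|s|code'; // assumed list, adjust as needed

// Matches <br>, <br/> or <br /> immediately followed by one closing inline tag.
const TRAILING_BR = new RegExp(`(<br\\s*\\/?>)(<\\/(?:${INLINE_TAGS})>)`, 'gi');

function hoistTrailingBreaks(html) {
  let previous;
  // Repeat until the string stops changing, so nested cases such as
  // <em><strong>x<br></strong></em> are handled one level per pass.
  do {
    previous = html;
    html = html.replace(TRAILING_BR, '$2$1');
  } while (html !== previous);
  return html;
}
```

The returned string can then be passed to turndownService.turndown() in place of the original HTML. Being regex-based, this is only as robust as the markup is regular.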

martincizek commented 8 months ago

Turndown generally doesn't support HTML-to-MD conversions where the HTML cannot be represented in MD. This way Turndown can remain quite fast, with straightforward rules.

@bensquire Yes, custom preprocessing is the right way to go, although I'd do it at the DOM level, see below.

@taythebot You're converting hard line breaks to soft breaks, i.e. breaks that are visible in the MD source but not rendered. That's OK, but probably not what most users want. :) You can achieve the same by setting options.br = ''.

I'd like to introduce a way for users to define easy preprocessing rules on top of the DOM. It'd be outside of the Turndown core project, but the preprocessed DOM can be passed directly to Turndown, so there won't be that much overhead.