mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License

Line break in formatting not respected #433

Closed vwkd closed 11 months ago

vwkd commented 1 year ago

Turndown seems to not respect line breaks inside formatting tags like in <strong><br></strong>.

For example

<span>foo</span><strong><br></strong><span>bar</span>
foobar

I'd expect it to generate the same as the following without formatting tags.

<span>foo</span><br><span>bar</span>
foo
bar

taythebot commented 1 year ago

I added <br> parsing with a rule myself

turndownService.addRule('br', {
    filter: 'br',
    replacement: () => '\n',
});

bensquire commented 11 months ago

Seeing the same issue, but a bit more complex:

<p>As a <strong>user<br /></strong>I want to</p>

Generates:

As a userI want to

When it should generate:

As a user I want to

We've worked around this by doing the following before the MD conversion:

.replace(/<br\s*\/?>(<\/(strong|em|s|code)>)/gi, '$1<br />')
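As a rough sketch, the workaround can be wrapped in a small function and run on the HTML string before it is handed to Turndown. The name `moveTrailingBreaksOut` is hypothetical, and two things here are assumptions beyond the original comment: the pattern accepts `<br>`, `<br/>` and `<br />`, and the replacement is repeated until nothing changes so that nested formatting tags and consecutive breaks are handled too.

```javascript
// Hypothetical helper: the name and the repeat-until-stable loop are
// illustrative assumptions, not part of Turndown.
function moveTrailingBreaksOut(html) {
  // <br>, <br/> or <br /> immediately before a closing formatting tag
  const pattern = /<br\s*\/?>(<\/(?:strong|em|s|code)>)/gi;
  let previous;
  do {
    previous = html;
    // Swap the order: closing tag first, then the line break
    html = html.replace(pattern, '$1<br />');
  } while (html !== previous);
  return html;
}

console.log(moveTrailingBreaksOut('<p>As a <strong>user<br /></strong>I want to</p>'));
// <p>As a <strong>user</strong><br />I want to</p>
```

Being regex-based, this only covers the exact shape shown (a break directly before the closing tag, with no whitespace or attributes in between); anything more irregular is probably better handled on a parsed DOM.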
martincizek commented 11 months ago

Turndown generally doesn't support HTML-to-MD conversions where the HTML cannot be represented in MD. This lets Turndown stay quite fast, with straightforward rules.

@bensquire Yes, custom preprocessing is the right way to go, although I'd do it at the DOM level, see below.

@taythebot You're converting hard line breaks to soft breaks, i.e. breaks that are visible in the MD source but not rendered. That is OK, but probably not what most users want. :) You can achieve the same by setting options.br = ''.
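To make the hard/soft distinction concrete: in CommonMark, a newline preceded by two or more spaces is a hard break (rendered as a line break), while a bare newline is a soft break (typically rendered as a space). The snippet below is only an illustration and does not call Turndown; `renderLineBreak` is a hypothetical stand-in that assumes the built-in line-break rule emits the `br` option followed by a newline, with two spaces as the default.

```javascript
// Assumption: stand-in for Turndown's default <br> handling, taken here to be
// "the br option, then a newline". This is not Turndown's actual code.
function renderLineBreak(brOption) {
  return brOption + '\n';
}

const hard = 'foo' + renderLineBreak('  ') + 'bar'; // assumed default: two spaces
const soft = 'foo' + renderLineBreak('') + 'bar';   // equivalent of br = ''

console.log(JSON.stringify(hard)); // "foo  \nbar"  -> hard break when rendered
console.log(JSON.stringify(soft)); // "foo\nbar"    -> soft break when rendered
```

With the real library, the same effect would presumably come from passing `{ br: '' }` as an option when constructing the service, rather than adding a custom rule.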

I'd like to introduce a way for users to define easy preprocessing rules on top of the DOM. It would live outside the Turndown core project, but the preprocessed DOM can be passed directly to Turndown, so there won't be much overhead.
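Until such a tool exists, DOM-level preprocessing for the cases in this thread could look roughly like the sketch below. It is not the preprocessing mechanism described above; the function name `hoistTrailingBreaks` and the tag list are assumptions. It relies only on standard DOM properties (`childNodes`, `lastChild`, `nextSibling`, `parentNode`, `insertBefore`) and assumes an HTML document, where `nodeName` is uppercase.

```javascript
// Assumed set of formatting tags; extend as needed.
const FORMATTING_TAGS = new Set(['STRONG', 'B', 'EM', 'I', 'S', 'CODE']);

// Hypothetical helper: moves <br> elements that end a formatting element
// to just after that element, so the break is no longer inside the markup.
function hoistTrailingBreaks(node) {
  // Process children first so nested formatting is handled inside-out.
  // Array.from takes a snapshot, since the tree is mutated while walking.
  for (const child of Array.from(node.childNodes)) {
    if (child.nodeType === 1) hoistTrailingBreaks(child); // 1 = element node
  }
  if (!FORMATTING_TAGS.has(node.nodeName) || !node.parentNode) return;
  // Move each trailing <br> directly after the formatting element.
  while (node.lastChild && node.lastChild.nodeName === 'BR') {
    node.parentNode.insertBefore(node.lastChild, node.nextSibling);
  }
}

// Usage sketch (browser or a DOM library; `turndownService` is assumed to be
// an existing TurndownService instance):
//   const doc = new DOMParser().parseFromString(html, 'text/html');
//   hoistTrailingBreaks(doc.body);
//   const markdown = turndownService.turndown(doc.body);
```

Working on the parsed tree avoids the fragility of matching markup with a regex, for example when the `<br>` carries attributes or is followed by whitespace.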