mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License
8.62k stars, 870 forks

Empty paragraph converting #388

Closed: Tilesto closed this issue 3 years ago

Tilesto commented 3 years ago

Empty paragraphs are converted incorrectly.

Example

Source code:

 <p>123</p>
 <p></p>

Expected result:

123

Actual result:

123

Code:

const turndownService = new TurndownService({
  blankReplacement(content, node) {
    const types = ['P'];
    if (types.indexOf(node.nodeName) !== -1) {
      return `\n\n`;
    }
  },
});

I don't know how to fix it. Why doesn't blankReplacement replace <p></p> with a new empty line? I tried returning \n, \r, \t, \u00A0, and ' ', all without success.

martincizek commented 3 years ago

This is because neighbouring whitespace is collapsed. Collapsing whitespace and moving it outside of delimiters is essential for producing correct CommonMark syntax (see "delimiter runs" in the spec).
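
To illustrate why whitespace has to be moved outside of delimiters: in CommonMark, `** bold **` is not strong emphasis, because the opening `**` is followed by whitespace and so is not a left-flanking delimiter run. A minimal sketch of the idea, using a hypothetical helper named `wrapStrong` (an illustration only, not Turndown's actual implementation):

```javascript
// Hypothetical helper, for illustration only (not part of Turndown's API).
// Wraps text in ** ... ** while keeping flanking whitespace OUTSIDE the
// delimiters, so that each delimiter touches a non-whitespace character.
function wrapStrong(text) {
  const core = text.trim();
  // Nothing but whitespace: emit it unchanged, without any delimiters.
  if (core === '') return text;
  const leading = text.match(/^\s*/)[0];
  const trailing = text.match(/\s*$/)[0];
  return `${leading}**${core}**${trailing}`;
}
```

With this sketch, `wrapStrong(' bold ')` yields `' **bold** '` rather than `'** bold **'`.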

<p></p> does not correspond to an empty line, because an empty line in Markdown does not produce <p></p>. In fact, there is no way to create <p></p> in CommonMark except by using raw HTML.

So if empty paragraphs are significant for you, you need to output them as HTML in your example:

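const turndownService = new TurndownService({
  blankReplacement(content, node) {
    const types = ['P'];
    if (types.indexOf(node.nodeName) !== -1) {
      // Keep empty paragraphs as raw HTML.
      return `<p></p>`;
    }
    // Other blank nodes: always return a string, never undefined.
    return node.isBlock ? '\n\n' : '';
  },
});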

... although consecutive empty paragraphs are usually just unwanted artefacts from some WYSIWYG input.

There is a ticket for preserving insignificant whitespace and soft breaks (#361), but the behaviour you suggest is that a significant empty paragraph in HTML should produce insignificant whitespace (not a paragraph) in Markdown. The more insignificant output we add, the more likely we are to break the Markdown syntax in certain cases. That's why we don't do it now. Update: it's not even a soft break, it's just ignored.

If you really want to convert empty paragraphs to insignificant whitespace, you can post-process the result of the above example and replace <p></p> with \n\n. At your own risk. :)
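
The post-processing step could look like the following sketch. The helper name `emptyParagraphsToBlankLines` and the sample string are hypothetical; the sample only stands in for whatever `turndownService.turndown(html)` returns with the `blankReplacement` above.

```javascript
// Hypothetical post-processing helper (not part of Turndown).
// Replaces every literal "<p></p>" in the converted Markdown with "\n\n".
// split/join is used here as an equivalent of replaceAll.
function emptyParagraphsToBlankLines(markdown) {
  return markdown.split('<p></p>').join('\n\n');
}

// Assumed sample input; real input would come from turndownService.turndown(html).
const sample = 'first\n\n<p></p>\n\nsecond';
const processed = emptyParagraphsToBlankLines(sample);
// processed === 'first\n\n\n\n\n\nsecond'
```

Note that this replaces the string everywhere, including inside code blocks that happen to contain `<p></p>`. CommonMark also treats several consecutive blank lines the same as one, so the extra lines only matter to a consumer that reads the raw text.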

Tilesto commented 3 years ago

Thanks a lot for the detailed answer! OK, got it :( I need to preserve empty paragraphs for my WYSIWYG editor (Tiptap/ProseMirror). I'll try your idea of replacing <p></p> with \n\n, but in a slightly different way:

blankReplacement(content, node) {
  const types = ['P'];
  if (types.indexOf(node.nodeName) !== -1) {
    return `EMPTY_PARAGRAPH`;
  }
  // Other blank nodes: always return a string, never undefined.
  return node.isBlock ? '\n\n' : '';
},

and then, as a post-processing step, call replaceAll('EMPTY_PARAGRAPH', '\n\n') on the result.

Thanks a lot for the answer and for the idea :)

martincizek commented 3 years ago

"But in a slightly different way"

Now you can't ever use "EMPTY_PARAGRAPH", so you can't even quote this conversation, which is sad :)
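
The problem with a fixed placeholder is that the document itself may contain that string. A small sketch of the collision, using a hypothetical `postProcess` function that mirrors the replaceAll step:

```javascript
// Hypothetical post-processing step using a fixed placeholder.
// split/join is used here as an equivalent of replaceAll.
function postProcess(markdown) {
  return markdown.split('EMPTY_PARAGRAPH').join('\n\n');
}

// Text that merely mentions the placeholder gets mangled as well.
const quoted = 'I return EMPTY_PARAGRAPH from blankReplacement.';
const mangled = postProcess(quoted);
// mangled === 'I return \n\n from blankReplacement.'
```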

I'd recommend using dynamic random placeholders. We use this code on another project:

function randomKey() {
  let result = '';
  for (let i = 0; i < 16; i++) {
    result += Math.floor(Math.random() * 16).toString(16);
  }
  return result;
}

... just call it at runtime, store the key in a variable, and use it both in blankReplacement() and in the post-processing replace.
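
Putting these pieces together might look like the sketch below. `createEmptyParagraphHandler` and `restore` are hypothetical names; `randomKey` is the function above, repeated so the block is self-contained. The fallback for non-paragraph nodes is assumed to mirror Turndown's default `blankReplacement`.

```javascript
function randomKey() {
  let result = '';
  for (let i = 0; i < 16; i++) {
    result += Math.floor(Math.random() * 16).toString(16);
  }
  return result;
}

// Hypothetical factory: generates one random placeholder at runtime and
// returns both the blankReplacement callback and the matching restore step.
function createEmptyParagraphHandler() {
  const key = randomKey();
  return {
    key,
    blankReplacement(content, node) {
      if (node.nodeName === 'P') return key;
      // Assumed to mirror Turndown's default handling of other blank nodes.
      return node.isBlock ? '\n\n' : '';
    },
    // Post-processing: swap every placeholder for a blank line.
    restore(markdown) {
      return markdown.split(key).join('\n\n');
    },
  };
}

// Intended usage (requires the turndown package, so left as a comment):
// const handler = createEmptyParagraphHandler();
// const service = new TurndownService({ blankReplacement: handler.blankReplacement });
// const markdown = handler.restore(service.turndown(html));
```

Because the key is captured in a closure rather than read from `this`, the callback keeps working when it is passed on its own as an option.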

Good luck!

(it's even possible that placeholders make their way to Turndown, but not sure for now)

Tilesto commented 3 years ago

@martincizek OK, thanks for one more idea :) But honestly, I didn't understand this:

Now you can't ever use "EMPTY_PARAGRAPH", so you can't even quote this conversation, which is sad :)

Why not? :D