mixmark-io / turndown

🛏 An HTML to Markdown converter written in JavaScript
https://mixmark-io.github.io/turndown
MIT License
8.52k stars 864 forks source link

Support for nested <ul>/<ol> tags without <li> (not technically valid) #445

Open dnotes opened 1 year ago

dnotes commented 1 year ago

I'm opening this pull request for your re-consideration of issue #232. My reasoning is this:

  1. Although this syntax is invalid HTML, it has de-facto support in all browsers.
  2. Some people are making this mistake in writing their HTML.
  3. Converting that HTML to Markdown changes the semantic meaning that is clear in the original.
  4. It wasn't that hard to fix.

Please see my comment on #232.

Note that this PR differs slightly from the code that I pasted there; I think it's unnecessary to "fix" the case with an empty <li> element (<li><ul>...</ul></li>) because the meaning is preserved in the resulting markdown.

alexander-turner commented 1 month ago

I support merging this PR. I had to check out the repo and cherry-pick these commits using gh pr checkout 445. It fixed my problem.

pavelhoral commented 1 month ago

I don't think "being simple / easy to fix" or "browsers are able to display that" is a valid argument here. The decision for this library (made long time ago) is that it supports valid HTML and nothing more. We don't want to break this rule as it might invite all sorts of non-standard hacks.

However we want to allow users to modify rules and extend the functionality if they want to do that. That is a long standing task (unfortunatelly with very little momentum behind it at the moment).

But various errors in the HTML can be easily fixed by simply pre-processing the DOM before passing it to Turndown. In that way you don't have to make Turndown do anything non-standard. I think that would be ideal solution for this usecase.

(side note - I am not a full maintainer here, I have simply fixed few things here and there, so my opinion is not binding)

dnotes commented 1 month ago

I can understand saying that this shouldn't be supported, and I'll respect if that's the decision because of performance or principles or whatever, and I don't want to be difficult...but the phrase "browsers are able to display that" drastically and unfairly understates the argument. This is a de-facto standard, not a hack. As far as I can tell, absolutely every browser displays this HTML in exactly the same way by default. The lists are also clearly nested in the code. Turndown corrupts the meaning of content that has been clearly and consistently conveyed by every browser since at least the days of IE8. Browser screenshots from BrowserStack below.

Screen Shot 2024-06-24 at 12 51 37 PM

I think this is as strongly as I can state the case; I'm gonna back out now. Cheers!

dnotes commented 1 month ago

Oh, and even if this is not committed, you shouldn't have to cherry-pick the commits or preprocess the DOM to fix this in your own code; since Turndown does allow you to add and replace the rules as needed, you should be able to copy/paste the rules for lists and list items from the Turndown code and make the modifications there. I'm posting some code from one project for reference, but you might have to change it for your uses. Hope this helps @alexander-turner so that you don't have to basically maintain your own fork of the entire project.

let toMd = new TurndownService({headingStyle: 'atx', emDelimiter: '*', codeBlockStyle: 'fenced'})

  toMd.addRule('list', {
    filter: ['ul', 'ol'],

    replacement: function (content, node) {
      var parent = node.parentNode

      // CHANGE: Allow for <ul> and <ol> that are improperly nested outside of <li>.
      if (node.parentNode.nodeName.match(/^(UL|OL)$/i)) {
        content = '    ' + content
          .replace(/^\n+/, '') // remove leading newlines
          .replace(/\n+$/, '\n') // replace trailing newlines with just a single one
          .replace(/\n/gm, '\n    ') // indent
      }

      if (parent.nodeName === 'LI' && parent.lastElementChild === node) {
        return '\n' + content
      } else {
        return '\n\n' + content + '\n\n'
      }
    }
  })
 .addRule('listItem', {
    filter: 'li',

    replacement: function (content, node, options) {
      content = content
        .replace(/^\n+/, '') // remove leading newlines
        .replace(/\n+$/, '\n') // replace trailing newlines with just a single one
        .replace(/\n/gm, '\n    ') // indent

      var prefix = options.bulletListMarker + '   '
      var parent = node.parentNode

      // CHANGE: Don't output two markers if there is an empty li.
      if (node.children.length === 1 && node.children[0].nodeName.match(/^(UL|OL)$/i) && node.textContent === node.children[0].textContent ) prefix = '    '

      else if (parent.nodeName === 'OL') {
        var start = parent.getAttribute('start')
        // CHANGE: When <ol> is improperly nested outside of <li>, get the numbering correct.
        var index = Array.prototype.indexOf.call(Array.prototype.filter.call(parent.children, el => el.nodeName === 'LI'), node)
        prefix = (start ? Number(start) + index : index + 1) + '.  '
      }

      return (
        prefix + content + (node.nextSibling && !/\n$/.test(content) ? '\n' : '')
      )
    }
  })