matthewwithanm / python-markdownify

Convert HTML to Markdown
MIT License

Indent before HTML block elements causes indent in Markdown output #98

Status: Closed (chrispy-snps closed this issue 3 days ago)

chrispy-snps commented 1 year ago

In our HTML, block elements are indented:

<html>
  <body>
    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit,
      sed do eiusmod tempor incididunt ut labore et dolore magna
      aliqua. Ut enim ad minim veniam, quis nostrud exercitation
      ullamco laboris nisi ut aliquip ex ea commodo consequat.
    </p>
  </body>
</html>

When HTML with indented block elements is converted, the indent causes incorrect formatting in the output.

Converting this indented <p> element:

from markdownify import markdownify as md

print(repr(md("""\
  <p>This is
     some text.</p>
""")))

produces this:

' This is\n some text.\n\n\n'
 ^       ^^^

It happens for non-<p> elements too. Converting these indented <h1> elements with the UNDERLINED and ATX heading formats:

print(repr(md("""\
    <h1>Title</h1>
""")))

print(repr(md("""\
    <h1>Title</h1>
""", heading_style="ATX")))

produces this:

' Title\n=====\n\n\n'
 ^

' # Title\n\n\n'
 ^

As a workaround, we iterate over every text node inside each text-containing block element (<p>, <entry>, <li>, etc.) and convert newlines to spaces, but this is expensive on large document sets.
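For reference, a minimal sketch of that kind of workaround, not our exact code. The names `collapse_newlines`, `flatten_block_text`, and `BLOCK_TAGS` are hypothetical, and folding the indentation around each newline into a single space is an assumption that goes slightly beyond "convert newlines to spaces". The tree is assumed to be a BeautifulSoup object, and only its `find_all()` and `replace_with()` methods are used.

```python
import re

# Assumption: a newline plus the indentation around it inside a block
# element is insignificant and can be folded into a single space.
_NEWLINE_RUN = re.compile(r"[ \t]*\n\s*")

# Hypothetical default; extend it with the block tags your documents use.
BLOCK_TAGS = ("p", "entry", "li")


def collapse_newlines(text):
    """Replace each newline (and the indentation around it) with one space."""
    return _NEWLINE_RUN.sub(" ", text)


def flatten_block_text(soup, block_tags=BLOCK_TAGS):
    """Rewrite text nodes in place under the given block elements.

    `soup` is expected to be a BeautifulSoup tree (or Tag).
    """
    for element in soup.find_all(list(block_tags)):
        for node in element.find_all(string=True):
            # Comment, CData, etc. subclass NavigableString; leave them alone.
            if type(node).__name__ != "NavigableString":
                continue
            flattened = collapse_newlines(node)
            if flattened != node:
                node.replace_with(flattened)
    return soup
```

A typical call would parse the HTML with `BeautifulSoup(html, "html.parser")`, pass the tree to `flatten_block_text()`, and then hand `str(soup)` to `md()`. Every text node under every listed element is visited, which is where the cost on large document sets comes from.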

Possibly related to #31.

chrispy-snps commented 10 months ago

This seems to be a duplicate of issue #96.

mirabilos commented 10 months ago

Or rather #88, perhaps.