mwilliamson / python-mammoth

Convert Word documents (.docx files) to HTML
BSD 2-Clause "Simplified" License

Result of the html conversion to multiline #124

Closed by AndreaBiondaro 2 years ago

AndreaBiondaro commented 2 years ago

It would be nice if the result of the HTML conversion were split across several lines instead of being emitted as a single line, to make the output more readable.

At the moment the result is:

<p>Lorem ipsum </p><h1>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio. </h1>

After the change it would be:

<p>Lorem ipsum </p>
<h1>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio. </h1>

To introduce this behavior, it is probably enough to modify the "as_string" method of the HtmlWriter class so that it adds a line terminator when the strings are concatenated.

https://github.com/mwilliamson/python-mammoth/blob/9b71a403e748db28b24637164cfa1155ad078396/mammoth/writers/html.py#L28-L29
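
To make the proposal concrete, here is a minimal sketch of the idea. ToyHtmlWriter and its element method are hypothetical stand-ins written for illustration, not Mammoth's actual code. The sketch assumes each stored fragment is one complete element and does no escaping.

```python
class ToyHtmlWriter:
    """Hypothetical stand-in for an HTML writer; not Mammoth's actual class."""

    def __init__(self):
        self._fragments = []

    def element(self, name, text):
        # Assumption of this sketch: one complete element per fragment,
        # and the text is already safe to embed (no escaping is done).
        self._fragments.append("<{0}>{1}</{0}>".format(name, text))

    def as_string(self, separator=""):
        # separator="" gives single-line output; "\n" gives one element per line.
        return separator.join(self._fragments)


writer = ToyHtmlWriter()
writer.element("p", "Lorem ipsum ")
writer.element("h1", "Lorem ipsum dolor sit amet.")

single_line = writer.as_string()
multi_line = writer.as_string("\n")
```

If a real writer stores finer-grained fragments (for example, opening tags, text, and closing tags separately), joining with a newline would also put line breaks inside elements, so the separator approach only works under the one-element-per-fragment assumption.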

mwilliamson commented 2 years ago

This is tricky for a couple of reasons. Firstly, different people have different preferences for how HTML should be formatted. Secondly, newlines are semantically significant in HTML, so adding them may change the rendered output. My suggestion would be to use another library to format the HTML that Mammoth produces in your preferred style.
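
As one possible way to follow that suggestion without touching Mammoth itself, the sketch below post-processes the HTML string with the standard library only. The break_after_blocks function and the BLOCK_TAGS list are hypothetical names chosen for this example. The sketch inserts a newline only after closing block-level tags that are immediately followed by another tag, which is normally safe because whitespace between block elements is not rendered. Inline elements are left untouched.

```python
import re

# Block-level tags after which a line break is normally safe to add.
# This list is an assumption of the sketch; extend it to suit your documents.
BLOCK_TAGS = ("p", "h1", "h2", "h3", "h4", "h5", "h6",
              "ul", "ol", "li", "table", "tr")

# Match a closing block tag that is directly followed by another tag.
_CLOSING_BLOCK_TAG = re.compile(r"(</(?:%s)>)(?=<)" % "|".join(BLOCK_TAGS))


def break_after_blocks(html):
    """Return html with a newline inserted after each matched closing block tag."""
    return _CLOSING_BLOCK_TAG.sub(r"\1\n", html)


source = "<p>Lorem ipsum </p><h1>Lorem ipsum dolor sit amet. </h1>"
formatted = break_after_blocks(source)
```

With Mammoth, the input string would be the value attribute of the result returned by mammoth.convert_to_html. A regular expression is not a full HTML parser, so treat this as a starting point for simple output rather than a general-purpose formatter.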