html-to-text / node-html-to-text

Advanced html to text converter

h1 and P are conflicting #298

Closed chantorak closed 1 year ago

chantorak commented 1 year ago

Minimal HTML example

-

Options

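const { convert } = require('html-to-text');

// `content` is the page's HTML as a string (not included in this report).
const text = convert(content, {
    // Only convert the matched elements, not the whole document.
    baseElements: {
        selectors: ["p", "h1"]
    }
});
console.log(text);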

Observed output

HEALTH.
WELLNESS.
HAPPINESS.

Expected output

The output should include the contents of both the p and the h1 elements.

Version information


If I remove the h1 selector, the output includes the p content. Somehow the h1 selector is conflicting with the p selector.

KillyMXI commented 1 year ago

That's not a minimal HTML example. This is:

const { htmlToText } = require('html-to-text');

const html = `
<h1>heading</h1>
<p>paragraph</p>
<h1>heading</h1>
<p>paragraph</p>
<h1>heading</h1>
<p>paragraph</p>`;

// Convert only the <p> and <h1> elements.
const options = { baseElements: { selectors: ['p', 'h1'] } };

const text = htmlToText(html, options);
console.log(text);

Output:

paragraph

paragraph

paragraph

HEADING

HEADING

HEADING
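
Both element types are present in the output above; the paragraphs come first because, by default, base elements are arranged in the order of the selectors array. As a minimal sketch, the library's baseElements.orderBy option can be set to 'occurrence' to keep document order instead. The variable name orderedOptions is only illustrative.

```javascript
// Options sketch: keep matched elements in the order they appear in the document.
// The default for orderBy is 'selectors', which groups output by selector order.
const orderedOptions = {
  baseElements: {
    selectors: ['p', 'h1'],
    orderBy: 'occurrence',
  },
};

// Intended use (requires the html-to-text package):
//   const { convert } = require('html-to-text');
//   console.log(convert(html, orderedOptions));
console.log(orderedOptions.baseElements.orderBy);
```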

Front page of https://nutritionhappiness.com/ contains one <h1> heading with the content you've provided and one <p> paragraph which is empty.

<h1 class="t677__title t-title t-title_xs " field="title" style="font-size:66px;"><div style="line-height:68px;" data-customstyle="yes"><i>Health.<br>Wellness.<br>Happiness.<strong></strong></i></div></h1>
<p class="gm-style-mot"></p>

I'm not sure what content you observe when you remove h1 and keep only p in selectors. If no base elements are found, the default value of baseElements.returnDomByDefault (true) causes the entire page to be processed. But here the p selector does match one empty paragraph, so the output should be empty. You are probably doing something wrong and I can't tell you what exactly.
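
To make a non-matching selector visible rather than silently falling back to the whole page, the fallback can be switched off. This is a sketch of the options only; strictOptions is an illustrative name, and the expectation that a page with no matches then converts to empty text is an assumption based on the option's documented meaning.

```javascript
// Options sketch: do not fall back to the whole document when nothing matches.
// With returnDomByDefault left at its default (true), a selector list that matches
// nothing results in the entire page being converted.
const strictOptions = {
  baseElements: {
    selectors: ['p'],
    returnDomByDefault: false,
  },
};

// Intended use (requires the html-to-text package):
//   const { convert } = require('html-to-text');
//   const text = convert(html, strictOptions);
//   if (text.trim() === '') console.warn('No non-empty <p> content found');
console.log(strictOptions.baseElements.returnDomByDefault);
```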

Typical cause of issues like this - wrong idea about actual input HTML content.
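
One quick way to check that assumption is to count which tags the input actually contains before choosing selectors. The helper below is a rough sketch, not part of html-to-text: countTags is a hypothetical name, it relies on a regular expression rather than a real HTML parser, and the sample string is a simplified stand-in for the page with a made-up div holding body text.

```javascript
// Rough diagnostic: count opening tags by name in an HTML string.
// A regex is not an HTML parser; treat the result as a sanity check only.
function countTags(html, tagNames) {
  const counts = {};
  for (const name of tagNames) {
    // Require whitespace, "/" or ">" after the name so "p" does not match "<pre>".
    const pattern = new RegExp(`<${name}(?=[\\s/>])`, 'gi');
    counts[name] = (html.match(pattern) || []).length;
  }
  return counts;
}

// Simplified, partly hypothetical sample: one heading, one empty paragraph,
// and body text wrapped in a <div>.
const sample =
  '<h1><div><i>Health.<br>Wellness.<br>Happiness.</i></div></h1>' +
  '<p class="gm-style-mot"></p>' +
  '<div>Body text</div>';

const tagCounts = countTags(sample, ['h1', 'p', 'div']);
console.log(tagCounts); // { h1: 1, p: 1, div: 2 }
```

A count of one for p says nothing about whether that paragraph has text, so inspecting the matched elements themselves is still needed.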

chantorak commented 1 year ago

Thanks for the reply. There don't seem to be any p tags with content; the text is wrapped in div elements.