html-to-text / node-html-to-text

Advanced html to text converter

Scraper options selector skip doesnt work #297

Closed cbsa100 closed 1 year ago

cbsa100 commented 1 year ago

Options


```js
const options = {
  wordwrap: null,
  selectors: [
    { selector: 'a', options: { ignoreHref: true } },
    { selector: 'img', format: 'skip' },
    { selector: 'nav', format: 'skip' },
    { selector: 'header', format: 'skip' },
    { selector: 'footer', format: 'skip' },
    { selector: '*[data-elementor-type=footer]', format: 'skip' },
    { selector: '*[data-elementor-type=header]', format: 'skip' },
  ],
};
```

**Version information**

    "html-to-text": "^9.0.5",
    "next": "^13.4.8",

----

When scraping a webpage, I try to remove the header, footer, images, navs, and links so that only the text remains.
However, for some reason the footer text still appears in the result.
I tried this on Elementor sites (with and without the data attributes) and on non-Elementor sites (with the `footer` tag), and also with and without the asterisk before the data attribute.

KillyMXI commented 1 year ago

```js
const { htmlToText } = require('html-to-text');

// Minimal input covering every element the selectors are meant to skip.
const html = `
<header>header</header>
<div data-elementor-type="header">elementor type header</div>
<p>paragraph</p>
<div data-elementor-type="footer">elementor type footer</div>
<footer>footer</footer>`;

const options = {
  wordwrap: null,
  selectors: [
    { selector: 'a', options: { ignoreHref: true } },
    { selector: 'img', format: 'skip' },
    { selector: 'nav', format: 'skip' },
    { selector: 'header', format: 'skip' },
    { selector: 'footer', format: 'skip' },
    { selector: '*[data-elementor-type=footer]', format: 'skip' },
    { selector: '*[data-elementor-type=header]', format: 'skip' },
  ],
};
const text = htmlToText(html, options);
console.log(text);
```

Outputs only

paragraph

Start reducing your issue to a minimal example to find out what might be wrong in your case.

KillyMXI commented 1 year ago

With no follow-up, I consider this resolved.

Most likely cause: unexpected input HTML, and insufficient attention to what the input HTML actually contains and which options are actually used.
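
One way to check what the input HTML actually contains is to count the relevant opening tags in the fetched markup before converting it. The snippet below is only a rough sketch in plain JavaScript: `countOpeningTags` and `sampleHtml` are hypothetical names made up for illustration, they are not part of html-to-text, and the regex scan is a hint rather than a real parse.

```js
// Rough diagnostic: count how often each tag name opens in the raw HTML.
// This is a regex scan, not a real parser, so treat the counts as a hint only.
function countOpeningTags(html, tagNames) {
  const counts = {};
  for (const name of tagNames) {
    // Require whitespace, "/" or ">" after the name so that, for example,
    // a search for "head" does not also count "<header>".
    const pattern = new RegExp(`<${name}(?=[\\s/>])`, 'gi');
    counts[name] = (html.match(pattern) || []).length;
  }
  return counts;
}

// Hypothetical page whose footer is a plain div rather than a <footer> element,
// so a 'footer' selector would have nothing to skip.
const sampleHtml = `
<header>header</header>
<p>paragraph</p>
<div class="site-footer">footer text</div>`;

const counts = countOpeningTags(sampleHtml, ['header', 'footer', 'nav', 'img']);
console.log(counts);
```

If a tag you expect to skip shows a count of zero, the footer text is probably coming from a differently named element, and the selector list needs to target that element instead.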