taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

Can't select some element. #203

Closed: Tajmirul closed this issue 1 year ago

Tajmirul commented 2 years ago

I am trying to fetch the title and description of a Vimeo video. I fetched the HTML successfully, but I can't select the description div.

Here is the code:

const { videoUrl } = req.body;
const vimeoResponse = await fetch(videoUrl);
const vimeoResponseTxt = await vimeoResponse.text();
const vimeoHtml = parse(vimeoResponseTxt);
// Quote the attribute value: it contains a colon, so it is not a valid unquoted CSS identifier
const title = vimeoHtml.querySelector('meta[property="og:title"]').getAttribute('content');
const description = vimeoHtml.innerHTML;

// Dump the parsed HTML to disk for inspection; log only when the write fails
fs.writeFile('vimeo-video.html', description, error => {
    if (error) console.log(error);
});

This code retrieves the HTML, which contains a div with the class description-wrapper.

                  <div class="clip_details-description description-wrapper iris_desc">
                    <p class="first">Country music legend, Trish Cotton, has something to say.</p>
                    <p>
                      Written by Kyle Kasabian (@kylekasabian) <br />
                      Directed by Derek Mari (@directorderek)<br />
                      Director of Photography: Peter Mickelsen<br />
                      Produced by Derek Mari and Kyle Kasabian<br />
                      Edited by Derek Mari
                    </p>
                    <p>Starring: Alyssa Sabo, Janine Hogan, and Kyle Kasabian</p>
                    <p>
                        Assistant Camera: Casey Schoch<br />
                        Production Sound: David Alvarez<br />
                        Production Assistant: Keith Ahlstrom
                    </p>
                    <p>Music by Morgan Matthews</p>
                    <p>
                      Blink &amp; Miss Productions<br />
                      Bad Cat Films
                    </p>
                  </div>
                </div>

But when I try to select the div with querySelector, it returns null.

const description = vimeoHtml.querySelector('.description-wrapper');
console.log(description); // null
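
Because querySelector returns null when nothing matches, chaining a call such as .getAttribute('content') directly onto its result throws a TypeError on a miss. A minimal sketch of a guard, assuming only that the root object exposes querySelector and that a matched node exposes getAttribute; attrOrDefault is a hypothetical helper name, not part of node-html-parser:

```javascript
// Hypothetical helper (not a library function): read an attribute from the
// first element matching `selector`, or return `fallback` when either the
// element or the attribute is missing.
function attrOrDefault(root, selector, attr, fallback = null) {
  const node = root.querySelector(selector);
  if (node === null || node === undefined) {
    return fallback; // no element matched the selector
  }
  const value = node.getAttribute(attr);
  // Treat a missing attribute (null or undefined) the same as a missing element
  return value ?? fallback;
}
```

With the parsed document from the snippet above, a call might look like attrOrDefault(vimeoHtml, 'meta[property="og:title"]', 'content', ''), which yields an empty string instead of throwing when the tag is absent.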

taoqf commented 2 years ago

I'm afraid I could not find where the problem is.

const vimeoHtml = parse(`<div class="clip_details-description description-wrapper iris_desc">
                    <p class="first">Country music legend, Trish Cotton, has something to say.</p>
                    <p>
                      Written by Kyle Kasabian (@kylekasabian) <br />
                      Directed by Derek Mari (@directorderek)<br />
                      Director of Photography: Peter Mickelsen<br />
                      Produced by Derek Mari and Kyle Kasabian<br />
                      Edited by Derek Mari
                    </p>
                    <p>Starring: Alyssa Sabo, Janine Hogan, and Kyle Kasabian</p>
                    <p>
                        Assistant Camera: Casey Schoch<br />
                        Production Sound: David Alvarez<br />
                        Production Assistant: Keith Ahlstrom
                    </p>
                    <p>Music by Morgan Matthews</p>
                    <p>
                      Blink &amp; Miss Productions<br />
                      Bad Cat Films
                    </p>
                  </div>
                </div>`);
const description = vimeoHtml.querySelector('.description-wrapper');
// The selector matches here: the element found is the description div
description.getAttribute('class').should.eql('clip_details-description description-wrapper iris_desc');

wolfie commented 1 year ago

I'm not sure if this is exactly related, but this outputs "null" for me with node-html-parser@6.1.4 on Node.js 17.4.0:

import { parse } from "node-html-parser";
console.log(
  parse(
    `<html><body><pre><code class="language-typescript">type Foo = { foo: 'bar' }</code></pre></body></html>`
  ).querySelector("code")
);

wolfie commented 1 year ago

It seems the bug is in how the PRE tag is handled: the parser assumes it can't have child elements, so its contents end up as a single raw text node:

import { parse } from "node-html-parser";

const convert = root => ({
  tag: root.tagName,
  textContent: root.textContent,
  children: [...root.childNodes].map(convert),
});

const tree = convert(
  parse(`<html><body><pre><code class="language-typescript">type Foo = { foo: 'bar' }</code></pre></body></html>`)
);

console.log(JSON.stringify(tree, null, 2));

This outputs:

{
  "tag": null,
  "textContent": "<code class=\"language-typescript\">type Foo = { foo: 'bar' }</code>",
  "children": [
    {
      "tag": "HTML",
      "textContent": "<code class=\"language-typescript\">type Foo = { foo: 'bar' }</code>",
      "children": [
        {
          "tag": "BODY",
          "textContent": "<code class=\"language-typescript\">type Foo = { foo: 'bar' }</code>",
          "children": [
            {
              "tag": "PRE",
              "textContent": "<code class=\"language-typescript\">type Foo = { foo: 'bar' }</code>",
              "children": [
                {
                  "textContent": "<code class=\"language-typescript\">type Foo = { foo: 'bar' }</code>",
                  "children": []
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

taoqf commented 1 year ago

@wolfie try this:

parse(html, {
    blockTextElements: {
        script: true,
        noscript: true,
        style: true,
    }
});
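
The options above appear to work because pre is normally one of the block text elements, whose contents the parser keeps as raw text instead of parsing into child nodes. That matches the PRE output earlier in the thread, and a blockTextElements map without pre replaces that default. A minimal sketch of building such a map, assuming the defaults listed in the node-html-parser README (script, noscript, style, pre); defaultBlockTextElements and withoutBlockText are hypothetical names, not part of the library:

```javascript
// Assumed defaults, copied from the node-html-parser README;
// verify them against the version you have installed.
const defaultBlockTextElements = { script: true, noscript: true, style: true, pre: true };

// Hypothetical helper: return a copy of the block-text map without the given
// tags, so the parser would treat those tags as ordinary elements with children.
function withoutBlockText(defaults, ...tags) {
  const result = { ...defaults }; // shallow copy, the defaults stay untouched
  for (const tag of tags) {
    delete result[tag];
  }
  return result;
}

// Equivalent to the options in the comment above: `pre` is no longer listed.
const options = { blockTextElements: withoutBlockText(defaultBlockTextElements, "pre") };
```

Passing options as the second argument to parse should then let querySelector("code") find the element nested inside the pre tag.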