taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

querySelectorAll nested search not working #141

Closed AlenToma closed 2 years ago

AlenToma commented 3 years ago

I am trying to use the selector section > .column, but I am not getting any results.

HTML

<section>
<section>
<div class="column"></div>
</section>
</section>

taoqf commented 3 years ago

I ran a test on this, and it seems to work fine.

const { parse } = require('../dist');

describe('queryselector', function () {
    it('should query one node', function () {
        const content = `<section>
<section>
<div class="column">foo</div>
</section>
</section>`;
        const root = parse(content);
        const div = root.querySelector('section > .column');
        div.innerHTML.should.eql('foo');
        const list = root.querySelectorAll('section > .column');
        list.length.should.eql(1);
        const div2 = list[0];
        div2.should.eql(div);
    });
});

AlenToma commented 3 years ago

Hmm, could it be that I am using an older version? I am using version 2.2.1. This is the code I am using:

import { parse } from 'node-html-parser';
export default class httpClient {
  static async getHtml(
    url: string
  ): Promise<HTMLDivElement> {
    console.log(`Sending html request to ${url}`);

    var container = parse('<div>test</div>') as any;
    try {
      let headers = new Headers({
        Accept: '*/*',
        'User-Agent':
          'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
      });

      var data = await httpClient.fetchWithTimeout(url, {
        timeout: 30000,
        headers: headers,
        method: 'GET'
      });
      if (data.status === 1020) {
        const message = `An error has occurred: ${data.status}`;
        console.log(message);
      }
      else
        if (!data.ok) {
          const message = `An error has occurred: ${data.status}`;
          console.log(message);
        } else {
          console.log('Data is ok. proceed to parse it');
          var html = await data.text();
          html = html.replace(/<!DOCTYPE html>/g, "");
          container = parse('<div>' + html + '</div>');
          console.log("Data has been parsed");
        }
    } catch (e) {
      console.log(e);
    }
    return container;
  }

}

And then I execute:

var container = await HttpClient.getHtml(url);
var items = Array.from(container.querySelectorAll("section > .column")); // returns no results here

If I do this instead, I get results, but I also get unwanted results that I am not interested in:

var items = Array.from(container.querySelectorAll("section .column"))

I am using this in a React Native project.
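
For reference, the two selectors above differ only in their combinator: section > .column matches a .column element whose direct parent is a section, while section .column matches a .column element with a section anywhere among its ancestors. The following is a minimal sketch of that difference on a hand-built tree. It does not use node-html-parser, and the helpers node, walk, matchChild and matchDescendant are hypothetical names introduced only for illustration.

```javascript
// Hypothetical plain-object tree; this is NOT node-html-parser's data model.
function node(tag, classes = [], children = []) {
  const n = { tag, classes, children, parent: null };
  for (const child of children) child.parent = n;
  return n;
}

// Collect every node in document order (depth-first).
function walk(root, out = []) {
  out.push(root);
  for (const child of root.children) walk(child, out);
  return out;
}

// Child combinator, "parentTag > .className": the direct parent must match.
function matchChild(root, parentTag, className) {
  return walk(root).filter(
    (n) =>
      n.classes.includes(className) &&
      n.parent !== null &&
      n.parent.tag === parentTag
  );
}

// Descendant combinator, "ancestorTag .className": any ancestor may match.
function matchDescendant(root, ancestorTag, className) {
  return walk(root).filter((n) => {
    if (!n.classes.includes(className)) return false;
    for (let p = n.parent; p !== null; p = p.parent) {
      if (p.tag === ancestorTag) return true;
    }
    return false;
  });
}

// Equivalent markup:
// <section><section>
//   <div class="column"></div>
//   <p><span class="column"></span></p>
// </section></section>
const direct = node('div', ['column']);
const nested = node('span', ['column']);
const tree = node('section', [], [
  node('section', [], [direct, node('p', [], [nested])]),
]);

const childMatches = matchChild(tree, 'section', 'column');
const descendantMatches = matchDescendant(tree, 'section', 'column');
```

In this sketch childMatches contains only the div, while descendantMatches also contains the span nested inside the p, which is the kind of extra match the descendant selector produces.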

AlenToma commented 3 years ago

OK, I just saw this note:

Note: Full css3 selector supported since v3.0.0.

I will close this issue and upgrade to the latest version.

aandis commented 2 years ago

@taoqf I'm seeing this in the latest version, 5.4.2-0. I get empty results for .querySelectorAll("a a") or .querySelectorAll("a > a") even when such elements exist in the input.

aandis commented 2 years ago

Ah, I see what's happening. A DOM like

<a href="/test">Lorem Ipsum <a href="/foo"><span>bar</span></a></a>

is being changed to

<a href="/test">Lorem Ipsum </a> <a href="/foo"><span>bar</span></a>

on parsing.
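
To make the rewrite described above concrete, here is a minimal sketch that closes an open <a> whenever another <a> starts and drops the closing tag that is then left over. It is not node-html-parser's code: closeNestedAnchors is a hypothetical helper, and the regular expression is a simplification that assumes attribute values contain no ">" character.

```javascript
// Hypothetical helper, not part of node-html-parser. It only illustrates
// the "terminate the open <a> before starting a new one" behaviour.
function closeNestedAnchors(html) {
  let anchorOpen = false;

  // Match opening <a ...> tags and closing </a> tags; all other markup
  // (including tags such as <abbr>) is left untouched.
  return html.replace(/<a(\s[^>]*)?>|<\/a>/gi, (tag) => {
    const isClosing = tag.startsWith('</');

    if (!isClosing) {
      // A new anchor starts: close the previous one first if it is still open.
      const prefix = anchorOpen ? '</a>' : '';
      anchorOpen = true;
      return prefix + tag;
    }

    if (anchorOpen) {
      anchorOpen = false;
      return tag;
    }

    // A closing tag with no open anchor is a leftover; drop it.
    return '';
  });
}

const input =
  '<a href="/test">Lorem Ipsum <a href="/foo"><span>bar</span></a></a>';
const flattened = closeNestedAnchors(input);
```

After this rewrite the two anchors are siblings rather than parent and child, which is why selectors such as "a a" and "a > a" have nothing to match.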

aandis commented 2 years ago

I wouldn't expect HTMLParser.parse to change the structure of my input DOM. Sounds like a core bug.

taoqf commented 2 years ago

Yes, this is related to https://github.com/taoqf/node-html-parser/issues/144.

nonara commented 2 years ago

Hi all! I believe nested <a> tags (an anchor inside another anchor) are invalid HTML.

If memory serves, we handle this the standard way that other parsers and browsers do: by terminating the open tag before the nested one starts, which would be considered proper behaviour.

I will confirm tomorrow if that is correct and follow up.

nonara commented 2 years ago

@aandis Fixed in the latest v6.0.0