taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

issue finding href of <a> tag #242

Closed psquared-dev closed 1 year ago

psquared-dev commented 1 year ago

Here is the code:

  const root = parse(html);
  const links = root.querySelectorAll("a");

  for (const a of links) {
    console.log(a.rawAttrs);
  }

a.rawAttrs returns 'href="/" rel="home"' but a.getAttribute("href") returns undefined.

Also a.attrs always returns an empty object {}.
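
As a stopgap while getAttribute returns undefined, the string from rawAttrs can be split by hand. The sketch below is only an illustration: parseRawAttrs is a hypothetical helper, not part of node-html-parser, and it assumes simple attribute syntax (double-quoted, single-quoted, or bare values) without decoding HTML entities.

```javascript
// Hypothetical helper (not a node-html-parser API): turn a raw attribute
// string such as 'href="/" rel="home"' into a plain object.
function parseRawAttrs(rawAttrs) {
  const attrs = {};
  // An attribute name, optionally followed by "=" and a double-quoted,
  // single-quoted, or unquoted value.
  const pattern = /([^\s=\/>"']+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)))?/g;
  let match;
  while ((match = pattern.exec(rawAttrs)) !== null) {
    const [, name, doubleQuoted, singleQuoted, bare] = match;
    // Attributes without a value (e.g. "disabled") map to an empty string.
    attrs[name.toLowerCase()] = doubleQuoted ?? singleQuoted ?? bare ?? "";
  }
  return attrs;
}

const attrs = parseRawAttrs('href="/" rel="home"');
console.log(attrs.href); // "/"
console.log(attrs.rel);  // "home"
```

Inside the loop from the snippet above this would be called as parseRawAttrs(a.rawAttrs).href.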

Ionys320 commented 1 year ago

Hi, if you replace querySelectorAll with getElementsByTagName, you'll be able to get the href using .getAttribute("href").

const root = parse(html);
const links = root.getElementsByTagName("a");

for (const a of links) {
    console.log(a.rawAttrs); 
    console.log(a.getAttribute("href"));
}

excelsior091224 commented 1 year ago

I have the same issue. I tried to extract the <code> contained in the <pre> as follows, but what I got back was an empty list. However I look at it, I cannot get the <code>.

<pre>
  <code>test</code>
</pre>

test code

// test
const root = parse(data.content);
const pre_list = root.getElementsByTagName("pre");
pre_list.map((pre) => {
  console.log("pre:"+pre);
});
const pre_code = root.getElementsByTagName("pre code");
console.log("pre_code:"+pre_code);
pre_list.map((pre) => {
  const code = pre.getElementsByTagName("code");
  console.log("code:"+code);
});

result

// pre_list
// 1st <pre>
BlogPreview.tsx:33 pre:<pre><code class="language-typescript">
// omission
// 2nd <pre>
BlogPreview.tsx:33 pre:<pre><code class="language-typescript">  public async getBlogs(queries?: MicroCMSQueries) {
// omission
// 3rd <pre>
BlogPreview.tsx:33 pre:<pre><code>---
// omission
// 4th <pre>
BlogPreview.tsx:33 pre:<pre><code class="language-typescript">import { Cache, CacheContainer } from &quot;node-ts-cache&quot;;
// omission
// 5th <pre>
BlogPreview.tsx:33 pre:<pre><code class="language-json">{
// omission
// pre_code
BlogPreview.tsx:36 pre_code:
// code
(5) BlogPreview.tsx:39 code:

taoqf commented 1 year ago

@excelsior091224 I'm afraid this is a different issue. In your case, you should just pass an option to parse:

const root = parse(html, {
    blockTextElements: {
        script: true,
        noscript: true,
        style: true,
    }
});
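
The object above lists script, noscript, and style but leaves out pre. Assuming the defaults described in the README (script, noscript, style, and pre all set to true), the same options object can be derived instead of typed out. This is only a sketch: blockTextElementsWithout and DEFAULT_BLOCK_TEXT_ELEMENTS are hypothetical names, not library APIs, and the default list is an assumption taken from the README.

```javascript
// Assumed defaults, copied from the README's description of blockTextElements.
const DEFAULT_BLOCK_TEXT_ELEMENTS = {
  script: true,
  noscript: true,
  style: true,
  pre: true,
};

// Hypothetical helper: copy the assumed defaults and drop the tags whose
// children should be parsed as elements instead of kept as raw text.
function blockTextElementsWithout(...tags) {
  const result = { ...DEFAULT_BLOCK_TEXT_ELEMENTS };
  for (const tag of tags) {
    delete result[tag.toLowerCase()];
  }
  return result;
}

const options = { blockTextElements: blockTextElementsWithout("pre") };
console.log(options);
// { blockTextElements: { script: true, noscript: true, style: true } }

// Intended use (requires node-html-parser):
//   const root = parse(html, options);
//   const codes = root.querySelectorAll("pre code");
```

Omitting the key, instead of setting pre to false, mirrors the snippet above and does not rely on how false values are interpreted.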

devansh-sharma-tw commented 1 year ago

@taoqf, this commit (released in v6.1.7 and onwards) breaks the earlier behavior of ignoring the text content of specific tags by setting them to false in blockTextElements, which seems unintended to me.

- This was the behavior before this commit:

console.log(parse(htmlString, { blockTextElements: { script: false } }).text) // Output: sample text inside tags

console.log(parse(htmlString, { blockTextElements: { script: true } }).text) // Output: sample text inside tags text inside script

This matches the behavior explained in the [README](https://github.com/taoqf/node-html-parser#parsedata-options) as well.

- This is the behavior after this commit (running `v6.1.7-v6.1.9`):

const htmlString = "sample text inside tags "

console.log(parse(htmlString, { blockTextElements: { script: false } }).text) // Output: sample text inside tags text inside script

console.log(parse(htmlString, { blockTextElements: { script: true } }).text) // Output: sample text inside tags text inside script

Could you please check?

taoqf commented 1 year ago

@devansh-sharma-tw Sorry about that. You can try v6.1.0 now. @excelsior091224 For your case, you should pass an empty object as blockTextElements in the options, like this:

const html = `<pre>
  <code>test</code>
</pre>`;
const root = parse(html, {
    blockTextElements: {
    }
});
const list = root.getElementsByTagName("code");
const [code] = list;
code.text.should.eql('test');

devansh-sharma-tw commented 1 year ago

@taoqf Thanks for the fix!