taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

multiple HTMLElement wrapping after parse #256

Closed SayWut closed 11 months ago

SayWut commented 11 months ago

I encountered weird behavior when I tried to use the removeChild function. At first it didn't work, and I couldn't figure out why.

I found out that, after I parse my HTML, part of it is wrapped in two HTMLElement objects for no apparent reason. To work around this, I had to use the firstChild property to make it work.

Faulty code:

const test = () => {
    const body = fs.readFileSync("./index.html", { encoding: "utf8" });
    let jobDescriptionElement = parse.parse(body);

    const jobRequirementsElement = jobDescriptionElement?.querySelector(".PT15");
    jobDescriptionElement?.removeChild(jobRequirementsElement!);
}

Working code:

const test = () => {
    const body = fs.readFileSync("./index.html", { encoding: "utf8" });
    let jobDescriptionElement = parse.parse(body).firstChild as HTMLElement;

    const jobRequirementsElement = jobDescriptionElement?.querySelector(".PT15");
    jobDescriptionElement?.removeChild(jobRequirementsElement!);
}

index.html file:

<div>
    required Senior Backend Engineer, Demand<br />we currently work in a hybrid
    work model giving employees the flexibility to work from home a few days a
    week. We have offices in Ramat Gan and are looking to grow to additional
    locations such as Jerusalem, where our Taboolars have the opportunity to meet
    their teammates, connect with other teams and socialize with friends.<br /><br />What
    are some of the things you do on a day-to-day basis?<br /><br />Develop one of
    the largest real time big data operation in the world to support over 40TB of
    new data every day<br />Have end to end ownership: Design, build, ship,
    measure and maintain our products for the biggest publishers and brands
    advertisers in the world.<br />Influence directly on the way billions of
    people discover the internet
    <div class="PT15">
      <b>Requirements: </b><br />
      Server: Java, Spark, Kafka, Hadoop, Cassandra, Vertica, MySQL, HDFS,
      BigQuery, Docker, Kubernetes, Prometheus, Grafana, Airflow, Redis<br /><br />Client:
      React, Vanilla JavaScript, ES6+, Webpack, HTML, CSS<br /><br />What are the
      skills a good Software Engineer needs to have?<br /><br />Proven experience
      as a Full Stack Developer<br />Experienced in designing and developing large
      scale distributed systems<br />Deep understanding and strong Computer
      Science fundamentals: object-oriented design, data structures, applications
      programming and multithreading programming<br />3+ years programming
      experience in Java or equivalent Object-Oriented language (preferably Java +
      Spring)<br />Able to technically lead and mentor other team members<br />It
      would be great if you also have: <br /><br />BSc in computer science or
      equivalent Nice to have
      <div class="position-intended-for-english">
        This position is open to all candidates.
      </div>
    </div>
    <div></div>
  </div>

This feels like weird parser behavior. I first ran into it in a larger codebase; this is a small test I wrote to check it, and it behaves the same way. In the other code I used getElementById on a bigger HTML page.

taoqf commented 11 months ago

That's because jobDescriptionElement is just a wrapper: the parser also has to handle HTML with several top-level elements correctly, like this:

<foo></foo>
<bar></bar>

Either querySelector or getElementById will return the element you need.
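
To illustrate why the first snippet appears to do nothing, here is a minimal sketch using a toy tree rather than the library itself. ToyNode, buildTree and find are hypothetical names made up for this example, and the assumption that removeChild only looks at direct children is inferred from the behavior reported above, not taken from the parser's source. The query searches the whole subtree, but the removal has to be called on the node that directly contains the target.

```typescript
// Toy stand-in for a parsed node. This is NOT node-html-parser's real class.
class ToyNode {
    parentNode: ToyNode | null = null;
    childNodes: ToyNode[] = [];

    constructor(public readonly name: string) {}

    appendChild(child: ToyNode): ToyNode {
        child.parentNode = this;
        this.childNodes.push(child);
        return child;
    }

    // Assumption: only direct children can be removed; anything else is ignored.
    removeChild(node: ToyNode): void {
        if (node.parentNode !== this) return;
        this.childNodes = this.childNodes.filter((child) => child !== node);
        node.parentNode = null;
    }

    // Depth-first search through all descendants, like a query on the root.
    find(name: string): ToyNode | null {
        for (const child of this.childNodes) {
            if (child.name === name) return child;
            const hit = child.find(name);
            if (hit !== null) return hit;
        }
        return null;
    }
}

// Mirrors the shape of index.html: wrapper root -> outer div -> div.PT15.
function buildTree(): ToyNode {
    const root = new ToyNode("#wrapper");
    const outer = root.appendChild(new ToyNode("div"));
    outer.appendChild(new ToyNode("div.PT15"));
    outer.appendChild(new ToyNode("div"));
    return root;
}

const root = buildTree();
const target = root.find("div.PT15")!;

// Called on the wrapper: the target is a grandchild, so nothing is removed.
root.removeChild(target);
const stillPresent = root.find("div.PT15") !== null;

// Called on the node that actually holds the target: the removal succeeds.
target.parentNode!.removeChild(target);
const removed = root.find("div.PT15") === null;

console.log({ stillPresent, removed });
```

With the real parser, the equivalent idea is to call removeChild on the found element's parentNode instead of on the wrapper returned by parse, which avoids relying on firstChild.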