html-extract / hext

Domain-specific language for extracting structured data from HTML documents
https://hext.thomastrapp.com
Apache License 2.0

"All siblings of type" issue #7

Closed brandonrobertz closed 5 years ago

brandonrobertz commented 5 years ago

I've been playing with a bunch of extractors and I encountered an issue that has confused me a bit. I'm playing with this DOM:

<li class="item event" >
  <div class="col-12 col-sm-2 event-type" >
    <h5 >
      Special Event
    </h5>
  </div>
    <div class="col-12 col-sm-7 item-content event-content" >
      <h3 class="title item-title event-title" >
        <a href="/events-and-training/event/3433/4377/" >Conference registration (Wednesday)</a>
      </h3>
        <p >Wednesday is a registration day.</p>
        <p >No talks scheduled.</p>
        <p ></p>
    </div>
  <div class="col-12 col-sm-3 item-meta event-meta" >
    <h4 class="event-location" >
      Salon EF
    </h4>
      <p  class="">
      3:00 pm - 6:00 pm
      </p>
  </div>
</li>

<li class="item event" >
  <div class="col-12 col-sm-2 event-type" >
    <h5 >
      Special Event
    </h5>
  </div>
    <div class="col-12 col-sm-7 item-content event-content" >
      <h3 class="title item-title event-title" >
        <a href="/events-and-training/event/3433/4378/">Conference sales (Wednesday)</a>
      </h3>
      <p ></p>
      <p >Stop by the conference sales table and browse our merchandise.</p>
      <p ></p>
    </div>
  <div class="col-12 col-sm-3 item-meta event-meta" >
    <h4 class="event-location" >
      Salon EF
    </h4>
      <p >
      3:00 pm - 6:00 pm
      </p>
  </div>
</li>

From it, I am looking to get a JSON representation like this:

{
    "BODY": [
        "Wednesday is a registration day.",
        "No talks scheduled.",
        ""
    ],
    "TITLE": "Conference registration (Wednesday)"
}
{
    "BODY": [
        "",
        "Stop by the conference sales table and browse our merchandise.",
        ""
    ],
    "TITLE": "Conference sales (Wednesday)"
}

My first thought was this:

<DIV >
  <h3><a @text:TITLE /></h3>
  <p @text:BODY />
</DIV>

But I get only the first p tag; the others are ignored:

{
    "BODY": "Wednesday is a registration day.",
    "TITLE": "Conference registration (Wednesday)"
}
{
    "BODY": "",
    "TITLE": "Conference sales (Wednesday)"
}

I tried CSS nth-child selectors, but each selector seems to match only a single element (a range like n+2 grabs only the second child and ignores the rest):

# nth-child(n+2) throws an error!

The only way I've found to get all of the `p` tags under the `div` into the `BODY` array is to omit the `h3` tag:



Is this expected behavior? Is there a template I haven't thought of that can get both the `h3` text and an array of the sibling `p` tags under a `div`?

Thanks a lot!

thomastrapp commented 5 years ago

This will only match pairs of one <h3> and one <p>, repeatedly:

  <h3><a @text:TITLE /></h3>
  <p @text:BODY />

Making the <h3> optional (by adding a question mark) allows hext to match <p> without the heading:

<DIV class="item-content">
  <?h3> # <-- optional h3
    <a @text:TITLE />
  </h3>
  <p @text:BODY />
</DIV>

and will produce:

{
    "BODY": [
        "Wednesday is a registration day.",
        "No talks scheduled.",
        ""
    ],
    "TITLE": "Conference registration (Wednesday)"
}
{
    "BODY": [
        "",
        "Stop by the conference sales table and browse our merchandise.",
        ""
    ],
    "TITLE": "Conference sales (Wednesday)"
}

I hope this solves your problem.

I will think about this some more. The current rule matching might be too strict:


The hext template

<A/>
<B/>

means "Must match A and B, then repeat".


The hext template

<A/>
<?B/>

means "Must match A and optionally B, then repeat".
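The two readings above can be tried out with a toy model. This is only a sketch of the described semantics, not hext's actual implementation: it assumes siblings are a flat list of tag names, that rules are matched strictly in order against consecutive siblings, and that matching stops at the first failure. The names `SimpleRule`, `match_once` and `match_repeatedly` are made up for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class SimpleRule(NamedTuple):
    """Hypothetical stand-in for one hext rule such as <A/> or <?B/>."""
    tag: str
    optional: bool = False  # models the "?" prefix


def match_once(rules: List[SimpleRule], siblings: List[str],
               start: int) -> Optional[Tuple[List[str], int]]:
    """Match the rules in order, each consuming at most one sibling.

    Returns (matched tags, next position), or None if a mandatory rule fails.
    """
    pos = start
    matched: List[str] = []
    for rule in rules:
        if pos < len(siblings) and siblings[pos] == rule.tag:
            matched.append(siblings[pos])
            pos += 1
        elif not rule.optional:
            return None  # a mandatory rule did not match
    return matched, pos


def match_repeatedly(rules: List[SimpleRule],
                     siblings: List[str]) -> List[List[str]]:
    """Apply the whole rule sequence again and again ("then repeat")."""
    groups: List[List[str]] = []
    pos = 0
    while pos < len(siblings):
        result = match_once(rules, siblings, pos)
        if result is None or not result[0]:
            break  # nothing consumed, so stop instead of looping forever
        matched, pos = result
        groups.append(matched)
    return groups


siblings = ["h3", "p", "p", "p"]

# <h3/> <p/>: only one (h3, p) pair exists, so the later paragraphs are lost.
strict = match_repeatedly([SimpleRule("h3"), SimpleRule("p")], siblings)

# <?h3/> <p/>: the heading is optional, so every paragraph is picked up.
relaxed = match_repeatedly(
    [SimpleRule("h3", optional=True), SimpleRule("p")], siblings)

print(strict)
print(relaxed)
```

In this model the strict template consumes a single paragraph, while the template with the optional heading consumes all three, which mirrors the behaviour reported in this issue. How hext treats non-matching elements between matches is deliberately left out.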


Maybe there should be a way to tell hext to repeat a rule for adjacent elements.

For example "Match <h3>, then match <p> and each adjacent <p>":

<h3 />
<p:repeat ... />

brandonrobertz commented 5 years ago

Thanks for the response and explanation. Unfortunately, a div with a p is too general a rule and will produce many unexpected records if the h3 isn't a filtering criterion.

An adjacent siblings repeat rule would be useful in this case. That said, I kind of expected p:nth-of-type(n) or p:nth-child(n+2) to grab all the adjacent p siblings, but it appears those selectors boil down to only a single catch in the above case. So I think the repeat rule is the answer here.

Is there anything I can do to help make this happen?

thomastrapp commented 5 years ago

Another solution came to mind:

<DIV >
  <h3><a @text:TITLE /></h3>
  <?p @text:BODY />
  <?p @text:BODY />
  <?p @text:BODY />
  <?p @text:BODY />
  <?p @text:BODY />
</DIV>

This will capture between zero and five paragraphs following the <h3>. But this is ugly :)

> That said, I kind of expected p:nth-of-type(n) or p:nth-child(n+2) to grab all the adjacent p siblings, but it appears those selectors boil down to only a single catch in the above case.

Yes, <p...> can only consume a single paragraph at each turn. I agree that nth-child(pattern), and the others, look like they might consume all elements that match pattern. This is unfortunate.


I really like the :repeat specifier. :repeat should behave like optional rules (after the first match), continually searching for a match until another rule takes precedence or until there are no more elements left. A rule with :repeat can also be made optional, that is, it may match none or many.

I will work on this in the coming weeks.

> Is there anything I can do to help make this happen?

Thank you very much for your offer 👍 I will let you know.

Edit: Removed outdated examples.

thomastrapp commented 5 years ago

Hext now has "greedy rules" (05c2fe7ff4056a38c85d2144dbf63cd213e4e829):

Rules may be greedy. A rule marked with a plus sign does not stop at the first match, instead it continually searches for a match until a mandatory rule takes precedence or until there are no more elements left. A greedy rule can also be made optional, that is, it may match none or many.

# match <h1>, followed by at least one <p>
<h1/><+p />
# match <h1>, followed by zero or more <p>
<h1/><?+p />
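The effect of the plus sign can be illustrated with a small toy model. This is a sketch under stated assumptions, not hext's real matcher: siblings are a flat list of tag names, rules are matched strictly in order, and the "until a mandatory rule takes precedence" part is not modelled. `Rule` and `match_group` are hypothetical names.

```python
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    """Hypothetical stand-in for a hext rule such as <h1/>, <+p/> or <?+p/>."""
    tag: str
    optional: bool = False  # models the "?" prefix
    greedy: bool = False    # models the "+" prefix


def match_group(rules: List[Rule], siblings: List[str]) -> Optional[List[str]]:
    """Match the rules in order against the siblings, starting at the front.

    A greedy rule keeps consuming consecutive siblings with its tag.
    Returns the matched tags, or None if a mandatory rule matched nothing.
    """
    pos = 0
    matched: List[str] = []
    for rule in rules:
        count = 0
        while pos < len(siblings) and siblings[pos] == rule.tag:
            matched.append(siblings[pos])
            pos += 1
            count += 1
            if not rule.greedy:
                break  # a non-greedy rule stops after its first match
        if count == 0 and not rule.optional:
            return None
    return matched


at_least_one = [Rule("h1"), Rule("p", greedy=True)]                  # <h1/><+p />
zero_or_more = [Rule("h1"), Rule("p", optional=True, greedy=True)]   # <h1/><?+p />

print(match_group(at_least_one, ["h1", "p", "p", "p"]))
print(match_group(at_least_one, ["h1"]))
print(match_group(zero_or_more, ["h1"]))
```

In this model a lone heading fails the first template, because `<+p />` still needs at least one paragraph, but satisfies the second, where the greedy rule is also optional.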

As a next step, I'll set up automated (or semi-automated) releases via Travis for pip and npm.

@brandonrobertz To achieve the result in your example:

<DIV >
  <h3><a @text:TITLE /></h3>
  <+p @text:BODY />
</DIV>

Test case: hext-github-issue7.hext, hext-github-issue7.html, hext-github-issue7.expected.

brandonrobertz commented 5 years ago

Wow, this is incredible! This is going to help with a lot of the scrapes I've been trying to do, and it will make it easier to build my Hext template-building UI (https://github.com/brandonrobertz/hextractor/). Thank you!

thomastrapp commented 5 years ago

My pleasure!

thomastrapp commented 5 years ago

I have updated the Python and npm packages (0.2.1 and 10.0.3, respectively).

Automated builds must wait for a later date. Linux would be no problem, because I can use a docker image with all dependencies preinstalled. But I am not sure yet how to automate the Mac OS X builds (without building everything from source). Maybe automating the build for Python on Linux only is good enough for now.

@brandonrobertz I love what you are doing with Hext — I was particularly surprised by hext-emscripten. This would even make it possible to use Hext in a browser extension.