Closed brandonrobertz closed 5 years ago
This will only match pairs of one <h3> and one <p>, repeatedly:
<h3><a @text:TITLE /></h3>
<p @text:BODY />
Making the <h3> optional (by adding a question mark) allows hext to match <p> without the heading:
<DIV class="item-content">
<?h3> # <-- optional h3
<a @text:TITLE />
</h3>
<p @text:BODY />
</DIV>
and will produce:
{
"BODY": [
"Wednesday is a registration day.",
"No talks scheduled.",
""
],
"TITLE": "Conference registration (Wednesday)"
}
{
"BODY": [
"",
"Stop by the conference sales table and browse our merchandise.",
""
],
"TITLE": "Conference sales (Wednesday)"
}
I hope this solves your problem.
I will think about this some more. The current rule matching might be too strict:
The hext template
<A/>
<B/>
means "Must match A and B, then repeat".
The hext template
<A/>
<?B/>
means "Must match A and optionally B, then repeat".
Maybe there should be a way to tell hext to repeat a rule for adjacent elements.
For example "Match <h3>, then match <p> and each adjacent <p>":
<h3 />
<p:repeat ... />
Thanks for the response and explanation. Unfortunately, a div with a p is too general a rule and will result in many unexpected records if the h3 isn't a filtering criterion.
An adjacent siblings repeat rule would be useful in this case. That said, I kind of expected p:nth-of-type(n) or p:nth-child(n+2) to grab all the adjacent p siblings, but it appears those selectors boil down to only a single catch in the above case. So I think the repeat rule is the answer here.
Is there anything I can do to help make this happen?
Another solution came to mind:
<DIV >
<h3><a @text:TITLE /></h3>
<?p @text:BODY />
<?p @text:BODY />
<?p @text:BODY />
<?p @text:BODY />
<?p @text:BODY />
</DIV>
This will capture none or up to five paragraphs following the <h3>. But this is ugly :)
That said, I kind of expected p:nth-of-type(n) or p:nth-child(n+2) to grab all the adjacent p siblings, but it appears those selectors boil down to only a single catch in the above case.
Yes, <p...> can only consume a single paragraph at each turn. I agree that nth-child(pattern), and the others, look like they might consume all elements that match pattern. This is unfortunate.
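To make the mismatch concrete, here is a minimal sketch of which child positions a CSS an+b pattern selects on its own. The helper name nth_child_matches is hypothetical and is not part of Hext; it only models the standard CSS rule that position i matches when i = a*n + b for some integer n >= 0.

```python
from typing import List


def nth_child_matches(a: int, b: int, child_count: int) -> List[int]:
    """Return the 1-based child positions selected by the CSS pattern an+b.

    Hypothetical helper for illustration only; it is not a Hext API.
    """
    positions = []
    for i in range(1, child_count + 1):
        diff = i - b
        if a == 0:
            # With a == 0 the pattern selects exactly position b.
            if diff == 0:
                positions.append(i)
        elif diff % a == 0 and diff // a >= 0:
            # i = a*n + b must hold for a non-negative integer n.
            positions.append(i)
    return positions


# nth-child(n+2) over five children selects every child from the second on.
print(nth_child_matches(1, 2, 5))  # [2, 3, 4, 5]
```

So the pattern itself describes several candidate elements; the single catch comes from the rule consuming only one of them per turn.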
I really like the :repeat specifier.
:repeat should behave like optional rules (after the first match), continually searching for a match until another rule takes precedence or until there are no more elements left.
A rule with :repeat can also be made optional, that is, it may match none or many.
I will work on this over the next few weeks.
Is there anything I can do to help make this happen?
Thank you very much for your offer :+1: I will let you know.
Edit: Removed outdated examples.
Hext now has "greedy rules" (05c2fe7ff4056a38c85d2144dbf63cd213e4e829):
Rules may be greedy. A rule marked with a plus sign does not stop at the first match, instead it continually searches for a match until a mandatory rule takes precedence or until there are no more elements left. A greedy rule can also be made optional, that is, it may match none or many.
# match <h1>, followed by at least one <p>
<h1/><+p />
# match <h1>, followed by zero or more <p>
<h1/><?+p />
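The difference between plain, optional, and greedy rules can be illustrated with a toy model. This is a simplified sketch and not Hext's actual matching algorithm: it assumes rules are compared by tag name only, that matches must be strictly adjacent siblings, and it does not model a mandatory rule taking precedence over a greedy one. The names ToyRule and match_groups are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyRule:
    tag: str
    optional: bool = False  # stands in for the "?" marker
    greedy: bool = False    # stands in for the "+" marker


def match_groups(siblings: List[str], rules: List[ToyRule]) -> List[List[int]]:
    """Group sibling indices consumed by each full pass over the rule list."""
    groups = []
    start = 0
    while start < len(siblings):
        pos = start
        matched = []
        ok = True
        for rule in rules:
            if pos < len(siblings) and siblings[pos] == rule.tag:
                matched.append(pos)
                pos += 1
                # A greedy rule keeps consuming adjacent matching siblings.
                while rule.greedy and pos < len(siblings) and siblings[pos] == rule.tag:
                    matched.append(pos)
                    pos += 1
            elif not rule.optional:
                # A mandatory rule failed, so this pass produces no group.
                ok = False
                break
        if ok and matched:
            groups.append(matched)
            start = pos
        else:
            start += 1
    return groups


siblings = ["h3", "p", "p", "p", "h3", "p"]
# Non-greedy: only the first <p> after each <h3> is captured.
print(match_groups(siblings, [ToyRule("h3"), ToyRule("p")]))
# Greedy: every adjacent <p> joins the group of the preceding <h3>.
print(match_groups(siblings, [ToyRule("h3"), ToyRule("p", greedy=True)]))
```

Under these assumptions the non-greedy rule list drops the second and third paragraph, while the greedy variant keeps them in the same group as their heading.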
As a next step, I'll set up automated (or semi-automated) releases via Travis for pip and npm.
@brandonrobertz To achieve the result in your example:
<DIV >
<h3><a @text:TITLE /></h3>
<+p @text:BODY />
</DIV>
Test case: hext-github-issue7.hext, hext-github-issue7.html, hext-github-issue7.expected.
Wow, this is incredible! This is going to help with a lot of the scrapes I've been trying to do and facilitate the building of my Hext template building UI (https://github.com/brandonrobertz/hextractor/). Thank you!
My pleasure!
I have updated the python and npm package (0.2.1 and 10.0.3, respectively).
Automated builds must wait for a later date. Linux would be no problem, because I can use a docker image with all dependencies preinstalled. But I am not sure yet how to automate the Mac OS X builds (without building everything from source). Maybe automating the build for Python on Linux only is good enough for now.
@brandonrobertz I love what you are doing with Hext — I was particularly surprised by hext-emscripten. This would even make it possible to use Hext in a browser extension.
I've been playing with a bunch of extractors and I encountered an issue that has confused me a bit. I'm playing with this DOM:
From it, I am looking to get a JSON representation like this:
My first thought was this:
But I get only the first p tag; the others are ignored:
I attempted with CSS nth-child selectors, but those selectors seem to allow only a single reference (ranges like n+2 will only grab the second child, ignoring the rest):