html-extract / hext

Domain-specific language for extracting structured data from HTML documents
https://hext.thomastrapp.com
Apache License 2.0

Matching any of multiple tags #30

Closed impredicative closed 5 months ago

impredicative commented 5 months ago

I currently have:

<body>
    { <p @text:content /> }
</body>

Obviously, this matches all p tags in body at any level. However, I want something like:

<body>
    { <p|h[1-6] @text:content /> }
</body>

or more explicitly:

<body>
    { <p|h1|h2|h3|h4|h5|h6 @text:content /> }
</body>

That is, I also want to match h1 through h6, not just p. This doesn't seem to be supported by hext at this time. This is an important and urgent use case for me: extracting text from an HTML article for machine learning purposes. However, I don't want to match any other tags at this time. Is there any way to do this?

Currently, to use hext for this purpose, I first have to replace all h1-h6 tags with p tags via string replacement, which is a hacky and error-prone workaround.

thomastrapp commented 5 months ago

Would something like this work for your use case?

<body>
  { <*:tag-name-matches(/^(p|(h[1-6]+))$/) @text:content /> }
</body>

impredicative commented 5 months ago

Yes, certainly, so long as it's compatible with the rest of hext.

thomastrapp commented 5 months ago

I will implement this.

It will probably be part of the next Hext release, which is due shortly, at the end of April or in early May, because that's when Node 22 is released.

I am glad Hext is useful to you :)

Also, if you have any ideas for a "Hext Version 2", let me know.

impredicative commented 5 months ago

For v2, rewrite in Rust! I am only half joking. It may be hard if cycles are involved, but okay otherwise. Please see this chat. A rewrite may be worthwhile to alleviate any fear of a memory leak, and perhaps also to attract more contributors to Hext.

thomastrapp commented 5 months ago

Rust, Zig, or whatever, are indeed superior languages to C++, but C++ is also fine. I am confident in the current code base. But I should do some fuzzing. There might always be a memory bug somewhere. If a "Hext Version 2" ever exists, and if it is vastly different, it will make sense to write it in a "safer" language.

Hext's HTML parser, Gumbo, is written in C, and needs to be replaced, because it is unmaintained, which is a shame (WTF Google?!). But it was objectively the best HTML parser in 2015. Hext is approaching its 10-year anniversary :D

In your ChatGPT link, ChatGPT should have prefixed its answer with "It's fine, don't worry about it, but: [...]". It's not that hard to port to Rust :)

txgk commented 5 months ago

Hext's HTML parser, Gumbo, is written in C, and needs to be replaced, because it is unmaintained, which is a shame (WTF Google?!). But it was objectively the best HTML parser in 2015. Hext is approaching its 10-year anniversary :D

Hello, I'd like to humbly mention that Gumbo is maintained, but not by the original author (by me, btw :). Code is at https://codeberg.org/gumbo-parser/gumbo-parser and some repos already took the new project as basis for a gumbo-parser package.

thomastrapp commented 5 months ago

@txgk Thank you for your work! This is what FOSS is all about.

I noticed that Arch replaced upstream gumbo with your repository, because Hext suddenly failed to compile on my machine :D (which was fixed with https://github.com/html-extract/hext/commit/07d08ce8c8c1de8db56f13eaa96dd5bfc66e44bf , and now both versions of libgumbo can be linked into hext).

Gumbo is maintained, after all. One less thing for me to worry about, so thank you.

thomastrapp commented 5 months ago

080e5b44e441b1d926633de4ed4a8b523943b777 adds a new trait: :type-matches(regex) and will be part of the next release. I have updated the editor on hext.thomastrapp.com; the new trait can be used there.

Example 1

Matching all elements of type <p>, <h1>, <h2>, <h3>, <h4>, <h5>, or <h6>.

Hext:

<*:type-matches(/^(p)$|^(h[1-6])$/) @text:content />

HTML:

<html>
  <body>
    <p>Content 1</p>
    <h1>Content 2</h1>
    <h3>Content 3</h3>
    <div>Nope</div>
  </body>
</html>

Result:

{
    "content": "Content 1"
}
{
    "content": "Content 2"
}
{
    "content": "Content 3"
}
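
For readers without Hext at hand, the selection logic of Example 1 can be approximated with the Python standard library. This is only a rough sketch, not Hext's implementation: it assumes Python's `re` treats this simple pattern the same way Hext's regex engine does, it uses a well-formed version of the sample document, and the names `TypeMatchCollector` and `extract_matching` are hypothetical helpers made up for illustration. Unlike Hext, it does not capture matching elements nested inside another matching element.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the Hext example; assumed to behave identically in Python.
TYPE_PATTERN = re.compile(r"^(p)$|^(h[1-6])$")

SAMPLE_HTML = """
<html>
  <body>
    <p>Content 1</p>
    <h1>Content 2</h1>
    <h3>Content 3</h3>
    <div>Nope</div>
  </body>
</html>
"""


class TypeMatchCollector(HTMLParser):
    """Collects the text of every element whose tag name matches a regex.

    Hypothetical helper for illustration; not part of Hext.
    """

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.results = []      # one {"content": ...} dict per matched element
        self._buffer = None    # text chunks of the element being captured
        self._open_tag = None  # tag name of the element being captured

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if self._buffer is None and self.pattern.search(tag):
            self._buffer = []
            self._open_tag = tag

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._buffer is not None and tag == self._open_tag:
            self.results.append({"content": "".join(self._buffer).strip()})
            self._buffer = None
            self._open_tag = None


def extract_matching(html, pattern=TYPE_PATTERN):
    """Return a list of {"content": text} dicts for matching elements."""
    collector = TypeMatchCollector(pattern)
    collector.feed(html)
    collector.close()
    return collector.results


if __name__ == "__main__":
    for item in extract_matching(SAMPLE_HTML):
        print(item)
```

The anchors `^` and `$` matter here: without them, the `p` alternative would also match tag names that merely contain a "p", such as `span`.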

Example 2

Matching all HTML nodes that have a tag name, whatever its value.

Hext:

<*:not(:type-matches(/^$/)) @text:result />
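
The regex `/^$/` matches only the empty string, so negating it selects every node with a non-empty tag name. A minimal Python sketch of that predicate follows, assuming Python's `re` behaves like Hext's regex engine for this pattern. `has_tag_name` is a hypothetical helper, and which nodes actually carry an empty tag name in Hext is not covered here.

```python
import re

# /^$/ from the example: matches only an empty tag name.
EMPTY_NAME = re.compile(r"^$")


def has_tag_name(name):
    """Mirror :not(:type-matches(/^$/)): true unless the name is empty."""
    return EMPTY_NAME.search(name) is None


print([has_tag_name(n) for n in ["p", "custom-tag", ""]])  # [True, True, False]
```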

Example 3

Custom tags, or tags that are unknown to the HTML parser, are not normalized. To alleviate this, the regex supports case-insensitive matching (notice the i at the end of the regex).

Hext:

<*:type-matches(/^custom[-]?tag$/i) @text:content />

HTML:

<html>
  <body>
    <custom-tag>Content 1</custom-tag>
    <customtag>Content 2</customtag>
    <CustomTag>Content 3</CustomTag>
    <div>Nope</div>
  </body>
</html>

Result:

{
    "content": "Content 1"
}
{
    "content": "Content 2"
}
{
    "content": "Content 3"
}
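
The effect of the `i` flag can be checked outside Hext with a short Python sketch. It assumes that the trailing `i` corresponds to Python's `re.IGNORECASE`; `matching_tags` is a hypothetical helper that operates on tag names only, not on a parsed document.

```python
import re

# Pattern from Example 3; the trailing `i` in Hext is assumed to correspond
# to Python's re.IGNORECASE flag.
CUSTOM_TAG = re.compile(r"^custom[-]?tag$", re.IGNORECASE)
# The same pattern without the flag, for comparison.
CUSTOM_TAG_CASE_SENSITIVE = re.compile(r"^custom[-]?tag$")


def matching_tags(names, pattern=CUSTOM_TAG):
    """Return the tag names that the pattern matches, in input order."""
    return [name for name in names if pattern.search(name)]


tags = ["custom-tag", "customtag", "CustomTag", "div"]
print(matching_tags(tags))                             # ['custom-tag', 'customtag', 'CustomTag']
print(matching_tags(tags, CUSTOM_TAG_CASE_SENSITIVE))  # ['custom-tag', 'customtag']
```

Without the flag, `CustomTag` is skipped, which is why the example relies on case-insensitive matching for tags that the parser does not normalize.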

Thank you for your feature request. If this does not solve your issue, feel free to comment or open a new issue.

impredicative commented 3 weeks ago

This hext feature is now in critical use [1] in my open source software newssurvey!

brandonrobertz commented 3 weeks ago

This hext feature is now in critical use [1] in my open source software newssurvey!

That's pretty cool!

thomastrapp commented 3 weeks ago

This hext feature is now in critical use in my open source software newssurvey

Awesome!