Closed impredicative closed 5 months ago
Would something like this work for your use case?
<body>
{ <*:tag-name-matches(/^(p|(h[1-6]+))$/) @text:content /> }
</body>
Yes, certainly, so long as it's compatible with the rest of hext.
I will implement this.
It will probably be part of the next Hext release, which is due shortly, at the end of April or in early May, because that's when Node 22 is released.
I am glad Hext is useful to you :)
Also, if you have any ideas for a "Hext Version 2", let me know.
For v2, rewrite in Rust! I am only half joking. It may be hard if cycles are involved, but okay otherwise. Please see this chat. A rewrite may be worthwhile to alleviate any fear of a memory leak, and perhaps also to attract more contributors to Hext.
Rust, Zig, or whatever, are indeed superior languages to C++, but C++ is also fine. I am confident in the current code base. But I should do some fuzzing. There might always be a memory bug somewhere. If a "Hext Version 2" ever exists, and if it is vastly different, it will make sense to write it in a "safer" language.
Hext's HTML parser, Gumbo, is written in C and needs to be replaced because it is unmaintained, which is a shame (WTF Google?!). But it was objectively the best HTML parser in 2015. Hext is approaching its 10-year anniversary :D
In your ChatGPT link, ChatGPT should have prefixed its answer with "It's fine, don't worry about it, but: [...]". It's not that hard to port to Rust :)
> Hext's HTML parser, Gumbo, is written in C and needs to be replaced because it is unmaintained, which is a shame (WTF Google?!). But it was objectively the best HTML parser in 2015. Hext is approaching its 10-year anniversary :D
Hello, I'd like to humbly mention that Gumbo is maintained, though not by the original author (by me, btw :). The code is at https://codeberg.org/gumbo-parser/gumbo-parser and some repos have already adopted the new project as the basis for their gumbo-parser package.
@txgk Thank you for your work! This is what FOSS is all about.
I noticed that Arch replaced upstream gumbo with your repository, because Hext suddenly failed to compile on my machine :D (which was fixed with https://github.com/html-extract/hext/commit/07d08ce8c8c1de8db56f13eaa96dd5bfc66e44bf , and now both versions of libgumbo can be linked into hext).
Gumbo is maintained, after all. One less thing for me to worry about, so thank you.
080e5b44e441b1d926633de4ed4a8b523943b777 adds a new trait, :type-matches(regex), which will be part of the next release.
I have updated the editor on hext.thomastrapp.com; the new trait can be used there.
Matching all elements of type <p>, <h1>, <h2>, <h3>, <h4>, <h5>, or <h6>.
Hext:
<*:type-matches(/^(p)$|^(h[1-6])$/) @text:content />
HTML:
<html>
<body>
<p>Content 1</p>
<h1>Content 2</h1>
<h3>Content 3</h3>
<div>Nope</div>
</body>
</html>
Result:
{
"content": "Content 1"
}
{
"content": "Content 2"
}
{
"content": "Content 3"
}
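To make the matching behavior concrete outside of Hext, here is a rough emulation in plain Python. It is only a sketch: it uses the standard library's `html.parser` and `re` as stand-ins for Hext's parser and regex engine, it assumes both treat this simple pattern the same way, and the class name `TypeMatchCollector` is made up for illustration. It does not handle nested matching elements.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the Hext rule above. Python's `re` is only a
# stand-in for Hext's regex engine (an assumption of this sketch).
TYPE_PATTERN = re.compile(r"^(p)$|^(h[1-6])$")


class TypeMatchCollector(HTMLParser):
    """Hypothetical helper: collects the text of elements whose tag
    name matches TYPE_PATTERN. Nested matches are not handled."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._open_tag = None  # tag name of the element being captured
        self._chunks = []      # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        if self._open_tag is None and TYPE_PATTERN.search(tag):
            self._open_tag = tag
            self._chunks = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self.results.append({"content": "".join(self._chunks).strip()})
            self._open_tag = None


html = """<html>
<body>
<p>Content 1</p>
<h1>Content 2</h1>
<h3>Content 3</h3>
<div>Nope</div>
</body>
</html>"""

collector = TypeMatchCollector()
collector.feed(html)
collector.close()
for item in collector.results:
    print(item)
```

The `<div>` is skipped because neither alternative of the pattern matches `div`; the anchors `^` and `$` also keep names such as `pre` or `h7` from matching.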
Matching HTML nodes that have a tag with any value.
Hext:
<*:not(:type-matches(/^$/)) @text:result />
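The rule above works by negation: `/^$/` matches only an empty name, so `:not(...)` keeps everything else. A minimal sketch of that logic in Python, using `re` as a stand-in for Hext's regex engine (an assumption); the helper `has_tag_name` and the sample names are hypothetical, and the empty string stands in for a node without a tag name.

```python
import re

# `^$` matches only the empty string.
EMPTY_NAME = re.compile(r"^$")


def has_tag_name(name: str) -> bool:
    """Hypothetical helper: mirrors :not(:type-matches(/^$/))."""
    return EMPTY_NAME.search(name) is None


# Made-up sample names; "" represents a node with no tag name.
names = ["div", "custom-tag", "H1", ""]
flags = [has_tag_name(name) for name in names]
print(flags)
```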
Custom tags, or tags that are unknown to the HTML parser, are not normalized. To alleviate this, the regex supports case-insensitive matching (notice the i at the end of the regex).
Hext:
<*:type-matches(/^custom[-]?tag$/i) @text:content />
HTML:
<html>
<body>
<custom-tag>Content 1</custom-tag>
<customtag>Content 2</customtag>
<CustomTag>Content 3</CustomTag>
<div>Nope</div>
</body>
</html>
Result:
{
"content": "Content 1"
}
{
"content": "Content 2"
}
{
"content": "Content 3"
}
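The effect of the trailing `i` can be checked in isolation. The sketch below compares the pattern with and without case-insensitive matching, using Python's `re` as a stand-in for Hext's regex engine (an assumption); `re.IGNORECASE` plays the role of the `i` flag, and the list of tag names simply mirrors the HTML above.

```python
import re

# Without the flag, only lowercase spellings match.
sensitive = re.compile(r"^custom[-]?tag$")
# re.IGNORECASE corresponds to the trailing `i` in /.../i.
insensitive = re.compile(r"^custom[-]?tag$", re.IGNORECASE)

# Tag names as they appear in the HTML above (not normalized).
tag_names = ["custom-tag", "customtag", "CustomTag", "div"]

matched_sensitive = [n for n in tag_names if sensitive.search(n)]
matched_insensitive = [n for n in tag_names if insensitive.search(n)]

print(matched_sensitive)
print(matched_insensitive)
```

Only the case-insensitive pattern picks up `CustomTag`, which is why all three custom elements appear in the result.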
Thank you for your feature request. If this does not solve your issue, feel free to comment or open a new issue.
This hext feature is now in critical use [1] in my open source software newssurvey!
That's pretty cool!
> This hext feature is now in critical use in my open source software newssurvey
Awesome!
I currently have:
Obviously this matches all p tags in body at any level. However, I want something like: or, more explicitly:
I mean I also want to match h1 through h6, not just p. This doesn't seem to be supported by hext at this time. This is an important and urgent use case for me, for extracting text from an HTML article for machine learning purposes. I don't, however, want to match any other tags at this time. Is there any way to do this?
Currently, to use hext for this purpose, I first have to use a string replacement to replace all h1-h6 tags with p tags, which is a hacky thing to do via string manipulation and risks errors.