Open noperator opened 2 years ago
I highly support the idea of selecting elements based on their content. While it is true that :contains() did not make its way into CSS3 (as it would break the whole idea of separating structure and content), it would be an extremely important feature for tools like htmlq or pup, as the content to be filtered cannot be controlled by the user of these tools.
Actually, pup has this feature, but it unfortunately does not have a general sibling selector, which htmlq has. It would be nice to have both features.
Agree. I'm kind of doubtful that a widely used package like servo would accept a PR for a nonstandard CSS selector like :contains(). A more realistic option might be to find a way to implement it downstream in kuchiki, or directly within htmlq.
From the developer behind kuchiki:
It is possible [to support pseudo-class selectors]. I have no plan to work on this, however.
Later on, :contains() was also explicitly requested in kuchiki, which was met with the reply:
:contains is not part of CSS: https://drafts.csswg.org/selectors/. I’m not even sure what it’s supposed to do.
I wonder if kuchiki would accept a working PR for :contains(); it does support a number of valid pseudo-classes.
Also, for reference: the full list of selectors that pup implements.
If it's too hard to get this functionality implemented upstream as a pseudo-class selector, then we could alternatively add a CLI option instead:
OPTIONS:
-a, --attribute <attribute> Only return this attribute (if present) from selected elements
-b, --base <base> Use this URL as the base for links
👉 -c, --contains <REGEX> Return only selected elements whose text nodes match this regular expression
-f, --filename <FILE> The input file. Defaults to stdin
-o, --output <FILE> The output file. Defaults to stdout
-r, --remove-nodes <SELECTOR>... Remove nodes matching this expression before output. May be specified multiple
I'm suggesting -c, --contains since :contains() seems to be the most common form that the non-standard pseudo-class selector takes, but something like -m, --match could make sense, too. This seems to align with other options like --attribute and --remove-nodes that post-process the HTML with non-standard selection operations before finally returning.
Drafted a change as proposed above. Given the following HTML sample from https://lethain.com/company-team-self/:
<li class="mb2">
<a href="/work-hard-work-smart/">
Work hard / work smart.</a>
</li>
<li class="mb2">
<a href="/mailbag-not-measurable-whether-hire-exec/"> 👈
Mailbag: What isn't measurable? To hire as exec or not?</a>
</li>
<li class="mb2">
<a href="/reminiscing/">
Reminiscing: the retreat to comforting work.</a>
</li>
You can find the hyperlink list item li.mb2 > a whose text matches the case-insensitive regex (?i)mailbag.*measure?, extract the href, and resolve it against the base URL https://example.com.
curl -s https://lethain.com/company-team-self/ |
htmlq -c '(?i)mailbag.*measure?' -a href -b https://example.com 'li.mb2 > a'
https://example.com/mailbag-not-measurable-whether-hire-exec/
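For anyone who wants to see the intended filtering without building the draft, here is a rough Python sketch of the same idea using only the standard library. It is not how htmlq implements it: the LinkCollector class and filter_links_by_text function are hypothetical names made up for illustration, the li.mb2 > a selector is hard-coded rather than parsed, and I'm assuming "contains" means a regex search over all the text inside the selected element.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# The sample from above, minus the pointer emoji.
SAMPLE = """
<li class="mb2">
<a href="/work-hard-work-smart/">
Work hard / work smart.</a>
</li>
<li class="mb2">
<a href="/mailbag-not-measurable-whether-hire-exec/">
Mailbag: What isn't measurable? To hire as exec or not?</a>
</li>
<li class="mb2">
<a href="/reminiscing/">
Reminiscing: the retreat to comforting work.</a>
</li>
"""


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, text) for each `li.mb2 > a`.

    Simplified on purpose: assumes well-nested markup with no unclosed
    void elements (e.g. a bare <br>) in the input.
    """

    def __init__(self):
        super().__init__()
        self.stack = []      # open elements as (tag, class list)
        self.links = []      # finished (href, text) pairs
        self.current = None  # [href, text parts] while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        parent = self.stack[-1] if self.stack else None
        # Direct child of <li class="... mb2 ...">, i.e. `li.mb2 > a`.
        if tag == "a" and parent and parent[0] == "li" and "mb2" in parent[1]:
            self.current = [attrs.get("href"), []]
        self.stack.append((tag, (attrs.get("class") or "").split()))

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            href, parts = self.current
            self.links.append((href, "".join(parts)))
            self.current = None
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.current is not None:
            self.current[1].append(data)


def filter_links_by_text(html, pattern, base):
    """Return absolute hrefs of `li.mb2 > a` links whose text matches pattern."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    regex = re.compile(pattern)
    return [
        urljoin(base, href)
        for href, text in parser.links
        if href and regex.search(text)
    ]


print(filter_links_by_text(SAMPLE, r"(?i)mailbag.*measure?", "https://example.com"))
```

Note that Python's re and the Rust regex crate differ in places, so a pattern that works here is not guaranteed to behave identically in htmlq.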
Love htmlq; use it daily. Thanks for the effort you've put into this tool.
I have a pretty frequent need to find an element based on the text that it contains. For example, I'd like to match on the first div (i.e., the one containing Item 1).
There have been various attempts at updating the official CSS specification to support this kind of functionality, though I don't think any of them have actually made their way into the spec. Instead, various tools (Playwright, etc.) extend their own CSS selectors to support one of the following forms.
It would be very useful if I could use the following CSS selector with htmlq to match on the previously noted element.
To my understanding, htmlq relies on kuchiki for HTML parsing, which in turn relies on servo for CSS selection, so I think I'd need to request support for this upstream in servo. Does that seem right to you? Just wanted to run this idea past you in case you've thought about it already and/or have an opinion about it.