mgdm / htmlq

Like jq, but for HTML.
MIT License

Support selecting by element's inner text/content #55

Open noperator opened 2 years ago

noperator commented 2 years ago

Love htmlq—use it daily. Thanks for the effort you've put into this tool.

I have a pretty frequent need to find an element based on the text that it contains. For example, I'd like to match on the first div (i.e., the one containing Item 1).

<div>Item 1</div>
<div>Item 2</div>
<div>Item 3</div>

There have been several attempts to add this kind of functionality to the official CSS specification, though I don't think any of them have actually made their way into the spec. Instead, various tools (Playwright, etc.) extend their own CSS selectors to support one of the following forms.

div:contains("Item 1")
div:has-text("Item 1")
div[innerText="Item 1"]
div[textContent="Item 1"]

It would be very useful if I could use the following CSS selector with htmlq to match on the previously noted element.

htmlq 'div:contains("Item 1")'
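
To make the requested behavior concrete, here is a minimal Python sketch of what such a selector could do. It assumes substring semantics over an element's full text content (similar to jQuery's :contains()); htmlq does not currently implement this, and the names TagTextCollector and contains are hypothetical helpers for illustration only.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text content of every element with the given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.open_buffers = []  # one text buffer per currently open <tag>
        self.texts = []         # text of each closed <tag>, inner elements first

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.open_buffers.append([])

    def handle_data(self, data):
        # Text counts toward every enclosing element of interest.
        for buf in self.open_buffers:
            buf.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self.open_buffers:
            self.texts.append("".join(self.open_buffers.pop()))


def contains(html, tag, needle):
    """Return the text of each <tag> element whose text includes `needle`."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return [text for text in collector.texts if needle in text]


sample = "<div>Item 1</div>\n<div>Item 2</div>\n<div>Item 3</div>"
print(contains(sample, "div", "Item 1"))  # prints ['Item 1']
```

Note that plain substring matching would also match a hypothetical "Item 10"; an exact-match form such as div[textContent="Item 1"] would not.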

To my understanding, htmlq relies on kuchiki for HTML parsing, which in turn relies on servo for CSS selection—so I think I'd need to request support for this upstream in servo. Does that seem right to you? Just wanted to run this idea past you in case you've thought about it already and/or have an opinion about it.

harri-halttunen-aktia commented 1 year ago

I strongly support the idea of selecting elements based on their content. While it is true that :contains() did not make its way into CSS3 (as it would break the whole idea of separating structure and content), it would be an extremely important feature for tools like htmlq or pup, as the content to be filtered cannot be controlled by the user of these tools.

Actually, pup has this feature, but unfortunately it does not have a general sibling selector, which htmlq has. It would be nice to have both features.

noperator commented 1 year ago

Agreed. I'm doubtful that a widely used package like servo would accept a PR for a nonstandard CSS selector like :contains(). A more realistic option might be to implement it downstream in kuchiki, or directly within htmlq.

From the developer behind kuchiki:

It is possible [to support pseudo-class selectors]. I have no plan to work on this, however.

Later on, :contains() was also explicitly requested in kuchiki, which was met with the reply:

:contains is not part of CSS: https://drafts.csswg.org/selectors/. I’m not even sure what it’s supposed to do.

Wonder if kuchiki would accept a working PR for :contains(); it does support a number of valid pseudo-classes.


Also, for reference, the full list of selectors that pup implements.

noperator commented 1 year ago

If it's too hard to get this functionality implemented upstream as a pseudo-class selector, then we could alternatively add a CLI option instead:

OPTIONS:
    -a, --attribute <attribute>         Only return this attribute (if present) from selected elements
    -b, --base <base>                   Use this URL as the base for links
 👉 -c, --contains <REGEX>              Return only selected elements whose text nodes match this regular expression
    -f, --filename <FILE>               The input file. Defaults to stdin
    -o, --output <FILE>                 The output file. Defaults to stdout
    -r, --remove-nodes <SELECTOR>...    Remove nodes matching this expression before output. May be specified multiple

I'm suggesting -c, --contains since :contains() seems to be the most common form that the non-standard pseudo-class selector takes—but something like -m, --match could make sense, too. This seems to align with other options like --attribute and --remove-nodes that post-process the HTML with non-standard selection operations before finally returning.

noperator commented 1 year ago

Drafted a change as proposed above. Given the following HTML sample from https://lethain.com/company-team-self/:

<li class="mb2">
  <a href="/work-hard-work-smart/">
    Work hard / work smart.</a>
</li>
<li class="mb2">
  <a href="/mailbag-not-measurable-whether-hire-exec/">  👈
    Mailbag: What isn't measurable? To hire as exec or not?</a>
</li>
<li class="mb2">
  <a href="/reminiscing/">
    Reminiscing: the retreat to comforting work.</a>
</li>

You can find the link li.mb2 > a whose text matches the case-insensitive regex (?i)mailbag.*measure?, extract its href, and resolve it against the base URL https://example.com.

curl -s https://lethain.com/company-team-self/ |
    htmlq -c '(?i)mailbag.*measure?' -a href -b https://example.com 'li.mb2 > a'

https://example.com/mailbag-not-measurable-whether-hire-exec/
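
For readers who want to see the three steps spelled out (select, filter by regex, resolve the href), here is a rough Python equivalent of the pipeline above. It is only a sketch: the drafted -c option is not part of a released htmlq, the selector li.mb2 > a is hard-coded rather than parsed, urljoin stands in for whatever -b does internally, and LinkCollector and filter_links are hypothetical names.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

SAMPLE = """\
<li class="mb2">
  <a href="/work-hard-work-smart/">
    Work hard / work smart.</a>
</li>
<li class="mb2">
  <a href="/mailbag-not-measurable-whether-hire-exec/">
    Mailbag: What isn't measurable? To hire as exec or not?</a>
</li>
<li class="mb2">
  <a href="/reminiscing/">
    Reminiscing: the retreat to comforting work.</a>
</li>
"""


class LinkCollector(HTMLParser):
    """Collect (href, text) for each <a> directly inside <li class="mb2">."""

    def __init__(self):
        super().__init__()
        self.stack = []      # (tag, attrs) for each open element
        self.current = None  # [href, text parts] while inside a matching <a>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        parent = self.stack[-1] if self.stack else None
        if (tag == "a" and parent is not None and parent[0] == "li"
                and "mb2" in (parent[1].get("class") or "").split()):
            self.current = [attrs.get("href"), []]
        self.stack.append((tag, attrs))

    def handle_data(self, data):
        if self.current is not None:
            self.current[1].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            href, parts = self.current
            self.links.append((href, "".join(parts)))
            self.current = None
        # Pop back to the nearest matching open tag, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def filter_links(html, pattern, base):
    """Return absolute URLs of matching links whose text matches `pattern`."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    regex = re.compile(pattern)
    return [urljoin(base, href) for href, text in collector.links
            if href is not None and regex.search(text)]


print(filter_links(SAMPLE, r"(?i)mailbag.*measure?", "https://example.com"))
# prints ['https://example.com/mailbag-not-measurable-whether-hire-exec/']
```

The filter runs after selection, which is why it fits as a CLI option alongside --attribute and --base rather than requiring a change to the selector engine.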