mgdm / htmlq

Like jq, but for HTML.
MIT License

noscript #40

Open mcnesium opened 2 years ago

mcnesium commented 2 years ago

Trying to get a list of currently available Invidious instances, I started doing

curl -s https://redirect.invidious.io | htmlq "noscript"

which gave me a list of all the noscript elements on the page, including the one I was looking for:

<noscript><div class="instances-list"><h2>Available instances</h2><ul class="list"><li><a href="https://invidious.snopyta.org">invidious.snopyta.org</a></li><li><a href="https://yewtu.be">yewtu.be</a></li><li><a href="https://invidious.kavin.rocks">invidious.kavin.rocks</a></li><li><a href="https://invidious-us.kavin.rocks">invidious-us.kavin.rocks</a></li><li><a href="https://invidious-jp.kavin.rocks">invidious-jp.kavin.rocks</a></li><li><a href="https://vid.puffyan.us">vid.puffyan.us</a></li><li><a href="https://invidious.namazso.eu">invidious.namazso.eu</a></li><li><a href="https://inv.riverside.rocks">inv.riverside.rocks</a></li><li><a href="https://vid.mint.lgbt">vid.mint.lgbt</a></li><li><a href="https://invidious.osi.kr">invidious.osi.kr</a></li><li><a href="https://invidio.xamh.de">invidio.xamh.de</a></li><li><a href="https://yt.artemislena.eu">yt.artemislena.eu</a></li></ul></div></noscript>

But when I tried to dig deeper to only get the list of URLs, it only gave me empty results, no matter what I tried:

$~ curl -s https://redirect.invidious.io | htmlq "noscript a"
$~ curl -s https://redirect.invidious.io | htmlq "noscript li"
$~ curl -s https://redirect.invidious.io | htmlq "noscript ul"
$~ curl -s https://redirect.invidious.io | htmlq "noscript div"

Is this an issue with noscript in general or with that specific site? Why does it find what I am looking for in the first place?

Using htmlq 0.4.0 from AUR

mcnesium commented 2 years ago

I know I can do

curl -s https://api.invidious.io/instances.json | jq -r '.[][1].uri'

because that is where the page gets the data it shows outside the noscript element, but this might still be a valid issue.

eporama commented 2 years ago

I wonder if this is because htmlq uses servo/html5ever under the hood.

It looks like that code has a "scripting_enabled" option which defaults to true, and which makes the parser treat the content of noscript elements as raw text rather than as child elements:

https://github.com/servo/html5ever/blob/57eb334c0ffccc6f88d563419f0fbeef6ff5741c/html5ever/src/tree_builder/rules.rs#L118-L126

I couldn't see where to set that option to false, but as a complete hack of a workaround, you can do this for now:

$~ curl -s https://redirect.invidious.io | htmlq --text "noscript" | htmlq --attribute href .instances-list a
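If the two-pass trick is needed more than once, it could be wrapped in a small shell function. This is only a sketch: noscript_links is a made-up name, not part of htmlq, and it assumes htmlq behaves as described above, i.e. the first pass returns the noscript content as plain text and the second pass parses that text as HTML.

```shell
# Hypothetical helper (not part of htmlq): print the href of every link
# found inside <noscript> elements of the HTML read from stdin.
noscript_links() {
    # Pass 1: noscript content comes back as raw text, so --text yields the markup itself.
    # Pass 2: parse that markup again so ordinary selectors such as "a" can match.
    htmlq --text 'noscript' | htmlq --attribute href 'a'
}

# Usage, assuming htmlq is on PATH:
#   curl -s https://redirect.invidious.io | noscript_links
```

The second selector is deliberately just 'a'. To narrow the result to the instance list, as in the command above, it could be replaced with '.instances-list a'.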