mgdm / htmlq

Like jq, but for HTML.
MIT License
7k stars 107 forks

Add option for converting relative href to absolute. #13

Closed Chaz6 closed 2 years ago

Chaz6 commented 2 years ago

In the example curl -s https://www.rust-lang.org/ | htmlq -a href a the links are output as-is, for example, /policies. In order to use this with other tools, it would be useful to make these links absolute. For example, curl -s https://www.rust-lang.org/ | htmlq -u https://www.rust-lang.org/ -a href a would result in https://www.rust-lang.org/policies (i.e. any relative href attributes are converted to absolute using the base URL specified with -u).
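To make the requested behaviour concrete, here is a minimal sketch in Python rather than in htmlq itself. It uses the standard library's urljoin to resolve an href against a base URL; the helper name absolutize is hypothetical and is not part of htmlq.

```python
from urllib.parse import urljoin


def absolutize(href: str, base: str) -> str:
    """Resolve href against base (hypothetical helper, not htmlq code).

    Absolute hrefs are returned unchanged; relative ones are joined
    onto the base URL following the usual URL resolution rules.
    """
    return urljoin(base, href)


print(absolutize("/policies", "https://www.rust-lang.org/"))
# prints https://www.rust-lang.org/policies
```

An href that is already absolute, such as https://example.com/x, passes through unchanged, so the conversion is safe to apply to every link in the output.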

lordmauve commented 2 years ago

Note that this also has to honour the <base href=""> element if present.

marceloboeira commented 2 years ago

It would be interesting to check whether the HTML parser/manipulation library already offers this as an option. If it does, we could just pass the option through to the parser; if not, we would have to build around it.

https://github.com/servo/html5ever

mgdm commented 2 years ago

I've added a PR from the base_urls branch that does this, if you'd like to try it out.

htmlq --base https://example.org will rewrite relative URLs according to that URL. htmlq --detect-base will try to find a base URL from the <base> element in the document; if none is found, URLs are not rewritten.

If you specify both, it will default to the base in the document, and fall back to the one supplied for --base if not found.
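The precedence described above can be sketched as follows. This is an illustrative Python model of the described behaviour, not the actual Rust code on the base_urls branch; the function and parameter names (pick_base, rewrite, document_base, supplied_base, detect_base) are assumptions made for the example.

```python
from typing import Optional
from urllib.parse import urljoin


def pick_base(document_base: Optional[str],
              supplied_base: Optional[str],
              detect_base: bool) -> Optional[str]:
    """Choose the base URL according to the precedence described above.

    document_base: href of the document's <base> element, if any.
    supplied_base: value given to --base, if any.
    detect_base:   whether --detect-base was passed.
    """
    if detect_base and document_base:
        # With --detect-base, the base found in the document wins.
        return document_base
    # Otherwise fall back to --base, which may itself be absent.
    return supplied_base


def rewrite(href: str,
            document_base: Optional[str] = None,
            supplied_base: Optional[str] = None,
            detect_base: bool = False) -> str:
    """Rewrite href against the chosen base, or leave it untouched."""
    base = pick_base(document_base, supplied_base, detect_base)
    if base is None:
        # No usable base URL: don't rewrite.
        return href
    return urljoin(base, href)
```

For example, rewrite("/a", document_base="https://doc.example/", supplied_base="https://example.org", detect_base=True) resolves against the document's base, while the same call with document_base=None falls back to https://example.org.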