jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

Unbound prefixes not handled #1341

Open SimonSchmid opened 4 years ago

SimonSchmid commented 4 years ago

Hello, I want to report an issue I am having with jsoup. I have not found a similar issue, so I am creating a new one.

I created a toy example that illustrates the issue:

<!doctype html>
<html lang="de">
    <head>

    </head>
    <body>
    <test:h1>UnboundPrefix</test:h1>
    <svg width="180" height="180" xlink:href="UnboundPrefix">
            <rect x="20" y="20" rx="20" ry="20" width="100" height="100" style="fill:lightgray; stroke:#1c87c9; stroke-width:4;"/>
        </svg>
    </body>
</html>

This page contains two unbound prefixes: one in a tag name and one in an attribute name. jsoup does not handle these as described in https://html.spec.whatwg.org/#creating-and-inserting-nodes and https://html.spec.whatwg.org/#coercing-an-html-dom-into-an-infoset. According to the spec, the first case (tag) should be handled as follows: <test:h1> becomes <testU00003Ah1>. The second case is handled by adding the xlink namespace to the html tag.
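To make the first case concrete, here is a minimal sketch of the name coercion, assuming the rule is "replace each disallowed character with U followed by its six-digit uppercase hex code point". The class name, the helper `coerceLocalName`, and the simplified `isAllowed` check are hypothetical and are not part of jsoup. A real implementation would follow the XML name productions exactly, including the stricter rules for the first character.

```java
class InfosetNameCoercion {

    // Simplified assumption: letters, digits, '-', '_' and '.' are kept as-is.
    // Everything else (including ':') is treated as not allowed in a local name.
    static boolean isAllowed(int codePoint) {
        return Character.isLetterOrDigit(codePoint)
                || codePoint == '-' || codePoint == '_' || codePoint == '.';
    }

    // Replaces each disallowed character with 'U' plus six uppercase hex digits.
    static String coerceLocalName(String name) {
        StringBuilder out = new StringBuilder();
        name.codePoints().forEach(cp -> {
            if (isAllowed(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append('U').append(String.format("%06X", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(coerceLocalName("test:h1"));    // testU00003Ah1
        System.out.println(coerceLocalName("xlink:href")); // xlinkU00003Ahref
    }
}
```

Names that contain no disallowed characters, such as `h1`, pass through unchanged.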

Because the unbound prefixes are left as they are, I run into problems when using XPath. It would be nice if jsoup handled such cases.
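One way such problems can surface, shown here only as an assumption about the setup, is when the markup reaches a namespace-aware XML parser. The sketch below uses only the JDK's `javax.xml` APIs, not jsoup, and the class and method names are hypothetical. It shows that a namespace-aware parser rejects the unbound `test:` prefix, while the coerced name parses and can be selected with XPath.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class UnboundPrefixDemo {

    // Parses with namespace awareness; returns null if the input is rejected.
    static Document parseOrNull(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            return null; // e.g. an element prefix that is not bound to a namespace
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates an XPath expression against the parsed input, as a string.
    static String evaluate(String xml, String xpath) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(xpath, parseOrNull(xml));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String unbound = "<body><test:h1>UnboundPrefix</test:h1></body>";
        String coerced = "<body><testU00003Ah1>UnboundPrefix</testU00003Ah1></body>";

        System.out.println(parseOrNull(unbound) == null); // true: the parser rejects it
        System.out.println(evaluate(coerced, "//testU00003Ah1"));
    }
}
```

The JDK parser may also log a fatal error to standard error for the first input. That comes from the parser's default error handling, not from this code.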

Regards, Simon

SimonSchmid commented 4 years ago

Is this something that will be addressed anytime soon?

lexamxu commented 3 years ago

Hi, we are a student group and we would like to fix this bug. We can't guarantee that we will be able to fix it, but we would like to try.

duanyang25 commented 2 years ago

Hi @SimonSchmid. I am an undergraduate student. One of my Software Engineering courses this semester requires us to fix issues on GitHub.

I understand the first case, but I am confused by the second case, "one within an attribute". May I ask what the expected output is for the second case? Could you explain a little more about "The second case is handled by adding the xlink namespace to the html tag."? Thank you very much.

As I understand it, the second case is xlink:href="UnboundPrefix". So you want to access the value UnboundPrefix by the name xlink:href, right?

I am currently working on converting : to its Unicode escape so that jsoup can return the name containing it in the first case. But I may need more information about the second case.

I now understand what you want for the second case from the link you provided (https://html.spec.whatwg.org/#coercing-an-html-dom-into-an-infoset). You may want to look up the attribute by the key "xlinkU00003Ahref" rather than "xlink:href". Please take a look at PR #1682.

jhy commented 1 month ago

Hi @SimonSchmid, sorry for the late reply on this. Can you give more detail, or an example, of what you want to do with the xpath selector and how you're interacting with it? I want to make sure I understand the use case correctly.

In #1801 we disabled the namespace for elements when running through the xpath selector, for general convenience.

So e.g. el.selectXpath("//h1") finds the first example. See example.