microcosm-cc / bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
https://github.com/microcosm-cc/bluemonday
BSD 3-Clause "New" or "Revised" License
3.04k stars 178 forks source link

<a> tags in tables not matched correctly #201

Open matloob opened 7 months ago

matloob commented 7 months ago

Here's a program I ran: https://go.dev/play/p/iT5Rcoe1XRz copied below


import (
    "fmt"

    "github.com/microcosm-cc/bluemonday"
)

func main() {
    text := `
<a name="foo">
<table>
<tr>
<td><a href="page">text</a></td>
</tr>
</table>
</a>
`

    p := bluemonday.UGCPolicy()
    s := string(p.SanitizeBytes([]byte(text)))
    fmt.Println(s)
}

The output is


<table>
<tr>
<td><a href="page" rel="nofollow">text</td>
</tr>
</table>
</a>

But I would expect the output to be


<table>
<tr>
<td><a href="page" rel="nofollow">text</a></td>
</tr>
</table>

That is, the closing a tag in the td should be matched to the starting a tag in the td, but instead, the ending a tag in the text is matched to it.

buro9 commented 7 months ago

this doesn't feel like it is valid nor should work.

the a element is not allowed to be a descendent of another a element, and that is what you've done by wrapping the table in the outer <a name="foo"> (which btw is also deprecated/obsolete and you should use id instead of name).

by nesting the anchors you've broken the little state machine that keeps track of the tags opened and closed, and the last open anchor is preserved, as is the last close anchor.

to me, whilst yes it breaks the state machine, this isn't valid input... and bluemonday is GIGO (garbage in, garbage out) and so I wouldn't look to fix this.

that said, if the error can be reproduced on valid input then it would be a problem that I should fix.

I used https://validator.w3.org/nu/#textarea to validate the input as I did feel that nesting anchors shouldn't work (in addition to wrapping a table with an anchor)... as the only used of a name or id anchor is to navigate to the top of it, why not just close that immediately?