microcosm-cc / bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
https://github.com/microcosm-cc/bluemonday
BSD 3-Clause "New" or "Revised" License
3.2k stars 175 forks source link

Inline Images get stripped #51

Closed mstaack closed 6 years ago

mstaack commented 6 years ago

Using this content:

<img src="data:image/gif;base64,R0lGODdhEAAQAMwAAPj7+FmhUYjNfGuxYY
    DJdYTIeanOpT+DOTuANXi/bGOrWj6CONzv2sPjv2CmV1unU4zPgISg6DJnJ3ImTh8Mtbs00aNP1CZSGy0YqLEn47RgXW8amasW
    7XWsmmvX2iuXiwAAAAAEAAQAAAFVyAgjmRpnihqGCkpDQPbGkNUOFk6DZqgHCNGg2T4QAQBoIiRSAwBE4VA4FACKgkB5NGReAS
    FZEmxsQ0whPDi9BiACYQAInXhwOUtgCUQoORFCGt/g4QAIQA7">

and this policy:

    // Define a policy, we are using the UGC policy as a base.
    p := bluemonday.UGCPolicy()

    // Allow images to be embedded via data-uri
    p.AllowDataURIImages()

i get an empty string back.... whats wrong with my policy?

cheers max

grafana-dee commented 6 years ago

There is nothing wrong with the policy, but the image data is truncated or incomplete.

The data itself is parsed as base64 using https://github.com/microcosm-cc/bluemonday/blob/master/helpers.go#L223-L226

But the input is not valid https://play.golang.org/p/-eq8Xsr8Qy :

package main

import (
    "encoding/base64"
    "fmt"
)

func main() {
    s := `R0lGODdhEAAQAMwAAPj7+FmhUYjNfGuxYYDJdYTIeanOpT+DOTuANXi/bGOrWj6CONzv2sPjv2CmV1unU4zPgISg6DJnJ3ImTh8Mtbs00aNP1CZSGy0YqLEn47RgXW8amasW7XWsmmvX2iuXiwAAAAAEAAQAAAFVyAgjmRpnihqGCkpDQPbGkNUOFk6DZqgHCNGg2T4QAQBoIiRSAwBE4VA4FACKgkB5NGReASFZEmxsQ0whPDi9BiACYQAInXhwOUtgCUQoORFCGt/g4QAIQA7`

    if _, err := base64.StdEncoding.DecodeString(s); err != nil {
        // this one fires
        fmt.Println(err.Error())
    } else {
        fmt.Println("parsed as base64")
    }
}

Were you to provide a valid base64 input, it would work (following example based on the unit test):

package main

import (
    "fmt"

    "github.com/microcosm-cc/bluemonday"
)

func main() {
    p := bluemonday.NewPolicy()
    p.AllowImages()
    p.AllowDataURIImages()

    s := `<img src="data:image/gif;base64,R0lGODlhEAAQAMQAAORHHOVSKudfOulrSOp3WOyDZu6QdvCchPGolfO0o/XBs/fNwfjZ0frl3/zy7////wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACH5BAkAABAALAAAAAAQABAAAAVVICSOZGlCQAosJ6mu7fiyZeKqNKToQGDsM8hBADgUXoGAiqhSvp5QAnQKGIgUhwFUYLCVDFCrKUE1lBavAViFIDlTImbKC5Gm2hB0SlBCBMQiB0UjIQA7">`

    // this returns the image
    fmt.Printf("'%s'\n", p.Sanitize(s))
}
mstaack commented 6 years ago

@buro9 thanks for you feedback, will check base64 on my side ;)

mstaack commented 6 years ago

@buro9 weird this is my base64 img code: https://play.golang.org/p/O0j5Dmlvmp so its valid.

but my html response has an empty image tag: <img alt="">

this is my go code:

    // Define a policy, we are using the UGC policy as a base.
    p := bluemonday.UGCPolicy()

    // HTML email is often displayed in iframes and needs to preserve core
    // structure
    p.AllowDocType(true)

    // HTML email frequently contains obselete and basic HTML
    p.AllowElements("html", "head", "body", "label", "input", "font", "main", "nav", "header", "footer", "kbd", "legend", "map", "title")

    p.AllowAttrs("type").Matching(StyleType).OnElements("style")
    p.AllowAttrs("style").Globally()

    // Custom attributes on elements
    p.AllowAttrs("type", "media").OnElements("style")
    p.AllowAttrs("face", "size").OnElements("font")
    p.AllowAttrs("name", "content", "http-equiv").OnElements("meta")
    p.AllowAttrs("name", "data-id").OnElements("a")
    p.AllowAttrs("for").OnElements("label")
    p.AllowAttrs("type").OnElements("input")
    p.AllowAttrs("rel", "href").OnElements("link")
    p.AllowAttrs("topmargin", "leftmargin", "marginwidth", "marginheight", "yahoo").OnElements("body")
    p.AllowAttrs("xmlns").OnElements("html")

    p.AllowAttrs("style", "vspace", "hspace", "face", "bgcolor", "color", "border", "cellpadding", "cellspacing").Globally()

    // HTML email tends to see the use of obselete spacing and styling attributes
    p.AllowAttrs("bgcolor", "color", "align").OnElements("basefont", "font", "hr", "table", "td")
    p.AllowAttrs("border").OnElements("img", "table", "basefont", "font", "hr", "td")
    p.AllowAttrs("cellpadding", "cellspacing", "valign", "halign").OnElements("table")

    // Allow "class" attributes on all elements
    p.AllowStyling()

    // Allow images to be embedded via data-uri
    p.AllowDataURIImages()

    // Add "rel=nofollow" to links
    p.RequireNoFollowOnLinks(true)
    p.RequireNoFollowOnFullyQualifiedLinks(true)

    // Open external links in a new window/tab
    p.AddTargetBlankToFullyQualifiedLinks(true)

    // Read input from stdin so that this is a nice unix utility and can receive
    // piped input
    dirty, err := ioutil.ReadAll(os.Stdin)
    if err != nil {
        log.Fatal(err)
    }

    // Apply the policy and write to stdout
    fmt.Fprint(
        os.Stdout,
        p.Sanitize(
            string(dirty),
        ),
    )
grafana-dee commented 6 years ago

Hmmm, if I take your image example from the linked Go Playground... it works for me. When I use the base64 input within a src="" if does not work. If I remove the line breaks and whitespace... it works for me.

Do you have whitespace or line breaks in your input data URL content?

mstaack commented 6 years ago

ahh yeah...okay image

mstaack commented 6 years ago

but i cant just replace all \n to an empty string

mstaack commented 6 years ago

hmm okay fixed the base64 part... image

but this still gives an empty image tag... weird

grafana-dee commented 6 years ago

This is a valid bug... I'll fix it :)

Line feeds in data URI attributes in HTML5 are valid, so I should be stripping those out beforehand.

mstaack commented 6 years ago

@buro9 your work is awesome! will recheck when your done. thanks!

grafana-dee commented 6 years ago

Thanks... do you have a test file btw... with the full input you are using (for this part of the file)... to avoid me copying it from a screenshot :)

mstaack commented 6 years ago

https://gist.github.com/mstaack/e677dc7216c9cb3d863d23ab00c37bd3

mstaack commented 6 years ago

and one with newline: https://gist.github.com/mstaack/599a28114ceecc45e8f5f6d93f7b7d71

mstaack commented 6 years ago

both should work

grafana-dee commented 6 years ago

These would still return empty string because image/image/png is not a valid content-type.

But assuming image/png is used I'll make the new line one work too.

mstaack commented 6 years ago

yeah you are totally right... will fix that on my side ;)