microcosm-cc / bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
https://github.com/microcosm-cc/bluemonday
BSD 3-Clause "New" or "Revised" License
3.12k stars 176 forks source link

SanitizeReaderToWriter is REALLY slow #134

Closed natefinch closed 2 years ago

natefinch commented 2 years ago

I haven't looked into this too deeply, but I went straight for this method because I assumed it might be better than reading the whole HTML body into memory first. All I did was

func main() {
    in, err := os.Open("input.html")
    if err != nil {
        panic(err)
    }
    defer in.Close()

    out, err := os.Create("output.html")
    if err != nil {
        panic(err)
    }
    defer out.Close()

    p := bluemonday.UGCPolicy()
    err = p.SanitizeReaderToWriter(in, out)
    if err != nil {
        panic(err)
    }
}

The difference between the above and reading the input into memory and writing the output to a buffer, and then writing the buffer to disk is HUGE. For a large (6mb) HTML file on my 2021 Macbook, SanitizeReaderToWriter took 4.5s, and Sanitize took 0.15s.

I haven't looked too far into this, and I get there could be some I/O buffering issue with reading and writing directly to disk, but even then, the fact that it's 30x slower seems really weird.

PaperPrototype commented 2 years ago

yeah...

buro9 commented 2 years ago

I'm confused by this because it's all the same under the hood:

func (p *Policy) sanitize(r io.Reader, w io.Writer) error {

SanitizeReaderToWriter is in fact the only one that goes directly: https://github.com/microcosm-cc/bluemonday/blob/main/sanitize.go#L68-L96

I found this puzzling, so I've added a test:

: go test -run TestIssue134 -v
=== RUN   TestIssue134
=== RUN   TestIssue134/Sanitize
=== RUN   TestIssue134/SanitizeReader
=== RUN   TestIssue134/SanitizeBytes
=== RUN   TestIssue134/SanitizeReaderToWriter
--- PASS: TestIssue134 (0.00s)
    --- PASS: TestIssue134/Sanitize (0.00s)
    --- PASS: TestIssue134/SanitizeReader (0.00s)
    --- PASS: TestIssue134/SanitizeBytes (0.00s)
    --- PASS: TestIssue134/SanitizeReaderToWriter (0.00s)
PASS
ok      github.com/microcosm-cc/bluemonday      0.002s

Please re-open if you can reproduce the test showing the method is slower than others.