golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

proposal: x/net/html: ParseOption to set maxBuf #68101

Open Jarcis-cy opened 2 weeks ago

Jarcis-cy commented 2 weeks ago

Proposal Details

Abstract

This proposal suggests adding a ParseOption that sets the tokenizer's maxBuf limit, so that callers of html.Parse and html.ParseWithOptions can control memory usage when parsing large HTML documents.

Background

Currently, html.Parse in golang.org/x/net/html calls ParseWithOptions internally, which gives the following call chain: html.Parse -> ParseWithOptions -> p.parse() -> p.tokenizer.Next() -> readByte(). readByte() contains this check:

if z.maxBuf > 0 && z.raw.end-z.raw.start >= z.maxBuf {
    z.err = ErrBufferExceeded
    return 0
}

This check runs only when maxBuf is greater than zero. Tokenizer.SetMaxBuf sets that value, but html.Parse and ParseWithOptions create their tokenizer internally and offer no way to call it.

Problem

When parsing very large HTML documents, such as this page, memory usage can increase significantly due to the inability to set MaxBuf.

Solution

To address this, I propose introducing a function similar to ParseOptionEnableScripting to allow users to set MaxBuf.

Implementation

A workaround using reflection is shown below. It works, but it relies on reflection and the unsafe package to reach unexported fields, which makes it unsuitable for production code:

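import (
    "reflect"
    "unsafe"

    "golang.org/x/net/html"
)

// ParseOptionSetMaxBuf builds an html.ParseOption through reflection, because
// the option's parameter type (*parser) is unexported and cannot be named here.
func ParseOptionSetMaxBuf(maxBuf int) html.ParseOption {
    funcValue := reflect.MakeFunc(
        // In(0) recovers the unexported *parser type from ParseOption's signature.
        reflect.FuncOf([]reflect.Type{reflect.TypeOf((*html.ParseOption)(nil)).Elem().In(0)}, nil, false),
        func(args []reflect.Value) (results []reflect.Value) {
            parserValue := args[0].Elem()
            tokenizerField := parserValue.FieldByName("tokenizer")
            // The field is unexported, so read it through its address instead of calling Interface() on it.
            tokenizerPtr := reflect.NewAt(tokenizerField.Type(), unsafe.Pointer(tokenizerField.UnsafeAddr())).Elem().Interface()
            if tokenizer, ok := tokenizerPtr.(interface { SetMaxBuf(int) }); ok {
                tokenizer.SetMaxBuf(maxBuf)
            }
            return nil
        },
    )
    // Store the generated func(*parser) value in a variable of type html.ParseOption.
    var option html.ParseOption
    reflect.ValueOf(&option).Elem().Set(funcValue)
    return option
}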

This implementation can be used as follows:

html.ParseWithOptions(bytes.NewReader(data), util.ParseOptionSetMaxBuf(len(data)*3))

To address the issue properly, I propose adding the following function to x/net/html:

func ParseOptionSetTokenizerMaxBuf(maxBuf int) ParseOption {
    return func(p *parser) {
        p.tokenizer.SetMaxBuf(maxBuf)
    }
}
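For readers unfamiliar with the option pattern involved, the following self-contained sketch shows how such an option would compose with the existing one. The tokenizer, parser, and newParser definitions are simplified stand-ins written for this example (the real types are unexported and have many more fields), so treat it as an illustration of the pattern rather than the package's code.

```go
package main

import "fmt"

// tokenizer and parser are hypothetical stand-ins for the unexported types
// in x/net/html. Only the fields needed for the illustration are modeled.
type tokenizer struct{ maxBuf int }

// SetMaxBuf mirrors the name of the real Tokenizer method.
func (z *tokenizer) SetMaxBuf(n int) { z.maxBuf = n }

type parser struct {
	tokenizer *tokenizer
	scripting bool
}

// ParseOption configures a parser before parsing starts.
type ParseOption func(*parser)

// ParseOptionEnableScripting models the existing option.
func ParseOptionEnableScripting(enable bool) ParseOption {
	return func(p *parser) { p.scripting = enable }
}

// ParseOptionSetTokenizerMaxBuf models the proposed option.
func ParseOptionSetTokenizerMaxBuf(maxBuf int) ParseOption {
	return func(p *parser) { p.tokenizer.SetMaxBuf(maxBuf) }
}

// newParser is a stand-in for the setup step of ParseWithOptions: it builds
// the parser with its defaults, then applies each option in order.
func newParser(opts ...ParseOption) *parser {
	p := &parser{tokenizer: &tokenizer{}, scripting: true}
	for _, opt := range opts {
		opt(p)
	}
	return p
}

func main() {
	p := newParser(
		ParseOptionEnableScripting(false),
		ParseOptionSetTokenizerMaxBuf(1<<20), // 1 MiB limit, chosen arbitrarily
	)
	fmt.Println(p.scripting, p.tokenizer.maxBuf)
}
```

Because the option is an ordinary closure over the parser, it needs no reflection and leaves the default (no limit) untouched for callers who do not pass it.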

In my testing, parsing worked normally once maxBuf was set to at least 1.04 times the body length.
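As a small worked example of that sizing rule, the sketch below scales a body length by 1.04 and rounds up. maxBufFor is a hypothetical helper name, and the 1.04 factor is only the empirical figure reported above, not a documented guarantee. Integer arithmetic is used to avoid floating-point rounding surprises.

```go
package main

import "fmt"

// maxBufFor returns bodyLen scaled by 104/100, rounded up to the next
// whole byte. The factor comes from the measurement reported in this
// proposal and is an assumption, not a documented requirement.
func maxBufFor(bodyLen int) int {
	return (bodyLen*104 + 99) / 100
}

func main() {
	for _, n := range []int{0, 1, 1000, 5 << 20} {
		fmt.Printf("body=%d bytes -> maxBuf=%d bytes\n", n, maxBufFor(n))
	}
}
```

A caller would pass the result to whichever mechanism ends up setting the limit, for example the option proposed above.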

Feasibility

Adding a function similar to ParseOptionEnableScripting to allow users to set MaxBuf would provide a safe and efficient way to control memory usage when parsing large HTML documents, avoiding the use of unsafe methods and reflection.

Environment

seankhliao commented 2 weeks ago

Related: #63177 to set the entire Tokenizer

ianlancetaylor commented 2 weeks ago

CC @neild @bradfitz