johnkerl / miller

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
https://miller.readthedocs.io

TSV cell-length limits #1088

Open · cmcfadden opened this issue 2 years ago

cmcfadden commented 2 years ago

It seems like there's a 65536 char limit on the size of a cell in a TSV file. Is there a technical / spec reason to cap it? (We get files from the Canvas learning management system with very large TSV cells)

cmcfadden commented 2 years ago

Ah I see - has to do with bufio scanner having a default max token length of 64k. Upping the buffer in record_reader.go seems to fix it.

// Raise both the buffer and the max token size from the 64 KiB default to 1 MiB.
bufferSize := 1024 * 1024
scannerBuffer := make([]byte, bufferSize)
scanner.Buffer(scannerBuffer, bufferSize)
johnkerl commented 1 year ago

@cmcfadden Great find, thanks!!!

Bummer about bufio's max token length :(

I think maybe the best thing is to expose a command-line option so people can override this at need ...

skochvi commented 9 months ago

Thank you @cmcfadden for this! This bug was really confusing me.

Ah I see - has to do with bufio scanner having a default max token length of 64k. Upping the buffer in record_reader.go seems to fix it.

bufferSize := 1024 * 1024
scannerBuffer := make([]byte, bufferSize)
scanner.Buffer(scannerBuffer, bufferSize)

I looked into the issue further. In order to fix the issue, you don't actually need to allocate a buffer whose size is greater than the Scanner's default buffer size.

From the bufio documentation: https://pkg.go.dev/bufio#Scanner.Buffer

Buffer sets the initial buffer to use when scanning and the maximum size of buffer that may be allocated during scanning. The maximum token size is the larger of max and cap(buf). If max <= cap(buf), Scan will use this buffer only and do no allocation.

By default, Scan uses an internal buffer and sets the maximum token size to MaxScanTokenSize.

In the Go standard library source (bufio/scan.go), the initial allocation size for a Scanner's buffer is set to 4096 bytes:

const (
    // MaxScanTokenSize is the maximum size used to buffer a token
    // unless the user provides an explicit buffer with Scanner.Buffer.
    // The actual maximum token size may be smaller as the buffer
    // may need to include, for instance, a newline.
    MaxScanTokenSize = 64 * 1024

    startBufSize = 4096 // Size of initial allocation for buffer.
)

startBufSize is used in the Scan function's allocation of the initial buffer:

newSize := len(s.buf) * 2
if newSize == 0 {
    newSize = startBufSize
}
if newSize > s.maxTokenSize {
    newSize = s.maxTokenSize
}
newBuf := make([]byte, newSize)

Unfortunately, startBufSize is unexported and could in theory change. Still, if you want to keep the default behavior with only the token-size limit raised, you can hardcode 4096 as the initial buffer size. You can then set the token-size limit as high as you want, and Scan will grow the buffer on demand (amortized over reads) up to that limit.

bufferSize := 4096            // mirrors bufio's unexported startBufSize
myMaxTokenSize := 1024 * 1024 // raise only the token-size ceiling
scannerBuffer := make([]byte, bufferSize)
scanner.Buffer(scannerBuffer, myMaxTokenSize)