@russorat the default buffer size for a `bufio.Scanner` is 64KB. An appropriately sized buffer can be passed to `bufio.Scanner`, which appears to happen in the code here:
However, I think by default the `Batcher` is using the package default defined here, which is 500KB:
So I'm guessing one needs to initialise a `Batcher` with `MaxFlushBytes` set to something appropriate for your data:
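For illustration, a minimal sketch of that initialisation, assuming `MaxFlushBytes` is an exported field on `write.Batcher` (the 256KB value is purely illustrative and should be tuned for your data):

```go
package main

import (
	"fmt"

	"github.com/influxdata/influxdb/v2/write"
)

func main() {
	// Hypothetical sizing: MaxFlushBytes bounds how much line protocol is
	// buffered before the Batcher flushes a batch to the server.
	b := write.Batcher{
		MaxFlushBytes: 256 * 1024, // illustrative; tune for your data
	}
	fmt.Printf("%+v\n", b)
}
```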
Just a note - the solution to this must not be to allocate the user's batch size as a single heap allocation - that's a pathway to crashing with an OOM.
I believe the appropriate fix is to improve the `influx` tool, to send batches to `influxd`. So, based on the following command invocation:
```sh
$ influx query -c tools 'from(bucket: "apps") |> range(start: -1m)' -r \
  | influx write -c default --format csv -b apps
```
I propose the fix is to improve:

```sh
influx write -c default --format csv -b apps
```

such that as it receives data from `STDIN`, it sends appropriately sized batches to `influxd`.
@rbetts this might be a case where the swagger-generated client can read the limit annotations and adjust the batch size accordingly.
@stuartcarnie Yeah - if the client isn't chunking then it needs to. Seems like a place to start investigating.
Also need to improve the server's error message when the batch exceeds the internal limit.
The client is supposed to be batching writes, so the investigation will continue as to why it is generating the above error.
If one of the field values is larger than the limit, then presumably it's not possible to split the write? Maybe that's what's going on.
Issue moved to OSS 2.0 GA
FYI, `Scanner` can take a buffer with a pre-allocated size and a max size, allowing you to pick a reasonably small size and have it grow to a set max size, e.g.:
```go
scanner := bufio.NewScanner(r)
// Start small; Scanner grows the buffer as needed, up to MaxBytes.
scanBuf := make([]byte, 4096)
scanner.Buffer(scanBuf, MaxBytes) // MaxBytes: caller-chosen upper bound on token size
```
Dug in to this a bit more and batching should be working. I believe the output of

```sh
$ influx query -c tools 'from(bucket: "apps") |> range(start: -1m)' -r
```

is producing lines which exceed the scanner's default of 64KB, or is not producing line-delimited text at all. That needs to be confirmed next. If 64KB is a normal length for a single line of CSV from the `influx query` command, we will need to increase the size of the `scanner.Buffer` as @ssoroka and others have stated.
Some facts to support my hypothesis and the source of the `bufio.Scanner: token too long.` error:
`bufio.SplitFunc`:
https://github.com/golang/go/blob/4c4a376736fff47b08ab6053605c3b68d87552b5/src/bufio/scan.go#L279

`Batcher` sets it to `ScanLines`, meaning a token is a single line.

NOTE: this call is redundant; per the documentation, the default split function for `bufio.NewScanner` is `ScanLines`.
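A quick standalone illustration of that redundancy (generic `bufio` usage, not the actual `Batcher` code):

```go
package main

import (
	"bufio"
	"strings"
)

func main() {
	sc := bufio.NewScanner(strings.NewReader("a\nb\n"))
	sc.Split(bufio.ScanLines) // redundant: ScanLines is already the Scanner default
	for sc.Scan() {
		_ = sc.Text() // each token is one line
	}
}
```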
`bufio.Scanner: token too long.` comes from:
https://github.com/golang/go/blob/4c4a376736fff47b08ab6053605c3b68d87552b5/src/bufio/scan.go#L194

It is returned when `buf` exceeds `maxTokenSize` or half the magnitude of `maxInt`, which is very large on a 64-bit system. The condition is likely hitting `maxTokenSize`, which is 64KB by default:
https://github.com/golang/go/blob/4c4a376736fff47b08ab6053605c3b68d87552b5/src/bufio/scan.go#L80
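For reference, a minimal standalone reproduction of that error path, using the exported `bufio.MaxScanTokenSize` constant (64KB):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	// One line longer than the default max token size, with no newline in range.
	long := strings.Repeat("x", bufio.MaxScanTokenSize+1)
	sc := bufio.NewScanner(strings.NewReader(long))
	for sc.Scan() {
		// never reached: the first token already exceeds maxTokenSize
	}
	fmt.Println(sc.Err()) // bufio.Scanner: token too long
}
```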
The `influx write` command is receiving excessively long lines, or text which is not properly line-delimited.

Is there a limit on the length of a line protocol record defined anywhere?
Turns out the panic is due to a bug in our `LineReader`, which accesses an array out of bounds, per the following panic:
```
github.com/influxdata/influxdb/v2/pkg/csv2lp.(*LineReader).Read(0xc000537380, 0xc000345392, 0xc6e, 0xc6e, 0x392, 0x0, 0x0)
	/Users/stuartcarnie/projects/go/influxdata/influxdbv2/pkg/csv2lp/line_reader.go:87 +0x24d
bufio.(*Reader).fill(0xc0005373e0)
	/Users/stuartcarnie/projects/go/google/golang/src/bufio/bufio.go:101 +0x105
bufio.(*Reader).ReadSlice(0xc0005373e0, 0x10000000000000a, 0x7c11558, 0x0, 0x69, 0x6453368, 0xc0001340e0)
	/Users/stuartcarnie/projects/go/google/golang/src/bufio/bufio.go:360 +0x3d
encoding/csv.(*Reader).readLine(0xc0003e6900, 0xc000495000, 0x69, 0xc0001340e0, 0x69, 0x0)
	/Users/stuartcarnie/projects/go/google/golang/src/encoding/csv/reader.go:218 +0x49
encoding/csv.(*Reader).readRecord(0xc0003e6900, 0xc000103800, 0x11, 0x17, 0xc000103800, 0x11, 0x17, 0x0, 0x0)
	/Users/stuartcarnie/projects/go/google/golang/src/encoding/csv/reader.go:266 +0xf2
encoding/csv.(*Reader).Read(0xc0003e6900, 0xc000103800, 0x11, 0x17, 0x0, 0x0)
	/Users/stuartcarnie/projects/go/google/golang/src/encoding/csv/reader.go:187 +0x57
github.com/influxdata/influxdb/v2/pkg/csv2lp.(*CsvToLineReader).Read(0xc00032eb40, 0xc000346761, 0x89f, 0x89f, 0x10, 0x1, 0x1)
	/Users/stuartcarnie/projects/go/influxdata/influxdbv2/pkg/csv2lp/csv2lp.go:105 +0x87
bufio.(*Scanner).Scan(0xc000727f08, 0xc000727e98)
	/Users/stuartcarnie/projects/go/google/golang/src/bufio/scan.go:214 +0xa9
github.com/influxdata/influxdb/v2/write.(*Batcher).read(0xc000118980, 0x52e7ca0, 0xc000498500, 0x52c8840, 0xc00032eb40, 0xc000406240, 0xc000537440)
	/Users/stuartcarnie/projects/go/influxdata/influxdbv2/write/batcher.go:69 +0xf0
created by github.com/influxdata/influxdb/v2/write.(*Batcher).Write
	/Users/stuartcarnie/projects/go/influxdata/influxdbv2/write/batcher.go:44 +0x1c7
```
The error occurs on the following line, because `i == len(p) - 1`:
Looks like the actual bug is here, where `n` is assigned to, but the rest of the code is assuming that `n` is the same as `len(p)`.
I'd be inclined to avoid using `n` and just use `len(p)` when appropriate - it's no more efficient to assign it to a local variable.
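As a generic illustration of the hazard (hypothetical code, not the actual `LineReader` implementation): after `n, err := src.Read(p)`, only `p[:n]` holds valid data, so mixing `n` and `len(p)` as loop bounds, or looking ahead with `p[i+1]` without a guard, is exactly the kind of thing that produces this panic:

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// countLines demonstrates the bounds discipline discussed above: src.Read
// may fill fewer than len(p) bytes, so indexing must be bounded by n, and
// any look-ahead such as p[i+1] needs its own explicit check.
func countLines(src io.Reader, p []byte) (int, error) {
	lines := 0
	for {
		n, err := src.Read(p)
		for i := 0; i < n; i++ {
			if p[i] == '\n' {
				lines++
			}
			if i+1 < n && p[i] == '\r' && p[i+1] == '\n' {
				// CRLF handling would go here; the i+1 < n guard
				// prevents exactly the kind of out-of-range panic above.
			}
		}
		if err == io.EOF {
			return lines, nil
		}
		if err != nil {
			return lines, err
		}
	}
}

func main() {
	lines, _ := countLines(strings.NewReader("a\nb\r\nc\n"), make([]byte, 8))
	fmt.Println(lines) // 3
}
```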
@rogpeppe agree – that is exactly what I have done. Implemented a test which exhibits the out of bounds issue and have a fix pending.
How is that possible? That file is only 9 days old, but this issue is 11 days old: https://github.com/influxdata/influxdb/commits/master/pkg/csv2lp/line_reader.go
Well then! That means in my testing, I uncovered a new bug…
I did replicate and fix both that panic and the `bufio.Scanner: token too long.` error.
I'm very curious to see this PR :)
I'm seeing the error:

```
Error: Failed to write data: bufio.Scanner: token too long.
```

when I try to write CSV data that is ~164MB to InfluxDB. Based on https://stackoverflow.com/questions/21124327/how-to-read-a-text-file-line-by-line-in-go-when-some-lines-are-long-enough-to-ca, it looks like we might need to evaluate whether we should switch to `bufio.Reader`: https://github.com/influxdata/influxdb/blob/f09ee881fb33ca06474ca499deac0c5cd1bd9f91/write/batcher.go#L67

I can send the file offline.
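For reference, a minimal sketch of the `bufio.Reader` alternative mentioned above: `ReadString` has no fixed token limit, so it can return arbitrarily long lines (memory permitting), unlike `Scanner` with its default 64KB cap:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

func main() {
	r := bufio.NewReader(os.Stdin)
	for {
		// ReadString grows its result as needed; a long line is returned
		// whole rather than failing with "token too long".
		line, err := r.ReadString('\n')
		if len(line) > 0 {
			fmt.Print(line) // process one (possibly very long) line
		}
		if err == io.EOF {
			return
		}
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
	}
}
```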