Closed. przmv closed this 8 years ago.
Explain how it helps plz. Why it is better?
When a log file line is longer than the bufio.Scanner buffer (MaxScanTokenSize = 64 * 1024), it returns ErrTooLong. That means you can't process lines longer than 64KB with bufio.Scanner.
With bufio.Reader you can check whether the buffer contains the whole line; if it doesn't, you can do one more buffered read. So this solution is better because it allows processing files with long (> 64KB) lines.
Oh man... logs with over 64Kb lines long :/ What are you parsing?
Could you make a bench on Scanner vs Reader? Just to be sure we have no valuable speed degradation on normal size logs.
There's no speed difference between Scanner and Reader:
$ make bench
go test -bench .
..................................
34 total assertions
..............
48 total assertions
.........
57 total assertions
.....
62 total assertions
........................................
102 total assertions
PASS
BenchmarkParseSimpleLogRecord-4 200000 6291 ns/op
BenchmarkParseLogRecord-4 50000 30310 ns/op
BenchmarkScannerReader-4 2000000000 0.00 ns/op
BenchmarkReaderReader-4 2000000000 0.00 ns/op
ok github.com/pshevtsov/gonx 3.275s
Oh man... logs with over 64Kb lines long :/ What are you parsing?
Yup. It's not the ordinary log file but some huge TSV dump with a few such long lines.
The benchmark tests look strange. First of all, a bench test should measure one single operation (reading a line, in our case), but you read the whole file. How many lines are in this file? And at the end, the results are a little confusing... 0 ns/op?
How many lines in this file?
It reads this file. But all right, I'm going to redo the test cases to read just one line.
And at the end, results are a little confusing... 0 ns/op?
Yes, I also found it a bit weird. Probably that's because the whole file was loaded into memory. I'm going to check.
Oh, I definitely need more sleep (or more :coffee: )
I've just corrected the benchmarks:
BenchmarkScannerReader-4 500000 2040 ns/op
BenchmarkReaderReader-4 500000 2206 ns/op
So reading a line takes about 2.0 µs with Scanner and 2.2 µs with Reader, i.e. Reader is roughly 10% slower on reading.
$ go test -bench Reader -benchmem -benchtime 1m
..................................
34 total assertions
..............
48 total assertions
.........
57 total assertions
.....
62 total assertions
........................................
102 total assertions
PASS
BenchmarkScannerReader-4 50000000 2296 ns/op 4096 B/op 1 allocs/op
BenchmarkReaderReaderAppend-4 50000000 2625 ns/op 4096 B/op 1 allocs/op
BenchmarkReaderReaderBuffer-4 30000000 2818 ns/op 4208 B/op 2 allocs/op
ok github.com/pshevtsov/gonx 392.334s
Good job! Do you want to add something or we can merge it?
Please hold on for some time. I'm working on some improvements (e.g. no need for a loop when reading short lines) so both techniques will work the same for the most common use cases (i.e. reading short lines). Also I'd like to test which technique (append or bytes.Buffer) will work better for long lines.
I'll let you know soon. Cheers
Hey,
It turns out that for reading long lines, using bytes.Buffer is better:
BenchmarkScannerReader-4 500000 2679 ns/op 4305 B/op 5 allocs/op
BenchmarkReaderReaderAppend-4 500000 2650 ns/op 4305 B/op 5 allocs/op
BenchmarkReaderReaderBuffer-4 500000 2640 ns/op 4305 B/op 5 allocs/op
BenchmarkLongReaderReaderAppend-4 500 2588516 ns/op 5666680 B/op 29 allocs/op
BenchmarkLongReaderReaderBuffer-4 1000 2276376 ns/op 4076730 B/op 20 allocs/op
Hey @satyrius when are you going to merge this PR? I have a blocking issue because of inability to read long lines in the application I'm currently working on. Thanks!
Well done! Going to master.
This PR introduces the use of bufio.Reader (instead of bufio.Scanner) for processing large lines.