Use bufio.Reader for large lines processing

przmv commented 8 years ago

This PR introduces using of bufio.Reader (instead of bufio.Scanner) for large lines processing.

satyrius commented 8 years ago

Explain how it helps plz. Why it is better?

przmv commented 8 years ago

When the log file line is longer than bufio.Scanner buffer (MaxScanTokenSize = 64 * 1024) it returns ErrTooLong. That means, that you can't process lines longer than 64KB with bufio.Scanner.

With bufio.Reader you can check if the buffer contains the whole line, if not you can't do one more buffered read.

So, this solution is better because it allows to process files with long (> 64KB) lines.

satyrius commented 8 years ago

Oh man... logs with over 64Kb lines long :/ What are you parsing?

Could you make a bench on Scanner vs Reader? Just to be sure we have no valuable speed degradation on normal size logs.

przmv commented 8 years ago

There's no speed difference between Scanner and Reader:

$ make bench 
go test -bench .
..................................
34 total assertions

..............
48 total assertions

.........
57 total assertions

.....
62 total assertions

........................................
102 total assertions

PASS
BenchmarkParseSimpleLogRecord-4   200000          6291 ns/op
BenchmarkParseLogRecord-4          50000         30310 ns/op
BenchmarkScannerReader-4        2000000000           0.00 ns/op
BenchmarkReaderReader-4         2000000000           0.00 ns/op
ok      github.com/pshevtsov/gonx   3.275s

przmv commented 8 years ago

Oh man... logs with over 64Kb lines long :/ What are you parsing?

Yup. It's not the ordinary log file but some huge TSV dump with a few such long lines.

satyrius commented 8 years ago

Benchmark tests looks strange. First of all bench tests should measure one single operation (read line, in our case), but you read whole file. How mane lines in this file? And at the end, results are a little confusing... 0 ns/op?

przmv commented 8 years ago

How many lines in this file?

It reads this file. But all right, I'm going to redo the test cases to read just one line.

And at the end, results are a little confusing... 0 ns/op?

Yes, I also found it a bit weird. Probably that's because the whole file was loaded into memory. I'm going to check.

przmv commented 8 years ago

Oh, I definetely need more sleep (or more :coffee: )

I've just corrected the benchmarks:

BenchmarkScannerReader-4          500000          2040 ns/op
BenchmarkReaderReader-4           500000          2206 ns/op

satyrius commented 8 years ago

So we get 2.0 ms to read line with Scanner and 2.2 to read with Reader. 10% slower on reading.

przmv commented 8 years ago

$ go test -bench Reader -benchmem -benchtime 1m
..................................
34 total assertions

..............
48 total assertions

.........
57 total assertions

.....
62 total assertions

........................................
102 total assertions

PASS
BenchmarkScannerReader-4        50000000          2296 ns/op        4096 B/op          1 allocs/op
BenchmarkReaderReaderAppend-4   50000000          2625 ns/op        4096 B/op          1 allocs/op
BenchmarkReaderReaderBuffer-4   30000000          2818 ns/op        4208 B/op          2 allocs/op
ok      github.com/pshevtsov/gonx   392.334s

satyrius commented 8 years ago

Good job! Do you want to add something or we can merge it?

przmv commented 8 years ago

Please hold on for some time. I'm working on some improvements — e.g. no need for a loop when reading short lines — so both techniques will work the same for the most common use cases (i.e. reading short lines). Also I'd like to test which technique (append or bytes.Buffer) will work better for long lines.

I'll let you know soon. Cheers

przmv commented 8 years ago

Hey,

It turns out that for reading long lines using bytes.Buffer is better:

BenchmarkScannerReader-4              500000          2679 ns/op        4305 B/op          5 allocs/op
BenchmarkReaderReaderAppend-4         500000          2650 ns/op        4305 B/op          5 allocs/op
BenchmarkReaderReaderBuffer-4         500000          2640 ns/op        4305 B/op          5 allocs/op
BenchmarkLongReaderReaderAppend-4        500       2588516 ns/op     5666680 B/op         29 allocs/op
BenchmarkLongReaderReaderBuffer-4       1000       2276376 ns/op     4076730 B/op         20 allocs/op

przmv commented 8 years ago

Hey @satyrius when are you going to merge this PR? I have a blocking issue because of inability to read long lines in the application I'm currently working on. Thanks!

satyrius commented 8 years ago

Well done! Going to master.

satyrius / gonx

Use bufio.Reader for large lines processing #23