trentm / go-ecslog

`ecslog` CLI to pretty-print and filter log files in ecs-logging format
Apache License 2.0

crashes on line >64k long #7

Closed trentm closed 3 years ago

trentm commented 3 years ago

Currently have this for the main read loop in internal/ecslog/ecslog.go:

scanner := bufio.NewScanner(in)
for scanner.Scan() {
    line := scanner.Text()
    // ... use line
}
return scanner.Err()

but that fails with a line longer than 64k with

% ./ecslog ../go-ecslog/cmd/ecslog/testdata/crash-long-line.log
ecslog: error: bufio.Scanner: token too long
% echo $?
1

because of bufio.MaxScanTokenSize (64 KiB). That limit can be raised, I believe, via Scanner.Buffer, but any fixed maximum will eventually be too small for some input, and the scanner would still try to read a whole line into memory, however long it is.

The solution is to no longer use bufio.Scanner.

first attempt

Next tried this:

reader := bufio.NewReader(in)
for {
  line, readErr := reader.ReadBytes('\n')
  // ...
}

Here are files from 10GB down to 10MB, each a single line with no '\n':

python -c '
import sys
token="."*1024
for i in range(10*1024*1024): sys.stdout.write(token)
' >longline.10GB
python -c '
import sys
token="."*1024
for i in range(1024*1024): sys.stdout.write(token)
' >longline.1GB
python -c '
import sys
token="."*1024
for i in range(100*1024): sys.stdout.write(token)
' >longline.100MB
python -c '
import sys
token="."*1024
for i in range(10*1024): sys.stdout.write(token)
' >longline.10MB

Processing those doesn't go so well (watch mem usage, e.g. via htop -F ecslog):

./ecslog longline.1GB >/dev/null

because that ReadBytes will keep reading 4kB blocks until it reads the whole thing into memory. Using 10GB of memory isn't acceptable.

next attempt: bufio.Reader.ReadLine

This works well. Fix coming.