hhatto / gocloc

A little fast cloc(Count Lines Of Code)
MIT License

scannerloop stops after encountering a very long line #81

Open timmattison opened 7 months ago

timmattison commented 7 months ago

If a file contains a line longer than the 1MB buffer limit (don't ask), the scanner.Scan() call in the scannerloop's for condition returns false. When this happens, line counting for the current file stops at that point and the results reported for that file are incorrect.

https://github.com/hhatto/gocloc/blob/7b24285f3e4368e0b3df5cd16b0969f3c9be03cb/file.go#L90

I could see a few fixes for this.

1) A new option to set the maximum buffer size, falling back to the current 1MB limit when it is unset:

    if opts.MaxLineLength > 0 {
        scanner.Buffer(buf.Bytes(), opts.MaxLineLength)
    } else {
        scanner.Buffer(buf.Bytes(), 1024*1024)
    }

2) Scanning each file ahead of time to find the longest gap between line endings, then automatically using that as the buffer size. This does require reading the file twice, though.

3) Changing the scannerloop to use something like mmap instead of scanner.

If you're interested in the third one let me know and I'll work on a PR.

The first one probably touches a bit more of the overall design than I should take on for a first PR.

I think the second one is safe, but it does double the I/O required. Disk caching may mean the second pass costs less than reading the raw data from disk twice, but it still feels like a last resort.