onetrueawk / awk

One true awk

Excessive memory usage regression #227

Closed hiltjo closed 2 months ago

hiltjo commented 3 months ago

Hi,

I upgraded from OpenBSD 7.4 to OpenBSD 7.5, which uses onetrueawk, and noticed a regression: a script of mine that uses awk to process mail and httpd logs now shows excessive memory usage compared to the previous version.

From some bisecting I found this:

The commit 9e254e503f844e122870e9488db3d7b0233e554c from 16 Nov 2023 seems to be OK. The commit 345f907c404ff05165834601009835a42c90463d from 20 Nov 2023 and later commits seem to show this behaviour.
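The search between these two commits could be automated with `git bisect run`. The following is only a sketch, not what was actually used for the report: the file name `bisect-step.sh`, the helper names and the 512 MB cap are hypothetical, and it assumes that the checkout builds with a plain `make` into `./a.out`, that an `input.txt` like the one generated by the test script already exists, and that `ulimit -d` is honoured by the shell and kernel in use.

```shell
#!/bin/sh
# bisect-step.sh: hypothetical helper for `git bisect run`, not from the report.
# Usage, run from the top of the awk checkout (bad commit first, then good):
#   git bisect start 345f907c404ff05165834601009835a42c90463d \
#                    9e254e503f844e122870e9488db3d7b0233e554c
#   git bisect run sh /path/to/bisect-step.sh /path/to/input.txt

# Run a command with the data segment capped at $1 kilobytes.
# Returns the command's exit status; its output is discarded.
run_capped() {
    cap_kb="$1"; shift
    ( ulimit -d "$cap_kb" && "$@" ) >/dev/null 2>&1
}

# Build the current checkout and classify it for git bisect:
# 0 = good, 1 = bad, 125 = cannot be tested (build failed).
bisect_step() {
    input="$1"
    make >/dev/null 2>&1 || return 125
    # Assumption: 524288 KB (512 MB) is well above what a good build needs
    # and below what a regressed build tries to allocate.
    if run_capped 524288 env LC_ALL=C ./a.out '/"POST /' "$input"; then
        return 0
    fi
    return 1
}

# Only act when an input file is given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    bisect_step "$1"
    exit $?
fi
```

Note that any failure of the awk run, not only running out of memory, is reported as "bad" here, so the cap may need adjusting for a given machine.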

Below is a small, simplified test script that should (hopefully) reproduce the behaviour on your machine as well.

Script:

AWK="$HOME/tmp/onetrueawk/awk/a.out"

# generate test input (~138MB), simulates a log file.
generate() {
    i=0
    while :; do
        echo '*.codemadness.org 127.0.0.1 - - [14/Apr/2024:00:00:43 +0200] "GET / HTTP/1.1" 200 6 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"'
        i=$((i + 1))
        test "$i" = "696969" && break
    done > input.txt
}

# run the pattern that triggers the problem; no input line matches it,
# so nothing is printed.
run() {
    LC_ALL=C "$AWK" '/"POST /' < input.txt
}

generate
run

Thank you,

plan9 commented 3 months ago

Hi hiltjo, thanks for spotting this. I tracked the issue to a change we made to the regular expression engine to deal with a pathological case where gototab was blowing up. We now resize that table when needed. In some cases (e.g. /"POST / but not /POST /) this causes the excessive memory use and slowdown you observe. I know both examples run without any issues in earlier versions. I will look into this.
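The difference between the two patterns mentioned above can be checked side by side. This is a rough sketch under stated assumptions: `probe` is a hypothetical helper name, `AWK` defaults to whatever `awk` is first in `PATH` unless it is set to the binary under test, `date +%s` is assumed to be available, and the regex text must not contain a `/` because it is pasted into a regex literal.

```shell
#!/bin/sh
# Hypothetical comparison helper; AWK and the input path are assumptions.
AWK="${AWK:-awk}"

# Print "<elapsed seconds> <matching lines>" for one regex applied to a file.
# The regex text is placed between slashes, so it is compiled as a regex
# literal just like in the report. It must not contain "/".
probe() {
    re="$1"; file="$2"
    start=$(date +%s)
    matches=$(LC_ALL=C "$AWK" "/$re/ { n++ } END { print n + 0 }" "$file")
    end=$(date +%s)
    echo "$((end - start)) $matches"
}
```

For example, `probe '"POST ' input.txt` followed by `probe 'POST ' input.txt` gives one line per pattern, and the first column can be compared between an affected and an unaffected build. Whole-second resolution is coarse, but it should be enough to see a slowdown of the size described in the report.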

plan9 commented 2 months ago

fixed, thanks Arnold.