onetrueawk / awk

One true awk
Other
1.96k stars 156 forks source link

`awk: regular expression too big: out of space in set_gototab...` #199

Closed enh-google closed 7 months ago

enh-google commented 9 months ago

i'm trying to update the prebuilt one-true-awk binary used to build AOSP, but hitting failures:

/bin/bash -c "(rm -f out/target/product/vsoc_arm64/obj/APPS/webview_intermediates/package.apk ) && (cp \"out/soong/.intermediates/external/chromium-webview/webview/android_common/signed/webview.apk\" \"out/target/product/vsoc_arm64/obj/APPS/webview_intermediates/package.apk\" ) && (mkdir -p out/target/product/vsoc_arm64/appcompat && rm -f out/target/product/vsoc_arm64/appcompat/webview.log && echo -n \"Package name: \" >> out/target/product/vsoc_arm64/appcompat/webview.log && out/host/linux-x86/bin/aapt2 dump resources out/target/product/vsoc_arm64/obj/APPS/webview_intermediates/package.apk | awk -F ' |=' '/^Package/{print \$3}' >> out/target/product/vsoc_arm64/appcompat/webview.log && echo \"Module name in Android tree: webview\" >> out/target/product/vsoc_arm64/appcompat/webview.log && echo \"Local path in Android tree: external/chromium-webview\" >> out/target/product/vsoc_arm64/appcompat/webview.log && echo \"Install path: product/app/webview/webview.apk\" >> out/target/product/vsoc_arm64/appcompat/webview.log && echo >> out/target/product/vsoc_arm64/appcompat/webview.log ) && (echo \"appcompat.sh output:\" >> out/target/product/vsoc_arm64/appcompat/webview.log && PACKAGING=out/target/common/obj/PACKAGING ANDROID_LOG_TAGS=\"*:e\" out/host/linux-x86/bin/appcompat.sh --dex-file=out/target/product/vsoc_arm64/obj/APPS/webview_intermediates/package.apk --api-flags=out/soong/hiddenapi/hiddenapi-flags.csv 2>&1 >> out/target/product/vsoc_arm64/appcompat/webview.log )"
awk: regular expression too big: out of space in set_gototab...
 input record number 15671, file 
 source line number 1

(this is a one-true-awk build that includes everything up to and including d8e4368964e4471a54a755823004f2b1aabc0f80.)

looks reproducible without all the rest of the AOSP build, thankfully. here's gawk succeeding:

~/aosp-main-with-phones$ aapt2 dump resources out/target/product/vsoc_arm64/obj/APPS/webview_intermediates/package.apk | awk -F ' |=' '/^Package/{print $3}'
com.android.webview

and here's one-true-awk failing:

~/aosp-main-with-phones$ aapt2 dump resources out/target/product/vsoc_arm64/obj/APPS/webview_intermediates/package.apk | ./prebuilts/build-tools/linux-x86/bin/one-true-awk -F ' |=' '/^Package/{print $3}'
com.android.webview
./prebuilts/build-tools/linux-x86/bin/one-true-awk: regular expression too big: out of space in set_gototab...
 input record number 15671, file 
 source line number 1
~/aosp-main-with-phones$ 

one-true-awk succeeds if i pipe through head(1) first, which seems consistent with running out of space.

it's not obvious to me why the array size in set_gototab is NCHARS, and i think the current value of NCHARS is a typo? someone typed 1256 instead of 256?

changing that to the presumably correct 256 just makes one-true-awk fail on input line 12256 instead of 15671 though :-)

alternatively

-       f->gototab_len = NCHARS; /* should be variable, growable */
+       f->gototab_len = 100000; /* should be variable, growable */

works a bit better (in that we get the expected line of output), but we also get a bunch of blank lines following that and end with a SIGSEGV.

actually, that might be the right fix? (the input is basically "all the localized strings from Chrome, for every supported language", so, yeah, there's quite a bit of Unicode in there) if i bump NCHARS all the way up to ensure it covers 0x10ffff:

-#define NCHARS (1256+3)                /* 256 handles 8-bit chars; 128 does 7-bit */
-                               /* BUG: some overflows (caught) if we use 256 */
-                               /* watch out in match(), etc. */
+#define NCHARS (0x10ffff+3)    /* 0x10ffff is every Unicode code point */

that works. it's a lot slower than gawk, but it's probably fine, and Unicode performance is a problem for another day:

~/aosp-main-with-phones/external/one-true-awk$ aapt2 dump resources ~/aosp-main-with-phones/out/target/product/vsoc_arm64/obj/APPS/webview_intermediates/package.apk | time ~/aosp-main-with-phones/out/host/linux-x86/bin/one-true-awk -F ' |=' '/^Package/{print $3}'
com.android.webview
0.86user 0.02system 0:01.11elapsed 79%CPU (0avgtext+0avgdata 2048maxresident)k
0inputs+0outputs (0major+169minor)pagefaults 0swaps
~/aosp-main-with-phones/external/one-true-awk$ aapt2 dump resources ~/aosp-main-with-phones/out/target/product/vsoc_arm64/obj/APPS/webview_intermediates/package.apk | time awk -F ' |=' '/^Package/{print $3}'
com.android.webview
0.02user 0.01system 0:00.31elapsed 9%CPU (0avgtext+0avgdata 3072maxresident)k
0inputs+0outputs (0major+270minor)pagefaults 0swaps
~/aosp-main-with-phones/external/one-true-awk$ 

i'm happy to help do any further debugging if that doesn't seem like the right fix. unfortunately i don't think i can attach the 7MiB input file (2MiB compressed) here, and attempts to repro with 100k blank lines followed by the line we're looking for haven't worked (nor did 100k copies of a random line from the real input).

plan9 commented 9 months ago

thanks for the report. yes, we are aware of the limitations of the gototab. here is the commentary in b.c. the bigger you make it, the slower awk will get. so don't do that. we will fix gototab when we have some time.

   Regular expressions are more complicated, since the basic
   mechanism of the goto table used 8-bit byte indices into the
   gototab entries to compute the next state.  Unicode is a lot
   bigger, so the gototab entries are now structs with a character
   and a next state, and there is a linear search of the characters
   to find the state.  (Yes, this is slower, by a significant
   amount.  Tough.)

   Throughout the RE mechanism in b.c, utf-8 characters are
   converted to their utf-32 value.  This mostly shows up in
   cclenter, which expands character class ranges like a-z and now
   alpha-omega.  The size of a gototab array is still about 256.
   This should be dynamic, but for now things work ok for a single
   code page of Unicode, which is the most likely case.
enh-google commented 9 months ago

the bigger you make it, the slower awk will get. so don't do that.

well, that means one-true-awk's master branch is currently broken for Unicode input. so my choices are basically (a) revert back to before the unicode stuff went in or (b) fix forward (with changing NCHARS being the only easy change).

that seems like an unfortunate situation for any users of one-true-awk.

there doesn't seem to be a compile-time option to disable the Unicode support either, even though it's unfinished?

should downstream users of one-true-awk be using a different, more stable branch if master is expected to be broken?

millert commented 9 months ago

FWIW, PR #195 makes it possible to disable unicode support for the C locale. I don't know if that helps your use case or not...

millert commented 9 months ago

It looks like the code that used to reset the goto table for a state no longer does so. This used to be:

for (i = 0; i < NCHARS; i++)
        f->gototab[2][i] = 0;

It is now:

for (i = 0; i < NCHARS; i++)
        set_gototab(f, 2, 0, 0);

However, that will not actually set the entries to 0. In fact, I don't think it will make any changes at all to the existing values since the condition inside the for() will only be true for an unused slot.

I believe in the new scheme this should be something like:

memset(f->gototab[2], 0, sizeof(gtt) * NCHARS);

Perhaps this could be hidden in a reset_gototab() function since there are two places where the clearing is performed.

No change in the test results after making this change.

plan9 commented 9 months ago

well, that means one-true-awk's master branch is currently broken for Unicode input.

I'm going to disagree. as I said we are quite aware of this limitation.

enh-google commented 9 months ago

FWIW, PR #195 makes it possible to disable unicode support for the C locale. I don't know if that helps your use case or not...

yeah, that's the kind of thing that would be useful while Unicode support is still a work in progress, but unfortunately for me, Android is "C.UTF-8" (so MB_CUR_MAX is 4) :-(

enh-google commented 9 months ago

well, that means one-true-awk's master branch is currently broken for Unicode input.

I'm going to disagree. as I said we are quite aware of this limitation.

i mean, just factually: "one-true-awk now crashes given utf-8 input that it used to not crash on" isn't really something you can disagree with ... the question is whether that's reasonable for the master branch or not.

so it sounds like you're implying that users that don't want work-in-progress changes should only take tagged releases instead of just "whatever's on master", and if i go back to the 20230909 tag i'll be fine, and tagged releases won't include Unicode until it's finished?

if so, fair enough, and i'll do that. (but i think it's worth making it clearer. i assumed from its existence that the staging branch was where work in progress lived...)

enh-google commented 9 months ago

It looks like the code that used to reset the goto table for a state no longer does so. This used to be:

for (i = 0; i < NCHARS; i++)
        f->gototab[2][i] = 0;

It is now:

for (i = 0; i < NCHARS; i++)
        set_gototab(f, 2, 0, 0);

However, that will not actually set the entries to 0. In fact, I don't think it will make any changes at all to the existing values since the condition inside the for() will only be true for an unused slot.

I believe in the new scheme this should be something like:

memset(f->gototab[2], 0, sizeof(gtt) * NCHARS);

Perhaps this could be hidden in a reset_gototab() function since there are two places where the clearing is performed.

No change in the test results after making this change.

yes, this change works for me --- i get the correct result, and the timing is actually slightly faster than gawk:

~/aosp-main-with-phones/external/one-true-awk$ aapt2 dump resources ~/aosp-main-with-phones/out/target/product/vsoc_arm64/obj/APPS/webview_intermediates/package.apk | time awk -F ' |=' '/^Package/{print $3}'
com.android.webview
0.01user 0.01system 0:00.29elapsed 11%CPU (0avgtext+0avgdata 3072maxresident)k
0inputs+0outputs (0major+276minor)pagefaults 0swaps
~/aosp-main-with-phones/external/one-true-awk$ 
millert commented 9 months ago

Thanks for testing, I've cleaned this up and submitted #200

plan9 commented 9 months ago

I'm not implying anything @enh-google the current version of awk is most certainly not a "work in progress" and whatever is in the master is always the definitive version, with warts and all. [we had the unicode version as work in progress in a branch literally for months] thanks for the report.

enh-google commented 9 months ago

oh, actually i can upload the gzipped version of the failing AOSP build input; not sure why it didn't work before: in.gz

plan9 commented 9 months ago

thanks, really not necessary, I can trivially reproduce the problem with large utf-8 pages.

arnoldrobbins commented 9 months ago

@enh-google FYI, changing your program to '/^Package/{print $3; exit} works even with this awk. If you know for sure there's only one Package line it's worth doing that, no matter what.

In the meantime, work is in progress to remove the fixed limit you've run into.

enh-google commented 9 months ago

@enh-google FYI, changing your program to '/^Package/{print $3; exit} works even with this awk. If you know for sure there's only one Package line it's worth doing that, no matter what.

heh, interesting idea. i'll see what the build folks think: https://android-review.googlesource.com/c/platform/build/+/2784389

In the meantime, work is in progress to remove the fixed limit you've run into.

thanks! in the meantime i'd be curious to see a specific example where the OpenBSD change doesn't work, if there's a known counterexample for that?

plan9 commented 7 months ago

gototab issues have been fixed for wide-character induced growth.