Closed enh-google closed 7 months ago
thanks for the report. yes, we are aware of the limitations of the gototab. here is the commentary in b.c. the bigger you make it, the slower awk will get. so don't do that. we will fix gototab when we have some time.
Regular expressions are more complicated, since the basic
mechanism of the goto table used 8-bit byte indices into the
gototab entries to compute the next state. Unicode is a lot
bigger, so the gototab entries are now structs with a character
and a next state, and there is a linear search of the characters
to find the state. (Yes, this is slower, by a significant
amount. Tough.)
Throughout the RE mechanism in b.c, utf-8 characters are
converted to their utf-32 value. This mostly shows up in
cclenter, which expands character class ranges like a-z and now
alpha-omega. The size of a gototab array is still about 256.
This should be dynamic, but for now things work ok for a single
code page of Unicode, which is the most likely case.
the bigger you make it, the slower awk will get. so don't do that.
well, that means one-true-awk's master branch is currently broken for Unicode input. so my choices are basically (a) revert back to before the unicode stuff went in or (b) fix forward (with changing NCHARS being the only easy change).
that seems like an unfortunate situation for any users of one-true-awk.
there doesn't seem to be a compile-time option to disable the Unicode support either, even though it's unfinished?
should downstream users of one-true-awk be using a different, more stable branch if master is expected to be broken?
FWIW, PR #195 makes it possible to disable unicode support for the C locale. I don't know if that helps your use case or not...
It looks like the code that used to reset the goto table for a state no longer does so. This used to be:
for (i = 0; i < NCHARS; i++)
f->gototab[2][i] = 0;
It is now:
for (i = 0; i < NCHARS; i++)
set_gototab(f, 2, 0, 0);
However, that will not actually set the entries to 0. In fact, I don't think it will make any changes at all to the existing values since the condition inside the for() will only be true for an unused slot.
I believe in the new scheme this should be something like:
memset(f->gototab[2], 0, sizeof(gtt) * NCHARS);
Perhaps this could be hidden in a reset_gototab() function since there are two places where the clearing is performed.
No change in the test results after making this change.
well, that means one-true-awk's master branch is currently broken for Unicode input.
I'm going to disagree. as I said we are quite aware of this limitation.
FWIW, PR #195 makes it possible to disable unicode support for the C locale. I don't know if that helps your use case or not...
yeah, that's the kind of thing that would be useful while Unicode support is still a work in progress, but unfortunately for me, Android is "C.UTF-8" (so MB_CUR_MAX is 4) :-(
well, that means one-true-awk's master branch is currently broken for Unicode input.
I'm going to disagree. as I said we are quite aware of this limitation.
i mean, just factually: "one-true-awk now crashes given utf-8 input that it used to not crash on" isn't really something you can disagree with ... the question is whether that's reasonable for the master branch or not.
so it sounds like you're implying that users that don't want work-in-progress changes should only take tagged releases instead of just "whatever's on master", and if i go back to the 20230909
tag i'll be fine, and tagged releases won't include Unicode until it's finished?
if so, fair enough, and i'll do that. (but i think it's worth making it clearer. i assumed from its existence that the staging
branch was where work in progress lived...)
It looks like the code that used to reset the goto table for a state no longer does so. This used to be:
for (i = 0; i < NCHARS; i++) f->gototab[2][i] = 0;
It is now:
for (i = 0; i < NCHARS; i++) set_gototab(f, 2, 0, 0);
However, that will not actually set the entries to 0. In fact, I don't think it will make any changes at all to the existing values since the condition inside the for() will only be true for an unused slot.
I believe in the new scheme this should be something like:
memset(f->gototab[2], 0, sizeof(gtt) * NCHARS);
Perhaps this could be hidden in a reset_gototab() function since there are two places where the clearing is performed.
No change in the test results after making this change.
yes, this change works for me --- i get the correct result, and the timing is actually slightly faster than gawk:
~/aosp-main-with-phones/external/one-true-awk$ aapt2 dump resources ~/aosp-main-with-phones/out/target/product/vsoc_arm64/obj/APPS/webview_intermediates/package.apk | time awk -F ' |=' '/^Package/{print $3}'
com.android.webview
0.01user 0.01system 0:00.29elapsed 11%CPU (0avgtext+0avgdata 3072maxresident)k
0inputs+0outputs (0major+276minor)pagefaults 0swaps
~/aosp-main-with-phones/external/one-true-awk$
Thanks for testing, I've cleaned this up and submitted #200
I'm not implying anything @enh-google the current version of awk is most certainly not a "work in progress" and whatever is in the master is always the definitive version, with warts and all. [we had the unicode version as work in progress in a branch literally for months] thanks for the report.
oh, actually i can upload the gzipped version of the failing AOSP build input; not sure why it didn't work before: in.gz
thanks, really not necessary, I can trivially reproduce the problem with large utf-8 pages.
@enh-google FYI, changing your program to '/^Package/{print $3; exit}
works even with this awk. If you know for sure there's only one Package
line it's worth doing that, no matter what.
In the meantime, work is in progress to remove the fixed limit you've run into.
@enh-google FYI, changing your program to
'/^Package/{print $3; exit}
works even with this awk. If you know for sure there's only onePackage
line it's worth doing that, no matter what.
heh, interesting idea. i'll see what the build folks think: https://android-review.googlesource.com/c/platform/build/+/2784389
In the meantime, work is in progress to remove the fixed limit you've run into.
thanks! in the meantime i'd be curious to see a specific example where the OpenBSD change doesn't work, if there's a known counterexample for that?
gototab issues have been fixed for wide-character induced growth.
i'm trying to update the prebuilt one-true-awk binary used to build AOSP, but hitting failures:
(this is a one-true-awk build that includes everything up to and including d8e4368964e4471a54a755823004f2b1aabc0f80.)
looks reproducible without all the rest of the AOSP build, thankfully. here's gawk succeeding:
and here's one-true-awk failing:
one-true-awk succeeds if i pipe through head(1) first, which seems consistent with running out of space.
it's not obvious to me why the array size in set_gototab is NCHARS, and i think the current value of NCHARS is a typo? someone typed 1256 instead of 256?
changing that to the presumably correct 256 just makes one-true-awk fail on input line 12256 instead of 15671 though :-)
alternatively
works a bit better (in that we get the expected line of output), but we also get a bunch of blank lines following that and end with a SIGSEGV.
actually, that might be the right fix? (the input is basically "all the localized strings from Chrome, for every supported language", so, yeah, there's quite a bit of Unicode in there) if i bump NCHARS all the way up to ensure it covers 0x10ffff:
that works. it's a lot slower than gawk, but it's probably fine, and Unicode performance is a problem for another day:
i'm happy to help do any further debugging if that doesn't seem like the right fix. unfortunately i don't think i can attach the 7MiB input file (2MiB compressed) here, and attempts to repro with 100k blank lines followed by the line we're looking for haven't worked (nor did 100k copies of a random line from the real input).