laurikari / tre

The approximate regex matching library and agrep command line tool.
Other
803 stars 133 forks source link

Add error messages to some silent failure cases in test-str-source.c #113

Open trushworth opened 1 month ago

trushworth commented 1 month ago

test-str-source fails silently on FreeBSD 14 (when built with waf). Adding these messages is the start of finding out why, and results in the following output:

Match pattern {(foo)\1} against string {xfoofofoofoo} failed, err code 0x00000002 Match pattern {(cat|dog)\1} against string {catcat} failed, err code 0x00000002 Match pattern {(cat|dog)\1} against string {catdog} failed, err code 0x00000002 Match pattern {(cat|dog)\1} against string {dogdog} failed, err code 0x00000002 Match pattern {(cat|dog)\1} against string {dogcat} failed, err code 0x00000002

dag-erling commented 1 month ago

First, please follow the established style when adding or modifying code.

Second, two of the tests (third and fifth) are meant to fail.

Third, all tests work as expected on FreeBSD when built with autotools, so the problem lies with waf.

I've updated my des/windows-ci branch to address these issues, se #112.

trushworth commented 1 month ago

My apologies if I didn't follow the established style, I thaought I was doing so. On reviewing the code the indentation seems to be correct (multiples of 2 spaces), is the problem the {}s when there is only one statement in the block or the placement of the {}s? If not either of those please let me know what I missed so I can do better in the future.

However, reviewing the code here did make another error apparent. The two error messages for failed allocations in the make_str_source() function should say "Could not ...." (i.e. "not" is missing). Let me know if you want me to correct this when I correct whatever the style problem is, or if you want to do it here. For that matter feel free to correct the style problem here too if that's easier than explaining my mistake to me :).

With respect to failing tests, yes failing matches that are supposed to fail are sucessfuls tests, that's not what I'm complaining about. What I'm complaining about is that the RE parser is happy with the patterns (as it should be), but the RE interpreter rejected the compiled pattern altogether. I could probably have expressed it better the first time, but I thought the error code of 2 was sufficient.

As to the where the bug actually is, I'm still chasing that. I built using the autotools as you sugested, and it does indeed run correctly, but both waf and the autotools are using the same binary for the compiler, so unless there is something obscure in the compiler (e.g. different behaviour when called as 'cc' vs 'clang') I doubt that has anything to do with it. I'm NOT ruling out a bug in waf or a bug in the compiler, just investigating other things first.

The one significant difference between what the autotools are testing and what waf is testing is that the waf build is testing a statically linked version of test-str-source.c, and the autotool build is testing a libtoolized dynamically linked version. I'm not particularly eager to dig through the libtoolizing shell script to see why it works because I think chasing why the statically linked version doesn't work is going to be a lot easier :). I will of course update here when I find the root cause.

One last comment. If you want to know why on earth I'm building and testing static versions (the waf build actually builds both static and dynamic), it's because I've been bitten too many times with dynamic linking issues that are hard to sort out because they're different in every OS. I prefer to sort out the code problems first and the linking problems later :).

trushworth commented 1 month ago

OK, I found it.

There's a missing set of {}s in regexec.c. Their absence is harmless if TRE is configured with TRE_USE_ALLOCA, but causes a bad return if TRE_USE_ALLOCA is not defined. The fix is simple, put 'em in :).

This sorta implies I should add yet another variant to the waf build for with/without ALLOCA, but that's a separate job.

It's late here, I'll put a pull request in with the fix first thing tomorrow when I'm less likely to make mistrakes :).

BTW, it turns out I sorta lied about static linking. The waf build does build the static lib, but the tests are dynamically linked. Yet more variants to test...

dag-erling commented 1 month ago

My apologies if I didn't follow the established style, I thaought I was doing so. On reviewing the code the indentation seems to be correct (multiples of 2 spaces), is the problem the {}s when there is only one statement in the block or the placement of the {}s? If not either of those please let me know what I missed so I can do better in the future.

Ville uses the standard GNU style. I don't like it but at least he's consistent. You use K&R style with two-space indents, Yoda conditionals, and no spaces around punctuation.

There's a missing set of {}s in regexec.c. Their absence is harmless if TRE is configured with TRE_USE_ALLOCA, but causes a bad return if TRE_USE_ALLOCA is not defined.

We should remove TRE_USE_ALLOCA for 1.0.0. I've benchmarked it, and while it does make a difference, it's microscopic (less than one tenth of one percent, if memory serves). It's not worth the trouble.

trushworth commented 1 month ago

My apologies if I didn't follow the established style, I thaought I was doing so. On reviewing the code the indentation seems to be correct (multiples of 2 spaces), is the problem the {}s when there is only one statement in the block or the placement of the {}s? If not either of those please let me know what I missed so I can do better in the future.

Ville uses the standard GNU style. I don't like it but at least he's consistent. You use K&R style with two-space indents,

By K&R style I'm guessing you mean the trailing '{' on the end of the line containing the 'if'? For this particular file Ville seems to use two styles. Is the one in the getopt() processing part of the main() function what you are calling K&R and the one from the second 'if' in the make_str_source() what you are calling GNU style? I simply used what I could see on the screen when I started adding, which was the 'if' statements in the getopt() processing, then continued with that style when I added stuff earlier in the file. I absolutely agree with you that consistency should be the goal, but I was unwilling to make changes to the original source code without knowing the original author's preference, and I couldn't sort that out from the one file. How about you and Ville decide and I'll go with whatever you two say?

With respect to the two-space indents, that seems to be what's in the files. To be precise, from looking at lib/regexec.c, the indent seems to be 0 or more tab characters at a tab width of 8, followed by 0, 2, 4, or 6 spaces to give a final visual appearance of two-space indents. I haven't checked every single line in every file mind you, but I have checked several. It's always a bit of a guessing game when when I'm looking at someone else's code. The test-str-source.c file only has 2 tabs mind you, so I just used spaces for indents like the rest of the lines of the file. However, all that said, I'm quite willing to go with whatever Ville wants, and if he doesn't care, then whatever you want, but I'm pretty sure what I submitted is consistent with at least the file it was in.

Yoda conditionals, and no spaces around punctuation.

Heh - I've never heard them called that. Pretty good name :). This use was just habit from someone who has programmed in C for too long and has gotten tired of searching for mistaken assignments that should have been comparisons in conditionals. The Yoda conditionals put the constant value in the L-value position if I accidently type an assignment instead of a comparison, so that the compiler catches it and complains. Most modern compilers can distinguish (and complain) about an assignment found where a conditional is normally to be expected, so the trick is much less useful these days, and since it has always read strangely I'll try to avoid it for this project.

There's a missing set of {}s in regexec.c. Their absence is harmless if TRE is configured with TRE_USE_ALLOCA, but causes a bad return if TRE_USE_ALLOCA is not defined.

You dealt with this in another pull request, and that PR is fine.

We should remove TRE_USE_ALLOCA for 1.0.0. I've benchmarked it, and while it does make a difference, it's microscopic (less than one tenth of one percent, if memory serves). It's not worth the trouble.

The gain in code clarity would certainly be worth the loss of that tiny percentage :). Do you remember what system(s) you ran the bencmark in? Do you still have the code to do it? I ask because I have an even more radical memory management scheme in the projects I'm using TRE for, and I'd like to benchmark it :).

dag-erling commented 1 month ago

By K&R style I'm guessing you mean the trailing '{' on the end of the line containing the 'if'?

Yes, see https://en.wikipedia.org/wiki/Indentation_style

For this particular file Ville seems to use two styles.

No, he's very consistent.

Is the one in the getopt() processing part of the main() function what you are calling K&R and the one from the second 'if' in the make_str_source() what you are calling GNU style? I simply used what I could see on the screen when I started adding, which was the 'if' statements in the getopt() processing, then continued with that style when I added stuff earlier in the file.

Except it was you who added the getopt() loop, not Ville (see a54691197ad0).

Do you remember what system(s) you ran the bencmark in?

FreeBSD 14.1 on an Intel Core i7-10700T (8c/16t) with 16 GB RAM.

Do you still have the code to do it?

No, but it's easy enough to recreate:

#!/bin/sh
set -e
N=10
benchmark() {
    local outfile=$1
    :>"$outfile"
    for i in $(seq $N); do
        /usr/bin/time -ao "$outfile" \
            sh -c 'for j in $(seq '$N'); do ./tests/retest -o /dev/null; done'
    done
}
set -- without-alloca with-alloca
for variant; do
    ./configure --$variant 
    gmake
    gmake -C tests retest
    benchmark $variant
done
ministat -C3 "$@"

Note that the difference was not measurable with N=10, and running time is quadratic. I believe I ended up using N=100 and letting it run overnight. However, it is likely that the branch I was running on had the tre_match() bug that was fixed in #114, so my results may not be valid.

dag-erling commented 1 month ago

Note that my benchmark script configures TRE without debugging, which conversely causes the test suite to be built with malloc debugging, so you will see a siginficant performance benefit (around 10%) from using alloca() because the alternative is the much slower debugging allocator. However, if we enable debugging, the tests generate massive amounts of debugging output, which drives up the run time and makes small differences even harder to measure. To circumvent this, you'll need to edit tests/Makefile.am (remove or comment out the if TRE_DEBUG line and everything from else !TRE_DEBUG to endif !TRE_DEBUG) before running the benchmark script.

(IMO we should drop the malloc debugging stuff, we can use valgrind or platform-specific tools such as leaks(1) on macOS instead. We should also streamline the debugging code and make it possible to turn debugging output on and off without recompiling.)

With this change and N=100 I see a slight benefit from using alloca() on FreeBSD 14 (-2.65% ± 1.29%) and Ubuntu 22 (-2.67% ± 0.94%) but a massive performance drop on macOS 15 (+13.72% ± 0.13% on Apple silicon, +19.15 ± 0.87% on Intel silicon).

trushworth commented 1 month ago

Thanks for the info on the benchmarking, and "yes" to the comments about debugging output :).

I have a pretty well tested set of C macros for controlled debug output, where the production case can be set to either generate no debug code at all or generate code that tests an "enable_debug" runtime flag before doing any further flag checking or argument evaluation, if you don't already have something in mind.

Further discussion on this pull request should probably wait until #112 has been merged.