katef / libfsm

DFA regular expression library & friends
BSD 2-Clause "Simplified" License
931 stars 52 forks source link

Fix incomplete minimisation due to edge_set label collection bug #407

Closed silentbicycle closed 1 year ago

silentbicycle commented 1 year ago

Add better testing for minimisation and fix incomplete minimisation's root cause.

Fixes #404.

Add minimisation test oracle and minimise_test_case_list.c.

Testing was previously done with fsm -t equal, which checks that the state machine matches the same input as the expected result, but doesn't confirm that it is minimal. Also minimised the FSM using a reference implementation and verify that the resulting state count is as expected.

The fuzzer checks whether, for an arbitrary string S, if S can be successfully converted to an NFA, whether using fsm_shuffle to renumber the states can lead to a minimised state count inconsistent with the result from the minimisation test oracle. Syntax errors, etc. are ignored.

To use this, run with env MODE=m.

After integrating the edge_set_check_edges_with_EC_mapping fix I ran this on 8 cores for about two hours without producing any new failures.

Fix the bug in edge_set's label collection during minimisation

Use EC membership, not state id, for label partitioning check.

When minimisation checked the states' edge sets for sets of labels leading to particular destinations it was collecting labels that led to the same state, but it should have instead collected labels to any states within the same EC (which is progressively refined as information propagates during minimisation). This could lead to cases where state(s) would incorrectly split out of an EC, leading to a non-minimal DFA where some states (each representing an EC set) could be combined safely and produce a smaller DFA.

A small input that exercised this bug is "ab*c", which produced a correct but non-minimal DFA with 4 states instead of 3. This was added to tests/minimise/minimise_test_case_list.c as a regression.

Other misc. changes

silentbicycle commented 1 year ago

CI is failing with a linker issue, but only specifically with ASAN gcc builds. I can duplicate that failure mode locally but haven't figured out why it's happening yet.

silentbicycle commented 1 year ago

The CI issue was because the last definition in the file passed to objcopy --keep-global-symbols was getting missed by gcc at link-time, but only when using -fsanitize=address.