rsonquery / rsonpath

Blazing fast JSONPath query engine written in Rust.
https://rsonquery.github.io/rsonpath/
MIT License
50 stars 8 forks source link

Make the compiler produce minimal DFAs #91

Open V0ldek opened 1 year ago

V0ldek commented 1 year ago

Is your feature request related to a problem? Please describe. After #7 the compiler no longer necessarily outputs a minimal DFA. It's still a correct DFA, just not minimal.

Describe the solution you'd like The algorithm in query::automaton::minimizer needs to be rewritten to minimize the automaton.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Example query that produces a non-minimal automaton: $..a.*.*..b.*.*. In the resulting automaton below states 3 and 4 are equivalent.

garph

github-actions[bot] commented 1 year ago

Tagging @V0ldek for notifications

ZwerdDPU commented 1 year ago

Looking into this issue. So far:

It seems to me that the minimzation will only take place within checkpoint-created boundaries. Does this seem correct?

I intend to look further into this issue in the coming weeks. Let me know what you think.

V0ldek commented 1 year ago

Current checkpoint-based code is basically a hack to produce "somewhat-minimal" automata. In particular it can be proven that without wildcards the checkpoints are enough for the result to be truly minimal (Lemma 5.1. in https://v0ldek.com/masters).

The property is that once you reach a checkpoint you can forget all the other states, which means that a checkpoint will always be a unique state in any resulting DFA. So minimizing between them should be enough.

BTW, you can make your life testing this easier with https://paperman.name/semigroup/. It's a tool for automata properties, and allows you to minimize them. If you run rsonpath -c -v <query> it will actually print the DFA in a format that can be directly accepted by the tool.

For example:

just r -c '$..a.*.*..b.*.*' -v

will output as one of the logs:

[rsonpath_lib::query::automaton] NFA: s0.a -> s0, s1;
s0.b -> s0;
s0.X -> s0;
s1.a -> s2;
s1.b -> s2;
s1.X -> s2;
s2.a -> s3;
s2.b -> s3;
s2.X -> s3;
s3.b -> s3, s4;
s3.a -> s3;
s3.X -> s3;
s4.a -> s5;
s4.b -> s5;
s4.X -> s5;
s5.a -> s6;
s5.b -> s6;
s5.X -> s6;

You can copy-paste it into semigroup to get this nice graph:

image

(By convention we mark the wildcard as capital X, since the tool only accepts letters)

So if you come up with an interesting case to test you can easily find the minimal automaton to verify the result.