katef / kgt

BNF wrangling and railroad diagrams
BSD 2-Clause "Simplified" License
587 stars 30 forks source link

Exponential grow for case insensitive input. #37

Open ArthurSonzogni opened 3 years ago

ArthurSonzogni commented 3 years ago

Hi, Thanks for this tool. This is awesome!

I noticed some generator produce all the possibles combinaisons for case incentive input. It means this generate a list of item growing exponentially relative to the input length.

For instance if you use: (input: abnf)

test = "test"

Then you get your get. (output: "rrta")

add('test', Diagram(
  Choice(0,
    Terminal("test"),
    Terminal("Test"),
    Terminal("tEst"),
    Terminal("TEst"),
    Terminal("teSt"),
    Terminal("TeSt"),
    Terminal("tESt"),
    Terminal("TESt"),
    Terminal("tesT"),
    Terminal("TesT"),
    Terminal("tEsT"),
    Terminal("TEsT"),
    Terminal("teST"),
    Terminal("TeST"),
    Terminal("tEST"),
    Terminal("TEST"))));

I was curious to know if this is something you want to keep or fix later. If you want to fix it later, would you prefer I try to submit PR?

I tried all the generator online using this tool https://arthursonzogni.com/Diagon/#code_area

exponential output:

non exponential output:

Can't say (reached unimplemented)

katef commented 3 years ago

This is pretty terrible, I agree. I was trying to think about better ways to handle this, when I wrote it. Do you have any suggestions please?

The problem is that those dialects don't support case insensitive terminals, so we have to expand out every spelling.

ArthurSonzogni commented 3 years ago

I was thinking: instead of outputting every combination, to rewrite it using a sequence of alternatives:

Input:

http

output (current)

| http
| Http
| hTtp
| HTtp
| htTp
| HtTp
| hTTp
| HTTp
| httP
| HttP
| hTtP
| HTtP
| htTP
| HtTP
| hTTP
| HTTP

output (proposed;)

 (h | H) (t | T) (t | T) (p | P)

Obviously, this is a very naive idea from me. I guess there are reasons this can't be done. I just found out this is implemented by permute_case. I will take a look.

katef commented 3 years ago

Yes, unfortunately that has a different meaning. For a grammar where there could be whitespace between tokens (depending on how lexing is defined), here you're introducing the possibility for space between each letter, and to produce multiple tokens for output. The original grammar there has only one token for the entire string.