Can a token capture sequence (TCS) evaluate to an empty string?

toraritte commented 5 years ago

According to the spec, the TCS "will capture any amount of text that occurs between the adjacent text literals".

In my interpretation it means that it is allowed, but he "greedy token capture modifier" section (stating that it "captures as much text as possible between preceding and following string literals") and its example contradicts this:

Original example: PATTERN: "bar %{0G} foo %{1}" INPUT: "bar foo bar foo bar foo bar foo" => "bar (0: foo bar foo bar) foo (1: bar foo)"
If empty strings are allowed: => "bar (0: foo bar foo bar foo bar) foo (1: "")"

This would also affect when would a pattern using whitespace modifier match:

The original example would match the line either way, PATTERN: "the %{0S1} %{1} ran away" INPUT: "the big brown fox ran away" => "the (0: big brown) (1: fox) ran away"
but replacing %{0S1} with %{0S2} would yield different results: empty string OK => "the (0: big brown fox) (1: "") ran away" else => no match

Some corner cases also wouldn't match, that weren't listed in the description:

PATTERN: "the %{0} is a %{1}" INPUT: "the cake is a"
PATTERN: "foo %{0} bar" INPUT: "foo bar"

toraritte commented 5 years ago

The spec also states that the input pattern needs to be translated to a regex that will be applied to the input text.

Experiment with regexes first and see how complexity changes either way.
What contexts would this pattern DSL be useful in? It may be that in one of these domains an empty capture doesn't even make sense.

toraritte commented 5 years ago

Regex patterns that conform to the spec:

1. Token capture sequences

%{n} = /\b(.*)\b/

E.g.: "foo %{0} is a %{1}" -> /foo \b(.*)\b is a \b(.*)\b/

Corollary: "foo is a" won't be matched, which makes sense as I forgot to take into consideration that spaces are part of the pattern DSL, and therefore there has to be a "word" boundary.

  re> /foo \b(.*)\b is a \b(.*)\b/

data> foo bla is a bar
 0: foo bla is a bar
 1: bla
 2: bar

data> foo bla is a ver big boat     
 0: foo bla is a ver big boat
 1: bla
 2: ver big boat

data> foo is a 
No match

data> foo bala bab is a very big boat
 0: foo bala bab is a very big boat
 1: bala bab
 2: very big boat

toraritte commented 5 years ago

2. Space limitation modifier (SLM)

With the regex choice in 1., the behaviour of the ambiguous case matches the results of the spec example"the %{0S1} %{1} ran away":

  re> /the \b(.*)\b \b(.*)\b ran away/
data> the big brown fox ran away
 0: the big brown fox ran away
 1: big brown
 2: fox

Initial choices for this modifier:

%{nS0} = /\b(\S+)\b/
%{nSN} = /\b(\S+\s{1}\S+)  ...  \s{1}\S+\b/
                --- 1 ---       --- N --

(Recursive patterns would come in handy, but if this pattern works out then it can be built iteratively without the added mental complexity. We'll see.)

----------------------------------------
--- %{0S1} -> same as ambiguous case ---
----------------------------------------
  re> /the \b(\S+\s{1}\S+)\b \b(.*)\b ran away/
data> the big brown fox ran away
 0: the big brown fox ran away
 1: big brown
 2: fox

---------------------------------------------
--- %{0S0}                                ---
--- This behaviour hasn't been specified, ---
--- but it still matches the entire line. ---
---------------------------------------------
  re> /the \b(\S+)\b \b(.*)\b ran away/
data> the big brown fox ran away
 0: the big brown fox ran away
 1: big
 2: brown fox

---------------------------------------------
--- %{0S2}                                ---
--- Again, not specified, but consistent  ---
--- with the non-empty capturing groups   ---
--- case.                                 ---
---------------------------------------------
  re> /the \b(\S+\s{1}\S+\s{1}\S+)\b \b(.*)\b ran away/
data> the big brown fox ran away
No match

toraritte commented 5 years ago

Wrap the generated regexes between anchors ^ and $. Without this requirement the SLM wouldn't work according to the spec when used at the end of the pattern:

            --%{0}--      --%{1S0}-
  re> /foo \b(.*)\b is a \b(\S+)\b/
data> foo blah is a bar
 0: foo blah is a bar
 1: blah
 2: bar
data> foo blah is a very big boat
 0: foo blah is a very
 1: blah
 2: very

  re> /^foo \b(.*)\b is a \b(\S+)\b$/
data> foo blah is a bar
 0: foo blah is a bar
 1: blah
 2: bar
data> foo blah is a very big boat
No match

(All above examples passed when being wrapped.)

toraritte commented 5 years ago

3. Greedy token capture modifier

Making %{nG} the same as %{n} works fine and consistent with the choice for SLM.

            --%{0}--     --%{1}--
  re> /^bar \b(.*)\b foo \b(.*)\b$/
data> bar foo bar foo bar foo bar foo
 0: bar foo bar foo bar foo bar foo
 1: foo bar foo bar
 2: bar foo

            -%{0S0}-      --%{1}--
  re> /^bar \b(\S+)\b foo \b(.*)\b$/
data> bar foo bar foo bar foo bar foo
No match

            -----%{0S1}-----      --%{1}--
  re> /^bar \b(\S+\s{1}\S+)\b foo \b(.*)\b$/
data> bar foo bar foo bar foo bar foo
 0: bar foo bar foo bar foo bar foo
 1: foo bar
 2: bar foo bar foo

            ---------%{0S2}----------     --%{1}--
  re> /^bar \b(\S+\s{1}\S+\s{1}\S+)\b foo \b(.*)\b$/
data> bar foo bar foo bar foo bar foo
No match

            -------------%{0S3}--------------     --%{1}--
  re> /^bar \b(\S+\s{1}\S+\s{1}\S+\s{1}\S+)\b foo \b(.*)\b$/
data> bar foo bar foo bar foo bar foo
 0: bar foo bar foo bar foo bar foo
 1: foo bar foo bar
 2: bar foo

TODO: Test corner cases.

toraritte commented 5 years ago

Closing because the current implementation won't allow it (at least until 4ef82f3087f6f2e85905ab14fb309fcbf91b93c1), and it also lines up nicely with the specification.

toraritte / program