Closed by toraritte 5 years ago
The spec also states that the input pattern needs to be translated to a regex that will be applied to the input text.
Regex patterns that conform to the spec:
%{n} = /\b(.*)\b/
E.g.: "foo %{0} is a %{1}" -> /foo \b(.*)\b is a \b(.*)\b/
Corollary: "foo is a" won't be matched, which makes sense: I had forgotten to take into account that spaces are part of the pattern DSL, and therefore there has to be a "word" boundary.
re> /foo \b(.*)\b is a \b(.*)\b/
data> foo bla is a bar
0: foo bla is a bar
1: bla
2: bar
data> foo bla is a ver big boat
0: foo bla is a ver big boat
1: bla
2: ver big boat
data> foo is a
No match
data> foo bala bab is a very big boat
0: foo bala bab is a very big boat
1: bala bab
2: very big boat
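The translation above can be sketched in Python, whose `re` module treats `\b`, `.*` and capture groups the same way for these ASCII inputs. This is only an illustration under assumptions: the function name `simple_pattern_to_regex` and the `TOKEN` helper are hypothetical and not taken from the implementation, and only plain `%{n}` tokens are handled.

```python
import re

# Hypothetical helper: recognises plain tokens such as %{0} or %{12}.
TOKEN = re.compile(r"%\{(\d+)\}")

def simple_pattern_to_regex(pattern: str) -> str:
    """Escape literal text and replace every %{n} with \\b(.*)\\b."""
    parts = []
    pos = 0
    for m in TOKEN.finditer(pattern):
        # Literal text between tokens is escaped so it matches verbatim.
        parts.append(re.escape(pattern[pos:m.start()]))
        parts.append(r"\b(.*)\b")
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return "".join(parts)

regex = simple_pattern_to_regex("foo %{0} is a %{1}")

for line in ["foo bla is a bar", "foo is a", "foo bala bab is a very big boat"]:
    m = re.search(regex, line)
    print(line, "->", m.groups() if m else "No match")
```

Note that `re.escape` also escapes spaces, so the generated string differs textually from the hand-written regex while matching the same inputs.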
With the regex choice in 1., the behaviour of the ambiguous case matches the result of the spec example "the %{0S1} %{1} ran away":
re> /the \b(.*)\b \b(.*)\b ran away/
data> the big brown fox ran away
0: the big brown fox ran away
1: big brown
2: fox
Initial choices for this modifier:
%{nS0} = /\b(\S+)\b/
%{nSN} = /\b(\S+\s{1}\S+ ... \s{1}\S+)\b/
                -- 1 ---     -- N ---
(Recursive patterns would come in handy, but if this pattern works out then it can be built iteratively without the added mental complexity. We'll see.)
----------------------------------------
--- %{0S1} -> same as ambiguous case ---
----------------------------------------
re> /the \b(\S+\s{1}\S+)\b \b(.*)\b ran away/
data> the big brown fox ran away
0: the big brown fox ran away
1: big brown
2: fox
---------------------------------------------
--- %{0S0} ---
--- This behaviour hasn't been specified, ---
--- but it still matches the entire line. ---
---------------------------------------------
re> /the \b(\S+)\b \b(.*)\b ran away/
data> the big brown fox ran away
0: the big brown fox ran away
1: big
2: brown fox
---------------------------------------------
--- %{0S2} ---
--- Again, not specified, but consistent ---
--- with the non-empty capturing groups ---
--- case. ---
---------------------------------------------
re> /the \b(\S+\s{1}\S+\s{1}\S+)\b \b(.*)\b ran away/
data> the big brown fox ran away
No match
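Building the `%{nSN}` regex iteratively, as suggested above, could look like the following sketch. The function name `space_limited_group` is hypothetical; the only assumption is that N spaces translate to N repetitions of `\s{1}\S+` after the first `\S+`, which is what the three sessions above use.

```python
import re

def space_limited_group(n_spaces: int) -> str:
    """Return the capture group for %{nSN}: one word plus N ' word' repetitions."""
    return r"\b(\S+" + r"\s{1}\S+" * n_spaces + r")\b"

line = "the big brown fox ran away"

for n in range(3):
    # Same shape as the sessions above: "the %{0Sn} %{1} ran away".
    regex = "the " + space_limited_group(n) + r" \b(.*)\b ran away"
    m = re.search(regex, line)
    print(f"%{{0S{n}}}:", m.groups() if m else "No match")
```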
Wrap the generated regexes between the anchors ^ and $. Without this requirement the SLM wouldn't work according to the spec when used at the end of the pattern:
         --%{0}--      --%{1S0}-
re> /foo \b(.*)\b is a \b(\S+)\b/
data> foo blah is a bar
0: foo blah is a bar
1: blah
2: bar
data> foo blah is a very big boat
0: foo blah is a very
1: blah
2: very
re> /^foo \b(.*)\b is a \b(\S+)\b$/
data> foo blah is a bar
0: foo blah is a bar
1: blah
2: bar
data> foo blah is a very big boat
No match
(All of the examples above also passed when wrapped in anchors.)
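The effect of the anchors can be reproduced with Python's `re` module. This is a small sketch, not the project's code; `re.fullmatch` is shown as a Python-specific alternative to adding `^` and `$` by hand.

```python
import re

unanchored = r"foo \b(.*)\b is a \b(\S+)\b"
anchored = "^" + unanchored + "$"

short = "foo blah is a bar"
long_ = "foo blah is a very big boat"

# Unanchored: the trailing %{1S0} silently matches a prefix of the line.
print(re.search(unanchored, long_).group(0))

# Anchored: the same line is rejected, the short one still matches.
print(re.search(anchored, long_))
print(re.search(anchored, short).groups())

# re.fullmatch requires the whole string to match without explicit anchors.
print(re.fullmatch(unanchored, long_))
```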
Making %{nG} the same as %{n} works fine and is consistent with the choice for SLM.
          --%{0}--     --%{1}--
re> /^bar \b(.*)\b foo \b(.*)\b$/
data> bar foo bar foo bar foo bar foo
0: bar foo bar foo bar foo bar foo
1: foo bar foo bar
2: bar foo
          -%{0S0}-      --%{1}--
re> /^bar \b(\S+)\b foo \b(.*)\b$/
data> bar foo bar foo bar foo bar foo
No match
          -----%{0S1}-----      --%{1}--
re> /^bar \b(\S+\s{1}\S+)\b foo \b(.*)\b$/
data> bar foo bar foo bar foo bar foo
0: bar foo bar foo bar foo bar foo
1: foo bar
2: bar foo bar foo
          ---------%{0S2}----------     --%{1}--
re> /^bar \b(\S+\s{1}\S+\s{1}\S+)\b foo \b(.*)\b$/
data> bar foo bar foo bar foo bar foo
No match
          -------------%{0S3}--------------     --%{1}--
re> /^bar \b(\S+\s{1}\S+\s{1}\S+\s{1}\S+)\b foo \b(.*)\b$/
data> bar foo bar foo bar foo bar foo
0: bar foo bar foo bar foo bar foo
1: foo bar foo bar
2: bar foo
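Putting the choices above together (plain, `G` and `SN` tokens, anchored on both ends), a translator might be sketched as follows. The names `TOKEN`, `token_regex` and `pattern_to_regex` are hypothetical and the token grammar is an assumption inferred from the examples in this thread, not from the implementation. Captures are returned in order of appearance; the token index `n` is parsed but not used.

```python
import re

# Assumed token grammar: %{n}, %{nG} or %{nSk}, with n and k decimal integers.
TOKEN = re.compile(r"%\{(\d+)(?:(G)|S(\d+))?\}")

def token_regex(spaces):
    """%{nSk} gets the space-limited group; %{n} and %{nG} share \\b(.*)\\b."""
    if spaces is not None:
        return r"\b(\S+" + r"\s{1}\S+" * int(spaces) + r")\b"
    return r"\b(.*)\b"

def pattern_to_regex(pattern: str) -> str:
    out = ["^"]
    pos = 0
    for m in TOKEN.finditer(pattern):
        out.append(re.escape(pattern[pos:m.start()]))  # literal text
        out.append(token_regex(m.group(3)))            # capture for the token
        pos = m.end()
    out.append(re.escape(pattern[pos:]))
    out.append("$")
    return "".join(out)

line = "bar foo bar foo bar foo bar foo"
for tok in ["%{0}", "%{0G}", "%{0S0}", "%{0S1}", "%{0S2}", "%{0S3}"]:
    m = re.search(pattern_to_regex(f"bar {tok} foo %{{1}}"), line)
    print(tok, "->", m.groups() if m else "No match")
```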
TODO: Test corner cases.
Closing because the current implementation won't allow it (at least until 4ef82f3087f6f2e85905ab14fb309fcbf91b93c1), and it also lines up nicely with the specification.
According to the spec, the TCS "will capture any amount of text that occurs between the adjacent text literals".
In my interpretation this means that empty captures are allowed, but the "greedy token capture modifier" section (stating that it "captures as much text as possible between preceding and following string literals") and its example contradict this:
Original example:
PATTERN: "bar %{0G} foo %{1}"
INPUT: "bar foo bar foo bar foo bar foo"
=> "bar (0:foo bar foo bar) foo (1:bar foo)"
If empty strings are allowed:
=> "bar (0:foo bar foo bar foo bar) foo (1:"")"
This would also affect when a pattern using the whitespace modifier would match:
The original example would match the line either way,
PATTERN: "the %{0S1} %{1} ran away"
INPUT: "the big brown fox ran away"
=> "the (0:big brown) (1:fox) ran away"
but replacing %{0S1} with %{0S2} would yield different results:
empty string OK => "the (0:big brown fox) (1:"") ran away"
else => no match
Some corner cases that weren't listed in the description also wouldn't match:
PATTERN: "the %{0} is a %{1}"
INPUT: "the cake is a"
PATTERN: "foo %{0} bar"
INPUT: "foo bar"
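For comparison, here is how the two corner cases behave under the `\b(.*)\b` translation with `^`/`$` anchors discussed elsewhere in this thread. This is only a sketch using Python's `re` module with hand-written regexes. It suggests that a token sitting between space literals can never capture an empty string there, because `\b` needs a word character on one side.

```python
import re

cake = r"^the \b(.*)\b is a \b(.*)\b$"
foobar = r"^foo \b(.*)\b bar$"

# The two corner cases from the description: neither line matches.
print(re.search(cake, "the cake is a"))
print(re.search(foobar, "foo bar"))

# Even with the extra space an empty capture would need, \b blocks the match.
print(re.search(foobar, "foo  bar"))

# A non-empty capture still works.
print(re.search(foobar, "foo x bar").group(1))
```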