Closed ianthehenry closed 6 months ago
The fact that there are 32 opcodes is entirely coincidental and I was not even aware of this. I suppose this could result in tighter code generation but I doubt it does.
The sub pattern looks interesting; it's similar to the lookahead combinator but requiring that the pattern is inside the window is useful.
Well that simplifies things! Yeah I couldn't work out why the bitmask was there so I didn't want to remove it. But I pulled that commit out.
Potential idea for distinguishing sep-some
vs sep-any
: an optional "miminum matches" argument. So (sep "," :w+ 0)
for any
, (sep "," :w+ 1)
for some
. I doubt you'd really use anything else, but (sep "," :w+ 10)
would theoretically work... basically works like at-least
.
Another idea is that sep
should always require at least one match, and if you want sep-any
you can just make it optional. Although that's... hmm. Don't love it.
My instinct is sep
and sep+
, using the regex sigils, but that's sort of inconsistent with PEGs...
Unfortunately, I cannot add anything meaningful to the implementation debate. Yet both specials proposed here make sense to me. In my current project, I am using PEG extensively for parsing small user inputs, and at least sub
will be very handy.
Interested in sub
(and am in the midst of implementing it in margaret).
I've not happened to have noticed a case of wanting a sep
-like thing (perhaps just not recognizing opportunities), though it seemed a number of folks were discussing an AOC thing this year that might have benefited?
I wonder if there's some value in adding just one special at first (e.g. sub
) and seeing how it works out in combination with the existing system and then after a bit of exercising, revisiting the other one. May be that's unnecessary.
For sub
, perhaps a test involving error
might be relevant:
(try
(peg/match ~(sequence "a"
(sub "bcd" (error "bc")))
"abcdef")
([e] e))
# =>
"match error at line 1, column 2"
as it makes use of the line and column functionality and the PR touches get_linecol_from_position
which is used from the error
special.
@sogaiu I have the same feeling about sep
, but the arguments in the description of this PR make it a little bit easier to believe for me :-). Anyway, I like the idea of starting with sub
; if everything is correct, consider sep
.
FWIW, in addition to the tests for sub
contained in this PR, I've added some additional tests for the sub
implementation in margaret. I generated the expected values via a janet compiled from this commit of this PR.
For the interested, they live here.
@sogaiu there is a test showing the behavior of the line
/col
/position
helpers, which I think is sufficient to cover the error
case as well (if I understand correctly).
Regarding sep
, I think it does make sense to open a separate PR for that. I'm not sure if the behavior here is exactly right -- really I think of sep
as a way to run string/split
inside a PEG. Except that the implementation here was not exactly that -- it's sort of "string split for as long as you can but stop as soon as something fails to match." Which is maybe less useful than something that requires that all split substrings match exactly.
I just rebased this branch and updated the description. There is another high-level sub
-inspired combinator that I think would be useful, I might open a separate PR for that and sep
/split
if this gets merged.
@ianthehenry The latest commit looks good to me :+1:
Probably it wasn't necessary but I also verified against the additional tests I mentioned earlier.
So far my impression is that sub
is working well with other bits -- have my fingers crossed that remains the case and that we can see additional goodies you have in mind ;)
I found an issue with PEG unmarshaling and I added a previously-failing test for it -- the bytecode verifier was also doing the & 0x1f
bitmasking, and I failed to remove that when I rebased out the 32-opcodes-only commit.
LGTM.
The first commit in this branch condenses the two of the existing opcodes into a single opcode, so that there are still exactly 32 PEG opcodes. I asked about this on Zulip, and as I said there I'm still not very confident that this code is necessary. It does remove a conditional branch on the main bytecode loop, but I worry there might be some 32-bitness thing that I don't understand that might make it even more costly to remove.It then proposes
twoone new specials:sub
(sub window patt)
executes thewindow
pattern, remembers how many bytes it matches, then executespatt
over exactly the bytes matched by thewindow
. This allows you to limit what the subpatt
ern can match, e.g.(sub (to "\n") :foo)
ensures that:foo
does not have a chance to match beyond the current line. Ifpatt
also matches, then the full window is consumed by the(sub)
rule.Another good name for this might be
limit
, as in(limit (to "\n") patt)
, but this doesn't really communicate that the full matched text is equal to thewindow
pattern.On the surface this is similar to
(cmt (% ...) ,|(peg/match ... $))
, but differs in several ways: the subpattern insub
can still refer to:other
patterns inside struct rules, theposition
and other helpers are still relative to the whole input, the pattern can leave multiple captures...sep
(sep separator patt)
is a combination ofsub
,to
, andthru
. The canonical example would be something like~(sep "," :w*)
to match comma-separated words.It's similar to
~(some (* (sub (to (+ ,separator -1)) ,patt) (? ,separator)))
, except that it will not consume terminating separators: the final(? ,separator)
rule is conditional on the next instance of the subpattern matching. A more accurate translation would be:Which is pretty unwieldy. I think this is a common enough pattern to deserve an optimized helper -- it compiles to a lot less bytecode and doesn't match
separator
s twice.Since
sep
repeats, it might be useful to have separte forms likesep-any
andsep-some
.Neither of these combinators are really necessary, as you can do a one-pass thing that ensures that your inner patterns never advance past a separator (which is necessary in the case of a real csv parser that has to handle escapes). But they're convenient for lots of simple ad-hoc parsing tasks (hello from advent of code).