janet-lang / janet

A dynamic language and bytecode vm
https://janet-lang.org
MIT License
3.38k stars 217 forks source link

Add a new (sub) PEG special #1344

Closed ianthehenry closed 6 months ago

ianthehenry commented 6 months ago

The first commit in this branch condenses the two of the existing opcodes into a single opcode, so that there are still exactly 32 PEG opcodes. I asked about this on Zulip, and as I said there I'm still not very confident that this code is necessary. It does remove a conditional branch on the main bytecode loop, but I worry there might be some 32-bitness thing that I don't understand that might make it even more costly to remove.

It then proposes two one new specials:

sub

(sub window patt) executes the window pattern, remembers how many bytes it matches, then executes patt over exactly the bytes matched by the window. This allows you to limit what the subpattern can match, e.g. (sub (to "\n") :foo) ensures that :foo does not have a chance to match beyond the current line. If patt also matches, then the full window is consumed by the (sub) rule.

Another good name for this might be limit, as in (limit (to "\n") patt), but this doesn't really communicate that the full matched text is equal to the window pattern.

On the surface this is similar to (cmt (% ...) ,|(peg/match ... $)), but differs in several ways: the subpattern in sub can still refer to :other patterns inside struct rules, the position and other helpers are still relative to the whole input, the pattern can leave multiple captures...


sep

I'll move this into its own PR, but I'm preserving the original description so the discussion still makes sense.

(sep separator patt) is a combination of sub, to, and thru. The canonical example would be something like ~(sep "," :w*) to match comma-separated words.

It's similar to ~(some (* (sub (to (+ ,separator -1)) ,patt) (? ,separator))), except that it will not consume terminating separators: the final (? ,separator) rule is conditional on the next instance of the subpattern matching. A more accurate translation would be:

(defn sep [separator patt]
  ~(? (sub (to (+ ,separator -1)) ,patt) (any (* ,separator (sub (to (+ ,separator -1)) ,patt)))))

Which is pretty unwieldy. I think this is a common enough pattern to deserve an optimized helper -- it compiles to a lot less bytecode and doesn't match separators twice.

Since sep repeats, it might be useful to have separte forms like sep-any and sep-some.

Neither of these combinators are really necessary, as you can do a one-pass thing that ensures that your inner patterns never advance past a separator (which is necessary in the case of a real csv parser that has to handle escapes). But they're convenient for lots of simple ad-hoc parsing tasks (hello from advent of code).

bakpakin commented 6 months ago

The fact that there are 32 opcodes is entirely coincidental and I was not even aware of this. I suppose this could result in tighter code generation but I doubt it does.

bakpakin commented 6 months ago

The sub pattern looks interesting; it's similar to the lookahead combinator but requiring that the pattern is inside the window is useful.

ianthehenry commented 6 months ago

Well that simplifies things! Yeah I couldn't work out why the bitmask was there so I didn't want to remove it. But I pulled that commit out.

Potential idea for distinguishing sep-some vs sep-any: an optional "miminum matches" argument. So (sep "," :w+ 0) for any, (sep "," :w+ 1) for some. I doubt you'd really use anything else, but (sep "," :w+ 10) would theoretically work... basically works like at-least.

Another idea is that sep should always require at least one match, and if you want sep-any you can just make it optional. Although that's... hmm. Don't love it.

My instinct is sep and sep+, using the regex sigils, but that's sort of inconsistent with PEGs...

pepe commented 6 months ago

Unfortunately, I cannot add anything meaningful to the implementation debate. Yet both specials proposed here make sense to me. In my current project, I am using PEG extensively for parsing small user inputs, and at least sub will be very handy.

sogaiu commented 6 months ago

Interested in sub (and am in the midst of implementing it in margaret).

I've not happened to have noticed a case of wanting a sep-like thing (perhaps just not recognizing opportunities), though it seemed a number of folks were discussing an AOC thing this year that might have benefited?

I wonder if there's some value in adding just one special at first (e.g. sub) and seeing how it works out in combination with the existing system and then after a bit of exercising, revisiting the other one. May be that's unnecessary.


For sub, perhaps a test involving error might be relevant:

  (try
    (peg/match ~(sequence "a"
                          (sub "bcd" (error "bc")))
               "abcdef")
    ([e] e))
  # =>
  "match error at line 1, column 2"

as it makes use of the line and column functionality and the PR touches get_linecol_from_position which is used from the error special.

pepe commented 6 months ago

@sogaiu I have the same feeling about sep, but the arguments in the description of this PR make it a little bit easier to believe for me :-). Anyway, I like the idea of starting with sub; if everything is correct, consider sep.

sogaiu commented 6 months ago

FWIW, in addition to the tests for sub contained in this PR, I've added some additional tests for the sub implementation in margaret. I generated the expected values via a janet compiled from this commit of this PR.

For the interested, they live here.

ianthehenry commented 6 months ago

@sogaiu there is a test showing the behavior of the line/col/position helpers, which I think is sufficient to cover the error case as well (if I understand correctly).

Regarding sep, I think it does make sense to open a separate PR for that. I'm not sure if the behavior here is exactly right -- really I think of sep as a way to run string/split inside a PEG. Except that the implementation here was not exactly that -- it's sort of "string split for as long as you can but stop as soon as something fails to match." Which is maybe less useful than something that requires that all split substrings match exactly.

I just rebased this branch and updated the description. There is another high-level sub-inspired combinator that I think would be useful, I might open a separate PR for that and sep/split if this gets merged.

sogaiu commented 6 months ago

@ianthehenry The latest commit looks good to me :+1:

Probably it wasn't necessary but I also verified against the additional tests I mentioned earlier.

So far my impression is that sub is working well with other bits -- have my fingers crossed that remains the case and that we can see additional goodies you have in mind ;)

ianthehenry commented 6 months ago

I found an issue with PEG unmarshaling and I added a previously-failing test for it -- the bytecode verifier was also doing the & 0x1f bitmasking, and I failed to remove that when I rebased out the 32-opcodes-only commit.

bakpakin commented 6 months ago

LGTM.