janet-lang / janet

A dynamic language and bytecode vm
https://janet-lang.org
MIT License

Add PEG special to get the nth capture #1503

Status: Closed (katomuso closed 2 months ago)

katomuso commented 2 months ago

Take a look at the following example, where each (* :prefix ":" :word) produces two captures: a number giving the length of the string that follows (:prefix) and the string itself (:word):

(def s "5:apple6:banana6:cherry")
(def g ~{:prefix (number :d+ nil :n)
         :word '(lenprefix (-> :n) :w)
         :main (some (* :prefix ":" :word))})
(peg/match g s)
# => @[5 "apple" 6 "banana" 6 "cherry"]

The problem is that I want only the strings themselves, not their length prefixes. Currently I have to use (cmt ... ,|$1), which is inconvenient to read: to understand it, I have to jump to the very end of the form to find out which capture is kept, and then back to the beginning:

(def s "5:apple6:banana6:cherry")
(def g ~{:prefix (number :d+ nil :n)
         :word '(lenprefix (-> :n) :w)
         :main (some (cmt (* :prefix ":" :word) ,|$1))})
(peg/match g s)
# => @["apple" "banana" "cherry"]
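Until something like that exists, the cmt workaround can at least be wrapped so that the index is stated up front. This is only a sketch: nth-capture is a hypothetical helper name, not part of Janet, and it is built solely from the cmt special used above. One caveat: cmt fails the match when its function returns nil or false, so this only works when the selected capture is truthy.

```janet
# Hypothetical helper (not a built-in): build a cmt pattern that keeps
# only the capture at index n (0-based) out of those produced by patt.
(defn nth-capture
  "Wrap patt so that only its n-th capture survives."
  [n patt]
  ~(cmt ,patt ,(fn [& caps] (get caps n))))

(def s "5:apple6:banana6:cherry")
(def g ~{:prefix (number :d+ nil :n)
         :word '(lenprefix (-> :n) :w)
         # The index now appears before the pattern it applies to.
         :main (some ,(nth-capture 1 '(* :prefix ":" :word)))})
(pp (peg/match g s))
# prints @["apple" "banana" "cherry"]
```

The helper runs at grammar-construction time, so the resulting grammar contains an ordinary cmt form and behaves the same as the hand-written version.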

It would be nicer if there were a special like (nth n patt), so the previous example would look like this instead:

(def s "5:apple6:banana6:cherry")
(def g ~{:prefix (number :d+ nil :n)
         :word '(lenprefix (-> :n) :w)
         :main (some (nth 1 (* :prefix ":" :word)))})
(peg/match g s)
# => @["apple" "banana" "cherry"]

bakpakin commented 2 months ago

There are a couple of ways to work around this, but I assume this is just an example, so they may not be what you are looking for. There is also the obvious option of extracting all of the words and using (get words 1) to get the second word.

However, the main idea is to have two kinds of words: captured words and non-captured words.

(def str "5:apple6:banana6:cherry")
(def grammar
  ~{:word (lenprefix (* '(number :d+) ":") :w)
    :main (* :word ':word (any :word))})
(def grammar-compiled (peg/compile grammar))
(pp (peg/match grammar-compiled str))

To get only the second word, we match a single throw-away word without capturing it, then capture the word we actually want, and then match any number of subsequent words (or we could ignore them).

Does this work for your use case? Getting the nth capture is, to me, not a natural operation in a parser, so if you want to do it you need to use the cmt special, which is a sort of escape hatch for arbitrary computation during parsing.

EDIT: misread the problem, I assumed we were trying to get the second word "banana", not discard the prefix
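
For the problem as actually stated (discarding the length prefixes), the "extract everything, then pick what you need" option mentioned above can be sketched as follows. This is an illustration rather than a recommendation: it relies on the original grammar producing captures in strict prefix/word pairs, and the names caps and words are made up for the example.

```janet
(def s "5:apple6:banana6:cherry")
(def g ~{:prefix (number :d+ nil :n)
         :word '(lenprefix (-> :n) :w)
         :main (some (* :prefix ":" :word))})

# Flat capture array: prefix, word, prefix, word, ...
(def caps (peg/match g s))

# Keep every second element, starting at index 1 (the words).
(def words (seq [i :range [1 (length caps) 2]] (in caps i)))
(pp words)
# prints @["apple" "banana" "cherry"]
```

This keeps the grammar simple, at the cost of moving knowledge about the capture layout out of the grammar and into the calling code.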

bakpakin commented 2 months ago

So reading a bit more into the example, I guess there is a little more here. Assuming we are constraining ourselves with the following:

  1. Each word we want to capture will generate more than one capture.
  2. We want to pull a single capture out of the several captures created.

I suppose an nth combinator could do this, but I think another mechanism here would be something like backref and drop combined into one rule: basically, get a tagged capture and drop everything else. I think that might be easier to use than needing to keep track of capture indices.

katomuso commented 2 months ago

Works like a charm, thanks!