janet-lang / janet

A dynamic language and bytecode vm
https://janet-lang.org
MIT License
3.38k stars 217 forks source link

add a new (split) PEG special #1346

Closed ianthehenry closed 6 months ago

ianthehenry commented 6 months ago

This works similarly to string/split, but the separator is a PEG.

I also added a peg/split helper function, and a new opcode to optimize the (to -1) pattern so that it will only do two passes over the input instead of three.

sogaiu commented 6 months ago

With respect to peg/split, perhaps there will be requests for it later and if so, addition can be considered at that time? (Although I have wanted something along those lines before and imagine others have too.)

sogaiu commented 6 months ago

For any other folks who might be checking this PR out, highly recommended are the tests for gaining some familiarity.

As the tests demonstrate, it seems worth noting that captures get dropped from the separator pattern and that the separator pattern doesn't have to match anything.

That the "separator can be an arbitrary PEG" is especially nice (possibly one of the main motivators?) and what inclines me toward a peg/split function (^^;

ianthehenry commented 6 months ago

I went ahead and added a peg/split helper function too; it seems like it's generally useful.

sogaiu commented 6 months ago

Working well here so far.

Nice that the start and args equivalents of peg/match seem to work too.

sogaiu commented 6 months ago

So I've been studying RULE_SPLIT using gdb (actually rr [1]) a bit and for the following Janet code (one of the tests I think):

(peg/match ~(split "," (capture (to -1)))
           "a,,bar,c")

when RULE_SPLIT is done and returning, text seems to point to what seems like garbage "\031\373\207\253\325\376\377A" -- it's one byte beyond the terminating zero byte I think (see below for more details).

What I'm seeing in the locals view is this:

 const uint8_t *            text 0x55ab899fb241 "\031\373\207\253\325\376\377A"

The rest of the locals view at the time looks like this:

 const uint8_t *       saved_end 0x55ab899fb240 ""
const uint32_t *  rule_separator 0x55ab899fbdec
const uint32_t * rule_subpattern 0x55ab899fbdf8
 const uint8_t *   separator_end 0x0
      PegState *               s 0x7ffceeb16f98
const uint32_t *            rule 0x55ab899fbde0

saved_end is at 0x55ab899fb240 -- this is one byte before text 0x55ab899fb241.

rr says:

(rr) p s->text_start
$1 = (const uint8_t *) 0x55ab899fb238 "a,,bar,c"

(rr) p saved_end
$2 = (const uint8_t *) 0x55ab899fb240 ""

(rr) p s->text_start + 8
$3 = (const uint8_t *) 0x55ab899fb240 ""

(rr) p *(s->text_start + 8)
$4 = 0 '\000'

(rr) p s->text_start + 9
$5 = (const uint8_t *) 0x55ab899fb241 "\031\373\207\253\325\376\377A"

IIUC, s->text_start + 8 is the same as saved_end and it points to the terminating zero byte.

The code works fine here but this feels odd. May be it's nothing to worry about?


[1] See this gist for using rr with janet if interested.

ianthehenry commented 6 months ago

Hey thanks! That's interesting. It does seem wrong to advance past the input; I don't know if anything bad would happen but it seems plausible. I don't think any other rule does it, so split probably shouldn't either.

I just pushed a fix although I'm not really sure how to write an automated test for this; I'll try harder when I get off work.

sogaiu commented 6 months ago

I've no clue how one would write a test for this.

Perhaps it isn't necessary to do so.

ianthehenry commented 6 months ago

I found a test that failed with the old behavior but not the new behavior: the rule 0 should be a no-op that always succeeds, but (* (split ...) 0) would fail because it had advanced past the end of the input. This works as expected after the fix this morning. Great catch!

sogaiu commented 6 months ago

Nice that you found that test and thanks for the fix :tada:

bakpakin commented 6 months ago

So I don't like how peg/split doesn't accept precompiled PEGs unlike all of the other peg functions. I think we should remove that particular fuction for that reason, or implement it so that that works.

bakpakin commented 6 months ago

the TO_END rule also looks good and seems to be a good way to possibly speed up some matches. That said, I think for more optimizations a benchmark suite is needed. I'm reluctant to add performance optimizations without measurement unless they are trivial.

ianthehenry commented 6 months ago

Oh yeah, I didn't think of that! It would be possible but maybe crazy to add a RULE_RUN_THIS_OTHER_PEG and allow interpolating pre-compiled PEGs inside arbitrary PEGs. Or to copy in their bytecode/constants with adjusted offsets. But probably the simplest thing is to implement peg/split as a separate cfunction.

I just removed that commit for now; I'm not really convinced the function is pulling its weight if it requires re-implementing the split logic.

bakpakin commented 6 months ago

Ok, currently LGTM. Thanks @ianthehenry !