Closed ianthehenry closed 6 months ago
With respect to peg/split
, perhaps there will be requests for it later and if so, addition can be considered at that time? (Although I have wanted something along those lines before and imagine others have too.)
For any other folks who might be checking this PR out, highly recommended are the tests for gaining some familiarity.
As the tests demonstrate, it seems worth noting that captures get dropped from the separator pattern and that the separator pattern doesn't have to match anything.
That the "separator can be an arbitrary PEG" is especially nice (possibly one of the main motivators?) and what inclines me toward a peg/split
function (^^;
I went ahead and added a peg/split
helper function too; it seems like it's generally useful.
Working well here so far.
Nice that the start
and args
equivalents of peg/match
seem to work too.
So I've been studying RULE_SPLIT
using gdb
(actually rr
[1]) a bit and for the following Janet code (one of the tests I think):
(peg/match ~(split "," (capture (to -1)))
"a,,bar,c")
when RULE_SPLIT
is done and returning, text
seems to point to what seems like garbage "\031\373\207\253\325\376\377A"
-- it's one byte beyond the terminating zero byte I think (see below for more details).
What I'm seeing in the locals view is this:
const uint8_t * text 0x55ab899fb241 "\031\373\207\253\325\376\377A"
The rest of the locals view at the time looks like this:
const uint8_t * saved_end 0x55ab899fb240 ""
const uint32_t * rule_separator 0x55ab899fbdec
const uint32_t * rule_subpattern 0x55ab899fbdf8
const uint8_t * separator_end 0x0
PegState * s 0x7ffceeb16f98
const uint32_t * rule 0x55ab899fbde0
saved_end
is at 0x55ab899fb240
-- this is one byte before text
0x55ab899fb241
.
rr
says:
(rr) p s->text_start
$1 = (const uint8_t *) 0x55ab899fb238 "a,,bar,c"
(rr) p saved_end
$2 = (const uint8_t *) 0x55ab899fb240 ""
(rr) p s->text_start + 8
$3 = (const uint8_t *) 0x55ab899fb240 ""
(rr) p *(s->text_start + 8)
$4 = 0 '\000'
(rr) p s->text_start + 9
$5 = (const uint8_t *) 0x55ab899fb241 "\031\373\207\253\325\376\377A"
IIUC, s->text_start + 8
is the same as saved_end
and it points to the terminating zero byte.
The code works fine here but this feels odd. May be it's nothing to worry about?
[1] See this gist for using rr
with janet if interested.
Hey thanks! That's interesting. It does seem wrong to advance past the input; I don't know if anything bad would happen but it seems plausible. I don't think any other rule does it, so split
probably shouldn't either.
I just pushed a fix although I'm not really sure how to write an automated test for this; I'll try harder when I get off work.
I've no clue how one would write a test for this.
Perhaps it isn't necessary to do so.
I found a test that failed with the old behavior but not the new behavior: the rule 0
should be a no-op that always succeeds, but (* (split ...) 0)
would fail because it had advanced past the end of the input. This works as expected after the fix this morning. Great catch!
Nice that you found that test and thanks for the fix :tada:
So I don't like how peg/split
doesn't accept precompiled PEGs unlike all of the other peg functions. I think we should remove that particular fuction for that reason, or implement it so that that works.
the TO_END rule also looks good and seems to be a good way to possibly speed up some matches. That said, I think for more optimizations a benchmark suite is needed. I'm reluctant to add performance optimizations without measurement unless they are trivial.
Oh yeah, I didn't think of that! It would be possible but maybe crazy to add a RULE_RUN_THIS_OTHER_PEG
and allow interpolating pre-compiled PEGs inside arbitrary PEGs. Or to copy in their bytecode/constants with adjusted offsets. But probably the simplest thing is to implement peg/split
as a separate cfunction.
I just removed that commit for now; I'm not really convinced the function is pulling its weight if it requires re-implementing the split logic.
Ok, currently LGTM. Thanks @ianthehenry !
This works similarly to
string/split
, but the separator is a PEG.I also added a
peg/split
helper function, and a new opcode to optimize the(to -1)
pattern so that it will only do two passes over the input instead of three.