Closed PossiblyAShrub closed 2 months ago
This is ready for a review. I'd like to get some feedback on the design before I write any docs.
Hmm yeah these empty string cases are very subtle, and probably why I put this off for SO long!
The POSIX shSplit() is not exactly simple either -- splitting is hard. (Also the algorithm may still have some bugs! )
But thanks for looking deeply into it, let me take a look
$ python3 -c 'import re; print(re.split("\\s*", "a b cd"))'
/usr/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
return _compile(pattern, flags).split(string, maxsplit)
['a', 'b', 'cd']
$ python3 -V
Python 3.11.2
$ python3 -c 'import re; print(re.split("\\s*", "a b cd"))'
['', 'a', '', 'b', '', 'c', 'd', '']
Weird! OK I got nerd sniped by this
More digging
https://bugs.python.org/issue43222
There was a bug in the regular expression engine which caused re.split() working incorrectly with zero-width patterns. Note that in your example _DIGIT_BOUNDARY_RE.split("10.0.0") returns ['10.0.0'] on Python 2.7 -- the result which you unlikely expected.
It was impossible to fix that bug without changing behavior of other functions in corner cases and breaking existing code. So we first made re.split() raising an exception instead of returning nonsensical result and added warnings for some other cases to help users to catch potential bugs in their code and avoid ambiguous patterns. You see this in 3.6. In 3.7 we fixed the underlying bug. It caused breakage of some user code, but it made regular expressions more consistent in long perspective and made zero-width patterns more usable.
I want to avoid semantics that we'll have to change later ...
I agree JS semantics are hard to document, which means hard to use
It is possible for us to throw an error on any split expression that matches an empty string?
Is that better / more predictable?
On the other hand, I suppose if we are confident that the current semantic is stable and will never change -- what if we change regex engines?
This actually can be a real issue ... there are significant differences between libc glob() on say OS X and Linux with respect to character classes ...
But I am not sure if that analogy applies here
I suppose if 2 libc's disagree about which regexes match empty string, then that is ORTHOGONAL to our behavior ... we would have a difference either way
But I guess it might not be a silent difference
I guess you have looked into this issue more than me, at this point
We are supposed to be writing an "executalble spec" -- so I guess the Litmus test is if someone is implementing YSH split() in Rust or Zig, without libc regex, without our exact toolchain -- which semantic makes sense?
it's certainly easier to implement without the UTF-8 decoding, which might lean toward an error, but we already have UTF-8 decoding
More info - https://pypi.org/project/regex/
Zero-width matches are not handled correctly in the re module before Python 3.7. The behaviour in those earlier versions is:
.split won’t split a string at a zero-width match.
.sub will advance by one character after a zero-width match.
can we test if YSH replace()
and YSH split()
are consistent ??
Do we get an infinite loop in replace()
when there is a zero width match ?
This does seem quite fiddly -- maybe we just throw an error in both cases if the match happens to be zero length?
OK actually I tested it, it seem like we enter an infinite loop with replace() on a zero width match
ysh andy@hoover:~/git/oilshell/oil$ = 'a b cd'.replace(/s+/, 'X')
(Str) 'aXbXcd'
ysh andy@hoover:~/git/oilshell/oil$ = 'a b cd'.replace(/s*/, 'X')
I would lean toward making both cases an ERROR, for simplicity
well I guess we can test what Python replace() does too
But this follows the "one way door" vs. "two way door" philosophy
But given that Python wasn't defnied/good up until Python 3.7, it seems safe to make it disallowed ...
Python doesn't enter an infinite loop in re.sub()
-- it seems to do something like split()
--
>>> re.sub(r'\s+', 'X', 'a b cd')
'aXbXcd'
>>> re.sub(r'\s*', 'X', 'a b cd')
'XaXXbXXcXdX'
>>>
But yeah I don't find this behavior interesting/useful -- I think it's better to disallow it
i.e. if there is a match, but the match position is the same, then error in both replace() and split() -- does that make sense?
Another possibility is to return both the replace() or split(), so it returns a value -- i.e. do a partial splitting or replace
But neither Python nor JS do that, so we probably shouldn't
Oh, cool! That's for looking into this further :) I did not know that Python used to error here.
But yeah I don't find this behavior interesting/useful -- I think it's better to disallow it
i.e. if there is a match, but the match position is the same, then error in both replace() and split() -- does that make sense?
Agreed on the behavior not being valuable. But re: raising an error, can we do so before we even run the split/replace algorithm? Ie. is is sufficient to error if the regex accepts the empty string?
The main reason why I ask is that I'd like to keep the error cases simple to explain and faster to be detected. If we only error when we encounter a zero-width match, would ' '.split(/ space* /)
or ' foo'.split(/ space* /, count=1)
"fly under the radar?"
Yeah I can see what you mean
\s*
works but ONLY if you don't hit that case at runtime
Would could be conservative and try to match \s*
against the empty string first. If it matches, then it's not valid.
However that may be slow -- it's a little weird to do it every time. We could try to cache that calculation, assuming the regex is constant. Though that feels complex
BTW I think we should write split()
with the same libc.regex_search()
function as we use in replace()
the regex_first_group_match()
can be used for other things ... or arguably it could be deleted, since seems like a special case of regex_search
Oh I see you added the rule for matching against ''
first
That does seem pretty principled
I am also wondering about the rule of just stopping when there is an infinite loop. And then the docs will call this out as a special case that should be avoided
I guess our usual rule is to follow Python and JS. In this case I don't think they have very good guidance
strip() and replace() should be consistent
So I think the choices are:
''
is not matched - we are kinda "taxing" the common case for an uncommon case - that's my main concernI think (1) is the most principled but I worry about performance, especially because we have an existing perf issue where we repeatedly compile the regex EVERY time you call libc.regex_search()
Melvin had a regex cache PR to solve this problem, but the performance was a bit complex, and it hasn't landed yet
(2) seems OK -- it is well defined
(3) also seems OK but I get the point that it may cause "surprise" errors?? Though IMO a runtime error is probably better than doing something wrong/weird
(4) is a little weird as mentioned
The "split the difference" choice might be to make 2 the default, and opt into 3, or vice versa ... maybe overthinking it though
fwiw the regex cache PR was #1770
the issue is that with mycpp it's hard to pass a compiled regex_t
object from C++ <-> Pythonnn
So we just pass a string and compile it every time, which is a little lame
The cache is a good idea, but it's not 100% straightforward to implement
I've been thinking about this a bit and I still feel that (1) is the best choice for semantics. But yeah, the performance is a good point. And it's not like this is an unsafe API. Unlike eg. array bounds, checking the impact of it's misuse at worst results in a runtime error. Additionally, triggering that error case should be pretty easy with testing. You have to have some very specific forms of inputs to miss the infinite loop case.
Now (2) vs. (3), I agree that a runtime error is best here. If, somehow, someone finds that the behaviour like (2) or (4) or something else is actually desirable, then we leave ourselves the option of implementing it. It's a one-way door situation again.
So I'll go ahead and implement (3) then.
Ah OK cool, yeah it was bubbling in the background for me too ... there's no perfect solution, but I'm happy with (3) :)
This is now ready for another review. Thanks for the help on the design :) It was a little trickier than it seemed. After this PR I'll do something similar for replace()
.
This is ready for another review.
TODO:
One area of inconsistency between nodejs and python is when the splitter regex accepts the empty string as a match. I also just find the behaviour confusing in general.
This is python's behaviour as documented at https://docs.python.org/3/library/re.html#re.split:
Meanwhile for JS https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/Symbol.split#description:
(
lastIndex
is a property referring to where the regex will start it's next execution: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastIndex. It's like what I callcursor
in this PR.)The differences are subtle, but manifest in cases like:
For YSH I decided to go with Python's behaviour since it was both a) simpler and b) better fit our regex semantics. The JS design heavily relies on a "sticky mode" flag which, AFAIK, we don't have.