oils-for-unix / oils

Oils is our upgrade path from bash to a better language and runtime. It's also for Python and JavaScript users who avoid shell!
http://www.oilshell.org/
Other
2.81k stars 154 forks source link

Support split by eggex in Str.split #2051

Closed PossiblyAShrub closed 2 weeks ago

PossiblyAShrub commented 3 weeks ago

TODO:


One area of inconsistency between nodejs and python is when the splitter regex accepts the empty string as a match. I also just find the behaviour confusing in general.

This is python's behaviour as documented at https://docs.python.org/3/library/re.html#re.split:

Empty matches for the pattern split the string only when not adjacent to a previous empty match.

>>> re.split(r'\b', 'Words, words, words.')
['', 'Words', ', ', 'words', ', ', 'words', '.']

>>>re.split(r'\W*', '...words...')
['', '', 'w', 'o', 'r', 'd', 's', '', '']

>>> re.split(r'(\W*)', '...words...')
['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

Meanwhile for JS https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/Symbol.split#description:

If the current match is an empty string, or if the regexp doesn't match at the current position (since it's sticky), the lastIndex would still be advanced — if the regex is Unicode-aware, it would advance by one Unicode code point; otherwise, it advances by one UTF-16 code unit.

(lastIndex is a property referring to where the regex will start it's next execution: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/lastIndex. It's like what I call cursor in this PR.)

The differences are subtle, but manifest in cases like:

$ node -e 'console.log("a b  cd".split(/\s*/))'
[ 'a', 'b', 'c', 'd' ]
$ python3 -c 'import re; print(re.split("\\s*", "a b  cd"))'
['', 'a', '', 'b', '', 'c', 'd', '']

$ node -e 'console.log("aa".split(/.*/))'      
[ '', '' ]
$ python3 -c 'import re; print(re.split(".*", "aa"))'      
['', '', '']

For YSH I decided to go with Python's behaviour since it was both a) simpler and b) better fit our regex semantics. The JS design heavily relies on a "sticky mode" flag which, AFAIK, we don't have.

PossiblyAShrub commented 3 weeks ago

This is ready for a review. I'd like to get some feedback on the design before I write any docs.

andychu commented 3 weeks ago

Hmm yeah these empty string cases are very subtle, and probably why I put this off for SO long!

The POSIX shSplit() is not exactly simple either -- splitting is hard. (Also the algorithm may still have some bugs! )

But thanks for looking deeply into it, let me take a look

andychu commented 3 weeks ago

python 3.6 deprecated a behavior?

$ python3 -c 'import re; print(re.split("\\s*", "a b  cd"))'
/usr/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)
['a', 'b', 'cd']

python 3.11 didn't follow through?

$ python3 -V
Python 3.11.2

$ python3 -c 'import re; print(re.split("\\s*", "a b  cd"))'
['', 'a', '', 'b', '', 'c', 'd', '']

Weird! OK I got nerd sniped by this

andychu commented 3 weeks ago

More digging

https://bugs.python.org/issue43222

There was a bug in the regular expression engine which caused re.split() working incorrectly with zero-width patterns. Note that in your example _DIGIT_BOUNDARY_RE.split("10.0.0") returns ['10.0.0'] on Python 2.7 -- the result which you unlikely expected.

It was impossible to fix that bug without changing behavior of other functions in corner cases and breaking existing code. So we first made re.split() raising an exception instead of returning nonsensical result and added warnings for some other cases to help users to catch potential bugs in their code and avoid ambiguous patterns. You see this in 3.6. In 3.7 we fixed the underlying bug. It caused breakage of some user code, but it made regular expressions more consistent in long perspective and made zero-width patterns more usable.

I want to avoid semantics that we'll have to change later ...

I agree JS semantics are hard to document, which means hard to use


It is possible for us to throw an error on any split expression that matches an empty string?

Is that better / more predictable?

On the other hand, I suppose if we are confident that the current semantic is stable and will never change -- what if we change regex engines?


This actually can be a real issue ... there are significant differences between libc glob() on say OS X and Linux with respect to character classes ...

But I am not sure if that analogy applies here

I suppose if 2 libc's disagree about which regexes match empty string, then that is ORTHOGONAL to our behavior ... we would have a difference either way

But I guess it might not be a silent difference


I guess you have looked into this issue more than me, at this point

We are supposed to be writing an "executalble spec" -- so I guess the Litmus test is if someone is implementing YSH split() in Rust or Zig, without libc regex, without our exact toolchain -- which semantic makes sense?

it's certainly easier to implement without the UTF-8 decoding, which might lean toward an error, but we already have UTF-8 decoding

andychu commented 3 weeks ago

More info - https://pypi.org/project/regex/

Zero-width matches are not handled correctly in the re module before Python 3.7. The behaviour in those earlier versions is:

.split won’t split a string at a zero-width match.

.sub will advance by one character after a zero-width match.

can we test if YSH replace() and YSH split() are consistent ??

Do we get an infinite loop in replace() when there is a zero width match ?


This does seem quite fiddly -- maybe we just throw an error in both cases if the match happens to be zero length?

andychu commented 3 weeks ago

OK actually I tested it, it seem like we enter an infinite loop with replace() on a zero width match

ysh andy@hoover:~/git/oilshell/oil$ = 'a b   cd'.replace(/s+/, 'X')
(Str)   'aXbXcd'

ysh andy@hoover:~/git/oilshell/oil$ = 'a b   cd'.replace(/s*/, 'X')

I would lean toward making both cases an ERROR, for simplicity

well I guess we can test what Python replace() does too

But this follows the "one way door" vs. "two way door" philosophy

But given that Python wasn't defnied/good up until Python 3.7, it seems safe to make it disallowed ...

andychu commented 3 weeks ago

Python doesn't enter an infinite loop in re.sub() -- it seems to do something like split() --

>>> re.sub(r'\s+', 'X', 'a b   cd')
'aXbXcd'

>>> re.sub(r'\s*', 'X', 'a b   cd')
'XaXXbXXcXdX'
>>> 

But yeah I don't find this behavior interesting/useful -- I think it's better to disallow it

i.e. if there is a match, but the match position is the same, then error in both replace() and split() -- does that make sense?

andychu commented 3 weeks ago

Another possibility is to return both the replace() or split(), so it returns a value -- i.e. do a partial splitting or replace

But neither Python nor JS do that, so we probably shouldn't

PossiblyAShrub commented 3 weeks ago

Oh, cool! That's for looking into this further :) I did not know that Python used to error here.

But yeah I don't find this behavior interesting/useful -- I think it's better to disallow it

i.e. if there is a match, but the match position is the same, then error in both replace() and split() -- does that make sense?

Agreed on the behavior not being valuable. But re: raising an error, can we do so before we even run the split/replace algorithm? Ie. is is sufficient to error if the regex accepts the empty string?

The main reason why I ask is that I'd like to keep the error cases simple to explain and faster to be detected. If we only error when we encounter a zero-width match, would ' '.split(/ space* /) or ' foo'.split(/ space* /, count=1) "fly under the radar?"

andychu commented 3 weeks ago

Yeah I can see what you mean

Would could be conservative and try to match \s* against the empty string first. If it matches, then it's not valid.

However that may be slow -- it's a little weird to do it every time. We could try to cache that calculation, assuming the regex is constant. Though that feels complex

andychu commented 3 weeks ago

BTW I think we should write split() with the same libc.regex_search() function as we use in replace()

the regex_first_group_match() can be used for other things ... or arguably it could be deleted, since seems like a special case of regex_search

andychu commented 3 weeks ago

Oh I see you added the rule for matching against '' first

That does seem pretty principled

I am also wondering about the rule of just stopping when there is an infinite loop. And then the docs will call this out as a special case that should be avoided

I guess our usual rule is to follow Python and JS. In this case I don't think they have very good guidance


strip() and replace() should be consistent

So I think the choices are:

  1. check that '' is not matched - we are kinda "taxing" the common case for an uncommon case - that's my main concern
  2. bail early when there is an infinite loop
  3. error when there is an infinite loop
  4. advance by one byte/rune - this behavior is a bit weird to me

I think (1) is the most principled but I worry about performance, especially because we have an existing perf issue where we repeatedly compile the regex EVERY time you call libc.regex_search()

Melvin had a regex cache PR to solve this problem, but the performance was a bit complex, and it hasn't landed yet


(2) seems OK -- it is well defined

(3) also seems OK but I get the point that it may cause "surprise" errors?? Though IMO a runtime error is probably better than doing something wrong/weird

(4) is a little weird as mentioned

The "split the difference" choice might be to make 2 the default, and opt into 3, or vice versa ... maybe overthinking it though

andychu commented 3 weeks ago

fwiw the regex cache PR was #1770

the issue is that with mycpp it's hard to pass a compiled regex_t object from C++ <-> Pythonnn

So we just pass a string and compile it every time, which is a little lame

The cache is a good idea, but it's not 100% straightforward to implement

PossiblyAShrub commented 2 weeks ago

I've been thinking about this a bit and I still feel that (1) is the best choice for semantics. But yeah, the performance is a good point. And it's not like this is an unsafe API. Unlike eg. array bounds, checking the impact of it's misuse at worst results in a runtime error. Additionally, triggering that error case should be pretty easy with testing. You have to have some very specific forms of inputs to miss the infinite loop case.

Now (2) vs. (3), I agree that a runtime error is best here. If, somehow, someone finds that the behaviour like (2) or (4) or something else is actually desirable, then we leave ourselves the option of implementing it. It's a one-way door situation again.

So I'll go ahead and implement (3) then.

andychu commented 2 weeks ago

Ah OK cool, yeah it was bubbling in the background for me too ... there's no perfect solution, but I'm happy with (3) :)

PossiblyAShrub commented 2 weeks ago

This is now ready for another review. Thanks for the help on the design :) It was a little trickier than it seemed. After this PR I'll do something similar for replace().

PossiblyAShrub commented 2 weeks ago

This is ready for another review.