prestodb / presto

The official home of the Presto distributed SQL query engine for big data
http://prestodb.io
Apache License 2.0
16.06k stars 5.38k forks source link

Incorrect results in regexp_like #21124

Open mbasmanova opened 1 year ago

mbasmanova commented 1 year ago

regexp_like behaves strangely. I'd expect 'true' in all the following cases, but on of them returns false. This is using JONI for regex library.

CC: @tdcmeehan @aditi-pandit @amitkdutta @zacw7 @kaikalur

presto:whatsapp_closed> select regexp_like('a.b-c.d.e', '[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$');

 _col0
-------
 true

presto> select regexp_like('@a.b-c.d.e', '@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$');

 _col0
-------
 false

presto> select regexp_like('@a.b-c.d.e', '@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$');

 _col0
-------
 true
tdcmeehan commented 1 year ago

Looks like that second character class [a-zA-Z0-9-.] should have the - be interpreted as a character literal, or an error (since it's a special character when it's not in the first or last position of the character class), but it seems to be ignored entirely.

The last example seems to work intuitively: a possible match is a matches the first character class, b-c.d.e matches the second one (hyphen, a special character, is treated as literal since it's the last character in the character class).

The first example only matches a portion of the string, since that's how the regex is defined. So a possible match is b-c to the first character class, then d.e.

tdcmeehan commented 1 year ago

The same pattern works fine in Java pattern, where the hyphen is interpreted as a literal.

Pattern pattern = Pattern.compile("@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+$");
Matcher matcher = pattern.matcher("@a.b-c.d.e");
matcher.matches(); // Returns true

Based the comments on jruby/joni#14, it seems any example of a difference between Java pattern syntax and Joni should be considered a bug, so I'd consider raising an issue.