tc39 / proposal-regex-escaping

Proposal for investigating RegExp escaping for the ECMAScript standard

http://tc39.es/proposal-regex-escaping/

Creative Commons Zero v1.0 Universal

Why not just escape every character? #15

Closed (domenic closed this issue 9 years ago)

domenic commented 9 years ago

Is there any reason to only escape a specific subset? It's harmless to add backslashes, right?

benjamingr commented 9 years ago

It makes the resulting string longer; other than that, it's harmless.

Some programming languages do this (Python escapes all non-alphanumeric characters), whereas others escape only a fixed set (like C#).
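
To make the two strategies concrete, here is a rough JavaScript sketch. Both function names are hypothetical, and neither is the proposal's algorithm: the first escapes every character outside ASCII letters and digits, the second escapes only an assumed fixed set of syntax characters.

```javascript
// Hypothetical helper: backslash-escape every character that is not an
// ASCII letter or digit (the "escape everything" strategy).
function escapeAllNonAlnum(s) {
  return s.replace(/[^A-Za-z0-9]/g, "\\$&");
}

// Hypothetical helper: backslash-escape only a fixed set of regex syntax
// characters (the "strict set" strategy). The set here is an assumption.
function escapeSyntaxOnly(s) {
  return s.replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
}

const input = "a.b c";
console.log(escapeAllNonAlnum(input)); // a\.b\ c
console.log(escapeSyntaxOnly(input));  // a\.b c
```

Both outputs match the original input literally when used as a pattern without the `u` flag; the first is simply longer because the space is escaped too.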

domenic commented 9 years ago

Might be worth mentioning this as a design alternative in the readme, with the pro that it's more future-proof.

benjamingr commented 9 years ago

Good idea, I'll add that when I'm in front of a computer :) (you're welcome to if you'd like of course).

benjamingr commented 9 years ago

Updated the README. I'll leave this open for a week to see if anyone has any further input on it.

benjamingr commented 9 years ago

Following the research (https://github.com/benjamingr/RegExp.escape/blob/master/data/other_languages/discussions.md), it appears that other languages that used to escape every character have either made exceptions (like Python) or changed their behavior (like Perl). The discussion notes contain links to posts explaining why the changes were made.

mjpieters commented 9 years ago

Python's new regex engine (under development) gives you a choice: either escape all non-alphanumerics, or escape only metacharacters (and NUL). See https://bitbucket.org/mrabarnett/mrab-regex/src/6193ea4246da272cf18a190c46aa116737067780/regex_3/Python/regex.py?at=default#cl-342

In your discussion you mentioned a problem with wide characters. You ran into the limitations of Python's re module with UCS-2 vs. UCS-4 builds (all Python versions up to 3.2 use one or the other, based on a compile-time switch). The regular expression engine handles code units rather than code points, which in a UCS-2 build means two code units per non-BMP character. The escaping is correct for each respective build.
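
The same code unit vs. code point distinction can be reproduced in JavaScript, whose strings are sequences of UTF-16 code units. The sketch below is only an analogue of the Python behavior described above, not a reproduction of it; both escaper names are hypothetical.

```javascript
// One non-BMP character (U+1F600), stored as a surrogate pair.
const s = "\u{1F600}";

console.log(s.length);      // 2 (UTF-16 code units)
console.log([...s].length); // 1 (code points)

// Hypothetical escaper working per code unit (no `u` flag):
// each half of the surrogate pair gets its own backslash.
function escapePerCodeUnit(str) {
  return str.replace(/[^A-Za-z0-9]/g, "\\$&");
}

// Hypothetical escaper working per code point (`u` flag):
// the whole character gets a single backslash.
function escapePerCodePoint(str) {
  return str.replace(/[^A-Za-z0-9]/gu, "\\$&");
}

console.log(escapePerCodeUnit(s).length);  // 4
console.log(escapePerCodePoint(s).length); // 3
```

The per-code-unit result splits the character in two, which is consistent for an engine that also matches per code unit but surprising for one that matches per code point.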

benjamingr commented 9 years ago

I think we're good with not escaping every character. I want to focus on the discussion about big set vs readable set.