tc39 / proposal-regex-escaping

Proposal for investigating RegExp escaping for the ECMAScript standard
http://tc39.es/proposal-regex-escaping/
Creative Commons Zero v1.0 Universal

Results of TC39 presentation #37

Closed domenic closed 3 years ago

domenic commented 9 years ago

I'm sorry to say that the committee declined to accept this proposal as-is. In the end, the concern (largely driven by @erights, although others were sympathetic) was that escaping cannot be done in a way that is totally safe in all cases, even with the extended safe set. For example,

new RegExp("\\" + RegExp.escape("w"))

is a hazard. (It does not matter that "\\" by itself is not a valid regex fragment. The above does not error to indicate that; it just silently creates a bug.)

Note that even if you attempted to correct this by escaping all initial characters, you then have

new RegExp("\\\\" + RegExp.escape("w"))

as a bug. @erights called this the "even-odd problem."
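To make the two cases concrete, here is a minimal sketch. `escapeRegExp` is a hypothetical stand-in for the proposed `RegExp.escape` (an assumption: it escapes only regex syntax characters, as common userland helpers do, so it leaves "w" untouched). The number of backslashes in front of the escaped text decides whether "w" stays a literal.

```javascript
// Hypothetical stand-in for the proposed RegExp.escape (an assumption, not the
// proposal's spec text): backslash-escapes regex syntax characters only.
function escapeRegExp(text) {
  return String(text).replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
}

const input = "w"; // pretend this came from a user

// One preceding backslash: it pairs with "w" and forms the class escape \w.
const oddCase = new RegExp("\\" + escapeRegExp(input));
console.log(oddCase.source);       // \w
console.log(oddCase.test("a"));    // true: any word character matches, not just "w"

// Two preceding backslashes: they pair with each other, so "w" is a literal again.
const evenCase = new RegExp("\\\\" + escapeRegExp(input));
console.log(evenCase.source);      // \\w
console.log(evenCase.test("a"));   // false
console.log(evenCase.test("\\w")); // true: a backslash followed by "w"
```

Neither constructor call throws; the pattern simply means something different depending on the parity of the backslashes.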

The general feeling was that to be completely safe you need a context-dependent join operation. The feeling was then that if author code wants to do unsafe escaping, the function is easy to write, but if something is going to be standardized, it must be completely safe. The idea that other languages are not held to this standard did not convince them, I'm sorry to say.

The committee recognized that you might not be willing to do work on a different, more complicated proposal. But, if you were interested, they think that a template string tag solution would be the best to investigate. Such a solution would be able to use the context of each insertion to do a safe join operation between any two fragments, instead of depending on string concatenation. Template strings can also be twisted to work in dynamic situations (e.g. those that this proposal would cover via new RegExp(pieces.map(RegExp.escape).join("/"))) by directly calling the tag function, probably with an adapter to work through the awkwardness of the parameter form for template tags. So this would be strictly more powerful. This was also preferred (for reasons I don't really remember) to a building-block approach of e.g. RegExp.concat plus RegExp.escape (used as RegExp.concat("\\", RegExp.escape(x))).
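As a rough illustration of what a template-tag approach might look like, the sketch below escapes every interpolated value and refuses to join after a dangling backslash. The names `safeRegExp` and `escapeRegExp` are hypothetical, and this is not the design the committee had in mind; it checks only one context (trailing backslashes) and ignores others such as character classes and quantifier braces.

```javascript
// Hypothetical helper (assumption): escapes regex syntax characters only.
function escapeRegExp(text) {
  return String(text).replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
}

// Hypothetical tag: raw template fragments are trusted pattern source,
// interpolated values are always treated as literal text.
function safeRegExp(strings, ...values) {
  let source = "";
  strings.raw.forEach((fragment, i) => {
    source += fragment;
    if (i < values.length) {
      // An odd run of trailing backslashes would swallow the first character
      // of the escaped value. In a literal template this cannot occur before
      // ${ (\$ escapes the dollar sign), so the check matters mainly when the
      // tag is called directly.
      const trailing = /\\*$/.exec(fragment)[0].length;
      if (trailing % 2 === 1) {
        throw new SyntaxError("fragment ends with a dangling backslash");
      }
      source += escapeRegExp(values[i]);
    }
  });
  return new RegExp(source);
}

// Static use: the value "a.b" is matched literally, \d+ stays a pattern.
const versionPattern = safeRegExp`^${"a.b"}\d+$`;
console.log(versionPattern.test("a.b42")); // true
console.log(versionPattern.test("axb42")); // false

// Dynamic use: call the tag directly with an object exposing `raw`.
let joinError = null;
try {
  safeRegExp({ raw: ["\\", ""] }, "w");
} catch (err) {
  joinError = err;
}
console.log(joinError instanceof SyntaxError); // true
```

The direct call at the end stands in for the "adapter" idea mentioned above: the fragments arrive as an array rather than from a literal template, and the join is rejected instead of silently producing `\w`.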

I'm pretty disappointed by this, and am sorry you and others sank so much work into it with such an outcome. But what can we do?

l-cornelius-dol commented 3 years ago

@benjamingr : My apologies -- I thought I was on a TC39 thread here. I've been multitasking too many things this morning. For whatever it's worth, I strongly support your proposal, and it's unfathomable to me how it was rejected.

coolaj86 commented 2 years ago

I fundamentally don't understand the issue that's being raised by "the even-odd problem":

new RegExp("\\" + RegExp.escape("w"))

// same as
new RegExp("\\w")

// same as
/\w/
/\w/.test("a") // true, as expected

Okay, that makes sense. If I only partially escape my input... why should I expect the input to be safe?

I should instead do:

new RegExp(RegExp.escape("\\") + RegExp.escape("w"))

// same as
new RegExp("\\\\w")

// same as
/\\w/
/\\w/.test("a") // false, as expected
/^\\w$/.test("\\w") // true, as expected

To me this seems perfectly intuitive - I can't ever use an odd number of \ because it's an escape character that must be escaped.

Maybe "intuitive" is a strong word... "rational", let's say.

What is the "problem" part of "the even-odd problem"?

Is it that some developers don't know that you have to escape \ in a string and therefore don't know that you also have to escape the escaped \ in a RegExp?

Is it that you must always have an even number of \? (That's just how escaping works, right... 🤷‍♂️)

If you have an odd number of \ then one of:

What about this behavior is unexpected or unsafe?

I don't get it.

It seems to behave in a perfectly predictable way that anyone familiar with basic strings should be able to reason about.

Can someone explain what the problem is?

msikma commented 2 years ago

I don't get it. It seems to behave in a perfectly predictable way that anyone familiar with basic strings should be able to reason about. Can someone explain what the problem is?

There is no problem. I'm afraid there is no nice way of saying this, and I hate to be this confrontational, but this is entirely a consequence of @erights fundamentally misunderstanding what escape functions are, and sticking to his guns even after multiple people have tried to explain it to him.

A regex escape function is trivial to create, tons of languages have them, and there's a common npm package and even Stack Overflow code snippet that will do the job. That npm package has 67 million weekly downloads. All of these work perfectly fine and are used daily without a problem—the Python docs don't even mention the supposed deal-breaking issue brought up here because it's perfectly obvious. As you pointed out, escape functions necessarily have a scope in which they are expected to work, and if you use them improperly they'll fail in predictable and documentable ways. This is a categorical truth about any kind of escape function.
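For reference, a userland escape helper of the kind described here is typically a one-line `replace`. The sketch below is written in the spirit of the widely shared snippet; the name `escapeRegExp` and the exact character set are assumptions, not the code of any particular package.

```javascript
// Hypothetical helper (assumption): escapes the characters that have special
// meaning in a regular expression by prefixing each with a backslash.
function escapeRegExp(text) {
  return String(text).replace(/[\\^$.*+?()[\]{}|]/g, "\\$&");
}

// Intended scope: the escaped text is used as a whole pattern (or joined with
// fragments that do not end in an escape character).
const needle = "1+1=2 (really?)";
const literalMatcher = new RegExp(escapeRegExp(needle));

console.log(escapeRegExp(needle));                          // 1\+1=2 \(really\?\)
console.log(literalMatcher.test("proof: 1+1=2 (really?)")); // true
console.log(new RegExp(escapeRegExp("a.c")).test("abc"));   // false: "." is literal
```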

By the same logic we can say that there should be no encodeURI() since you can do decodeURI('%' + encodeURI('%')) and that's invalid. No one should be "surprised" by this. If you surround escaped output by characters that are escapable or can escape others, like slashes in this case, you are misusing the function.
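The `encodeURI` comparison can be checked directly. This short sketch uses only the built-in `encodeURI` and `decodeURI`; the variable names are illustrative.

```javascript
// encodeURI escapes a lone percent sign, and decodeURI reverses it.
const encodedPercent = encodeURI("%");
console.log(encodedPercent);            // %25
console.log(decodeURI(encodedPercent)); // %

// Prepending a raw "%" to the encoded output produces "%%25",
// which is not a valid escape sequence for decodeURI.
let decodeError = null;
try {
  decodeURI("%" + encodedPercent);
} catch (err) {
  decodeError = err;
}
console.log(decodeError instanceof URIError); // true
```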

Then after all of this was kindly pointed out by people who actually understand how this works, the excuse became that we need to look after the "average programmer" and restrict ourselves to just making a tagged template because it makes it harder to screw up. Even though it's obviously still trivial to misuse and it's clearly not the favored interface for a function like this, as evidenced by the myriad escape function libraries that are not tagged templates (can anyone even find one in the wild that is?)

Everybody here is being extremely charitable and walking on eggshells out of fear of upsetting anyone from the committee, who are clearly ready to torpedo a good proposal and set us all back multiple years (this issue was opened 7 years ago) if they happen to feel like it on that day. But there actually just is no technical explanation for why this still isn't even on track to become part of the standard.

The real explanation is that the committee has colossally failed in its duty due to a lack of technical understanding (and probably so due to one single individual), and this whole proposal is now a good example of how you should never assume that TC39 committee members know what they're talking about or are amenable to technical arguments.

benjamingr commented 2 years ago

@msikma this is currently blocked on work, not on Mark, if we're honest. That said, he is still very much against RegExp.escape and I'm opposed to a template-tag version that is a lot less ergonomic and frankly what users are asking for.

Honestly, the (time and emotional) cost of standardization is such that I might end up just implementing this in Node and asking browsers to add it as a WHATWG API, and we'll circumvent TC39?

ljharb commented 2 years ago

@benjamingr please don’t do that; that is a much worse outcome than never having it at all.

ljharb commented 2 years ago

@msikma it is inaccurate that it’s one individual; there are multiple people on the committee who feel the same as Mark does.

Please avoid personal attacks; that’s not in keeping with our CoC.

benjamingr commented 2 years ago

@benjamingr please don’t do that; that is a much worse outcome than never having it at all.

I feel like if I go through here I will spend another 5 years waiting and the API people are actually asking for (.escape) won't happen. I fail, I fail the users who have been rooting for me, I fail the ecosystem.

On the other hand if I make a PR to Node and coordinate with WHATWG I get the API I've been asking for since 2015 in about a week. The interaction with WHATWG has been significantly easier than TC39, browsers would (most likely) also want this and Node sure does.

So please help me understand why I would rather go through TC39 for something that is borderline platform and not language anyway (in other languages and ecosystems it's half-and-half)?

I like you and (full honesty) I like Mark. Both of you have helped me in the past with a bunch of things when I needed help both in private and in public. What I don't like is the process and how skewed it is towards concerns that our users in Node frankly don't care about as you can see from this (and the other repo issues).

Please give me a reason to trust the committee on this or faith this will progress at some point.

ljharb commented 2 years ago

The presentation I made that reactivated the proposal at stage 1 is the entire committee giving consensus to reconsider it, which wouldn't have happened if anyone's position remained immovable. I'm sorry I haven't made the time to do more work on it since then, but I can assure you that, at least, I am confident it has a real shot at advancement.

benjamingr commented 2 years ago

@ljharb ok, let me know how it goes