Suggestion: Regex-validated string type

microsoft / TypeScript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

https://www.typescriptlang.org

Apache License 2.0

Suggestion: Regex-validated string type #6579

Closed tenry92 closed 3 years ago

tenry92 commented 8 years ago

There are cases where a property cannot be just any string (or one of a set of strings), but needs to match a pattern.

let fontStyle: 'normal' | 'italic' = 'normal'; // already available in master
let fontColor: /^#([0-9a-f]{3}|[0-9a-f]{6})$/i = '#000'; // my suggestion

It's common practice in JavaScript to store color values in CSS notation, such as in the CSS style reflection of DOM nodes or in various third-party libraries.

What do you think?

amir-arad commented 5 years ago

@m93a my suggestion was for type annotations, not javascript.

@Igmat from the top of my head, how about:

interface RegExp {
    test(stringToTest: string): stringToTest is string<this>;
}

Igmat commented 5 years ago

@amir-arad, sorry, I can't add more valuable details to your suggestion, but at first glance it looks like a very significant change to the whole TS compiler, because string is a very basic primitive.

Even though I don't see any obvious problems, I think such a proposal should be much more detailed and cover a lot of existing scenarios, plus give a proper justification of its purpose. Your proposal adds one type and changes one primitive type, while mine only adds one type.

Unfortunately, I'm not ready to dedicate a lot of time to creating a proposal for such a feature (also, you may have noticed that not every proposal in TS has been implemented without significant delay), but if you work on this, I'll be glad to provide you with my feedback if needed.

reverofevil commented 5 years ago

If these regexp-types were real regular expressions (not Perl-like regular expressions that are not regular) we could translate them to deterministic FSM and use cartesian product construction on those to get all the conjunctions and disjunctions. Regular expressions are closed under boolean operations.

Also, if string literal types were not atomic, but represented as compile-time character lists, it would allow all the operators to be implemented in libraries. That would only worsen performance a bit.
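For illustration, the cartesian product construction mentioned above can be sketched as follows. This is a minimal sketch, not anything from the compiler: the `Dfa` shape and the helpers `accepts` and `product` are hypothetical names, and both automata are assumed to be complete and to share one alphabet.

```typescript
// Hypothetical representation of a complete DFA (every state has a
// transition for every symbol of the alphabet).
interface Dfa {
  alphabet: readonly string[];
  start: number;
  accepting: readonly boolean[]; // accepting[state]
  transitions: readonly (readonly number[])[]; // transitions[state][symbolIndex]
}

function accepts(dfa: Dfa, input: string): boolean {
  let state = dfa.start;
  for (const ch of input) {
    const symbol = dfa.alphabet.indexOf(ch);
    if (symbol < 0) return false; // character outside the alphabet
    state = dfa.transitions[state][symbol];
  }
  return dfa.accepting[state];
}

// Product construction: run both automata in lockstep.
// The pair of states (p, q) is encoded as the single number p * nB + q.
// Assumes a.alphabet and b.alphabet are identical.
function product(
  a: Dfa,
  b: Dfa,
  combine: (inA: boolean, inB: boolean) => boolean
): Dfa {
  const nB = b.transitions.length;
  const transitions: number[][] = [];
  const accepting: boolean[] = [];
  for (let p = 0; p < a.transitions.length; p++) {
    for (let q = 0; q < nB; q++) {
      const id = p * nB + q;
      transitions[id] = a.alphabet.map(
        (_, s) => a.transitions[p][s] * nB + b.transitions[q][s]
      );
      accepting[id] = combine(a.accepting[p], b.accepting[q]);
    }
  }
  return { alphabet: a.alphabet, start: a.start * nB + b.start, accepting, transitions };
}

// /^a[ab]*$/ : state 0 = start, 1 = accepted, 2 = dead
const startsWithA: Dfa = {
  alphabet: ["a", "b"],
  start: 0,
  accepting: [false, true, false],
  transitions: [[1, 2], [1, 1], [2, 2]],
};

// /^[ab]*b$/ : state 1 means "the last character was b"
const endsWithB: Dfa = {
  alphabet: ["a", "b"],
  start: 0,
  accepting: [false, true],
  transitions: [[0, 1], [0, 1]],
};

const both = product(startsWithA, endsWithB, (x, y) => x && y); // conjunction
const either = product(startsWithA, endsWithB, (x, y) => x || y); // disjunction
```

The same `product` call yields the intersection or the union depending only on how the two acceptance flags are combined, which is why closure under boolean operations comes almost for free once a DFA is available.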

dead-claudia commented 5 years ago

Edit: Fix a mistake.

Dropping in to note that Mithril could really use these, and being type-safe in the general case is nearly impossible without it. This is the case both with hyperscript and JSX syntax. (We support both.)

Our lifecycle hooks, oninit, oncreate, onbeforeupdate, onupdate, onbeforeremove, and onremove, have their own special prototypes.
Event handlers on DOM vnodes are literally anything else that starts with on, and we support both event listener functions and event listener objects (with handleEvent methods), aligning with addEventListener and removeEventListener.
We support keys and refs as appropriate.
Everything else is treated as an attribute or property, depending on their existence on the backing DOM node itself.

So with a regex-validated string type + type negation, we could do the following for DOM vnodes:

interface BaseAttributes {
    // Lifecycle attributes
    oninit(vnode: Vnode<this, Vnode<Attributes, []>>): void;
    oncreate(vnode: Vnode<this, Vnode<Attributes, []>>): void;
    onbeforeupdate(
        vnode: Vnode<this, Vnode<Attributes, []>>,
        old: Vnode<this, Vnode<Attributes, []>>
    ): void;
    onupdate(vnode: Vnode<this, Vnode<Attributes, []>>): void;
    onbeforeremove(vnode: Vnode<this, Vnode<Attributes, []>>): void | Promise<void>;
    onremove(vnode: Vnode<this, Vnode<Attributes, []>>): void;

    // Control attributes
    key: PropertyKey;
}

interface DOMAttributes extends BaseAttributes {
    // Event handlers
    [key: /^on/ & not keyof BaseAttributes]: (
        ((this: Element, ev: Event) => void | boolean) |
        {handleEvent(ev: Event): void}
    );

    // Other attributes
    [key: keyof HTMLElement & not keyof BaseAttributes & not /^on/]: any;
    [key: string & not keyof BaseAttributes & not /^on/]: string;
}

interface ComponentAttributes extends BaseAttributes {
    // Nothing else interesting unless components define them.
}

(It'd also be nice to be able to extract groups from such regexes, but I'm not going to hold my breath on that.)

dead-claudia commented 5 years ago

Edit: Clarify a few critical details in the proposal. Edit 2: Correct the technical bit to actually be mathematically accurate. Edit 3: Add support for generic starring of single-character unions

Here's a concrete proposal to attempt to solve this much more feasibly: template literal types.

Also, I feel full regexps are probably not a good idea, because it should be reasonably easy to merge with other types. Maybe this might be better: template literal types.

`value` - This is literally equivalent to "value"
`value${"a" | "b"}` - This is literally equivalent to "valuea" | "valueb"
`value${string}` - This is functionally equivalent to the regexp /^value/, but "value", "valuea", and "valueakjsfbf aflksfief fskdf d" are all assignable to it.
`foo${string}bar` - This is functionally equivalent to the regexp /^foo.*bar$/, but is a little easier to normalize.
There can, of course, be multiple interpolations. `foo${string}bar${string}baz` is a valid template literal type.
Interpolations must extend string, and it must not be recursive. (The second condition is for technical reasons.)
A template literal type A is assignable to a template literal type B if and only if the set of strings assignable to A is a subset of the set of strings assignable to B.
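For comparison, TypeScript 4.1 later shipped a feature that is also called template literal types and that covers the first few bullets of this list (it has no starof). The sketch below assumes TypeScript 4.1 or newer; the alias names and the `isFooBar` helper are made up for illustration.

```typescript
// Finite interpolations expand to a union of string literals.
type Suffixed = `value${"a" | "b"}`; // "valuea" | "valueb"

// `${string}` stands for any (possibly empty) run of characters.
type ValuePrefixed = `value${string}`;
type FooBar = `foo${string}bar`;

const a: Suffixed = "valuea";
const b: ValuePrefixed = "valueakjsfbf aflksfief fskdf d";
// Every string in Suffixed is also in ValuePrefixed, so this is assignable.
const c: ValuePrefixed = a;

// Types are erased, so a non-literal string still needs a runtime check.
// The regexp mirrors the /^foo.*bar$/ reading, but also allows newlines.
function isFooBar(s: string): s is FooBar {
  return /^foo[\s\S]*bar$/.test(s);
}
```

A string literal that does not fit the pattern, such as assigning "valuec" to Suffixed, is rejected at compile time.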

In addition to the above, a special starof T type would exist, where T must consist of only single-character string literal types. string would exist as a type alias of starof (...), where ... is the union of all single UCS-2 character string literals from U+0000 to U+FFFF, including lone surrogates. This lets you define the full grammar for ES base-10 numeric literals, for instance:

type DecimalDigit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
type Decimal = `${DecimalDigit}${starof DecimalDigit}`

type Numeric = `${(
    | Decimal
    | `${Decimal}.${starof DecimalDigit}`
    | `.${Decimal}`
)}${"" | (
    | `E${Decimal}` | `E+${Decimal}` | `E-${Decimal}`
    | `e${Decimal}` | `e+${Decimal}` | `e-${Decimal}`
)}`

And likewise, certain built-in methods can be adjusted to return such types:

Number.prototype.toString(base?) - This can return the above Numeric type or some variant of it for statically-known bases.
+x, x | 0, parseInt(x), and similar - When x is known to be a Numeric as defined above, the resulting type can be inferred appropriately as a literal number type.

And finally, you can extract matched groups like so: Key extends `on${infer EventName}` ? EventTypeMap[TagName][EventName] : never. Template extraction assumes it's always working with full names, so you have to explicitly use ${string} interpolations to search for arbitrary inclusion. This is non-greedy, so "foo.bar.baz" extends `${infer T}.${infer U}` ? [T, U] : never returns ["foo", "bar.baz"], not ["foo.bar", "baz"].
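The extraction described here matches how `infer` inside template literal types behaves in TypeScript 4.1 and newer, including the non-greedy split. A minimal sketch follows; `SplitFirst`, `EventNameOf`, and `splitFirst` are hypothetical names used only for this example.

```typescript
// Type-level extraction: the first placeholder takes the shortest match.
type SplitFirst<S extends string> =
  S extends `${infer Head}.${infer Tail}` ? [Head, Tail] : never;

// Checked by the compiler: the pair must be ["foo", "bar.baz"].
const parts: SplitFirst<"foo.bar.baz"> = ["foo", "bar.baz"];

// Stripping a fixed prefix, as in the `on${infer EventName}` example.
type EventNameOf<K extends string> = K extends `on${infer E}` ? E : never;
const eventName: EventNameOf<"onclick"> = "click";

// Runtime counterpart of SplitFirst: split at the first dot only.
function splitFirst(s: string): [string, string] | undefined {
  const i = s.indexOf(".");
  return i < 0 ? undefined : [s.slice(0, i), s.slice(i + 1)];
}
```

Swapping the tuple to ["foo.bar", "baz"] makes the first declaration fail to compile, which is a quick way to confirm the non-greedy behavior.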

From a technical standpoint, this is a lot more feasible to implement than raw regexps. JS regexps aren't even regular - they become context-sensitive with back-references, and they involve a lot of complexity in the form of As long as you block recursion with these, template literal types generate a single regular language each, one that aligns very closely with the underlying theory (but supports only a subset of it).

Empty string (ε): ""
Union: "a" | "b"
Concatenation: `${a}${b}`
Kleene star (partial): starof T (T can only contain single characters and unions.)

This may make string subtyping checking a subset of the subgraph isomorphism problem worst case scenario, but there are a few big redeeming factors here:

The common case by far is unions of small finite strings, something you can model with trees. This is relatively obvious to work with. (I don't recommend trying to join them as ropes, since that will complicate the above matching algorithm, but it's perfectly fine to normalize single-character unions and similar into a single split + join.)
You can model the entire unified type as a directed graph, where:
1. Starred unions of such characters are subgraphs where the parent node has edges both to each character and each child nodes of the subgraph, and each character has edges to both all other characters and all child nodes of the subgraph.
2. The rest of the graph holds a directed tree-like structure representing all other possibilities.
According to this Math.SE chat I was briefly in (starting approximately here), I found that this resulting graph would have both a bounded genus (i.e. with a finite number of jumps over other edges*) and, absent any starof types, a bounded degree. This means type equality reduces to a polynomial-time problem, and assuming you normalize unions, it's also not super slow, as it's only somewhat slower than tree equality. I strongly suspect the general case for this entire proposal (a subset of the subgraph isomorphism problem) is also polynomial-time with reasonable coefficients. (The Wikipedia article linked above has some examples in the "Algorithms" and references sections where special casing might apply.)
None of these keys are likely to be large, so most of the actual runtime cost here is amortized in practice by other things. As long as it's fast for small keys, it's good enough.
All subgraphs that would be compared share at least one node: the root node. (This represents the start of the string.) So this would dramatically reduce the problem space just on its own and guarantee a polynomial time check.

And of course, intersection between such types is non-trivial, but I feel similar redeeming factors exist simply due to the above restrictions. In particular, the last restriction makes it obviously polynomial-time to do.

* Mathematically, genus is defined a bit counterintuitively for us programmers (the minimum number of holes you need to poke in a surface to draw the graph without any jumps), but a bounded genus (limited number of holes) implies a limited number of jumps.

dead-claudia commented 5 years ago

Using this concrete proposal, here's how my example from this comment translates:

// This would work as a *full* type implementation mod implementations of `HTMLTypeMap` +
// `HTMLEventMap`
type BaseAttributes = {
    // Lifecycle attributes
    oninit(vnode: Vnode<this, Vnode<Attributes, []>>): void;
    oncreate(vnode: Vnode<this, Vnode<Attributes, []>>): void;
    onbeforeupdate(
        vnode: Vnode<this, Vnode<Attributes, []>>,
        old: Vnode<this, Vnode<Attributes, []>>
    ): void;
    onupdate(vnode: Vnode<this, Vnode<Attributes, []>>): void;
    onbeforeremove(vnode: Vnode<this, Vnode<Attributes, []>>): void | Promise<void>;
    onremove(vnode: Vnode<this, Vnode<Attributes, []>>): void;

    // Control attributes
    key: PropertyKey;
}

interface HTMLTypeMap {
    // ...
}

interface HTMLEventMap {
    // ...
}

// Just asserting a simple constraint
type _Assert<T extends true> = never;
type _Test0 = _Assert<
    keyof HTMLTypeMap[keyof HTMLTypeMap] extends `on${string}` ? false : true
>;

type EventHandler<Event> =
    ((this: Element, ev: Event) => void | boolean) |
    {handleEvent(ev: Event): void};

type Optional<T> = {[P in keyof T]?: T[P] | null | undefined | void}

type DOMAttributes<T extends keyof HTMLTypeMap> = Optional<(
    & BaseAttributes
    & {[K in `on${keyof HTMLEventMap[T]}` & not keyof BaseAttributes]: EventHandler<(
        K extends `on${infer E}` ? HTMLEventMap[E] : never
    )>}
    & Record<
        keyof `on${string & not keyof HTMLEventMap}` & not keyof BaseAttributes,
        EventHandler<Event>
    >
    & Pick<HTMLTypeMap[T], (
        & keyof HTMLTypeMap[T]
        & not `on${string}`
        & not keyof BaseAttributes
    )>
    & Record<(
        & string
        & not keyof HTMLTypeMap[T]
        & not keyof BaseAttributes
        & not `on${string}`
    ), string | boolean>
)>;

Edit: This would also enable properly typing 90% of Lodash's _.get method and related methods using its property shorthand, like its _.property(path) method and its _.map(coll, path) shorthand. There's probably several others I'm not thinking of, too, but that's probably the biggest one I can think of. (I'm going to leave the implementation of that type as an exercise to the reader, but I can assure you it's possible with a combination of that and the usual trick of conditional types with an immediately-indexed record, something like {0: ..., 1: ...}[Path extends "" ? 0 : 1], to process the static path string.)

ozyman42 commented 5 years ago

My recommendation is that we focus our efforts on implementing type providers, which could be used to implement regex types.

Why type providers instead of directly implementing regex types? Because

It’s a more generic solution that adds many new possibilities to TypeScript making it easier to get support from a wider group of developers beyond those who see the value in regex string types.
The typescript repo owners seem to be open to this idea, and are waiting for the right proposal. See #3136

F# has an open source regex type provider.

Some info on type providers: https://link.medium.com/0wS7vgaDQV

One could imagine that once type providers are implemented and the regex type provider is implemented as an open source library, one would use it like so:

type PhoneNumber = RegexProvider</^\d{3}-\d{3}-\d{4}$/>
const acceptableNumber: PhoneNumber = "123-456-7890"; //  no compiler error
const unacceptableNumber: PhoneNumber = "hello world"; // compiler error

dead-claudia commented 5 years ago

@AlexLeung I'm not convinced that's the correct way to go, at least not for this request.

TypeScript is structurally typed, not nominally typed, and for string literal manipulation, I want to retain that structural spirit. Type providers like that would create a nominal string subtype where RegexProvider</^foo$/> would not be treated as equivalent to "foo", but a nominal subtype of it. Furthermore, RegexProvider</^foo$/> and RegexProvider</^fo{2}$/> would be treated as two distinct types, and that's something I'm not a fan of. My proposal instead directly integrates with strings at their core, directly informed by the theory of formal language recognition to ensure it fits in naturally.
With mine, you can not only concatenate strings, but extract parts of strings via Key extends `on${infer K}` ? K : never or even Key extends `${Prefix}${infer Rest}` ? Rest : never. Type providers do not offer this functionality, and there's no clear way how it should if such functionality were to be added.
Mine is considerably simpler at the conceptual level: I'm just suggesting we add string concatenation types and, for the RHS of conditional types, the ability to extract its inverse. I also propose that it integrate with string itself to take the place of a regexp /.*/. It requires no API changes, and aside from the two theoretically complex parts that are mostly decoupled from the rest of the code base, calculating whether a template literal type is assignable to another and extracting a slice from a string, is similar, if not simpler, to implement.

BTW, my proposal could still type that PhoneNumber example, too. It's a bit more verbose, but I'm trying to model data that's already in TS land, not data that exists elsewhere (what F#'s type providers are most useful for). (It's worth noting this would technically expand to the full list of possible phone numbers here.)

type D = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
type PhoneNumber = `${D}${D}${D}-${D}${D}${D}-${D}${D}${D}${D}`;

ozyman42 commented 5 years ago

RegexProvider</^foo$/> and RegexProvider</^fo{2}$/> would be treated as two distinct types

Type providers could require the implementation of some equals or compare method, so that the type provider author of a regex type provider could define that both cases above are equivalent types. The type provider author could implement structural or nominal typing as they please.

Perhaps it would be possible to implement your string literal type as a type provider as well. I don't think the syntax could be the same, but you could get close with a type provider which takes in a variable number of arguments.

type D = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
type PhoneNumber = StringTemplateMatcherProvider<D, D, D, "-", D, D, D, "-", D, D, D, D>;

dead-claudia commented 5 years ago

@AlexLeung But is the type "123-456-7890" assignable to your type? (If so, that'll complicate implementation and slow down the checker a lot.)

jhpratt commented 5 years ago

Semi-related to the discussion at hand, what if the type isn't of a fixed length (like a phone number)? One situation where I would've liked to use this recently is for storing a room name, of the format thread_{number}.

The regex to match such a value is thread_[1-9]\d*. With what is being proposed, it doesn't seem feasible (or even possible) to match such a format. The numerical part of the value could be any length greater than zero in this situation.

dead-claudia commented 5 years ago

@jhpratt I revised my proposal to accommodate that, in the form of starof ("0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9") ↔ /^\d*$/, since it only required a small change to it. It optimizes the same way string optimizes as /^[\u0000-\uFFFF]*$/, so I decided to go ahead and generalize that.

I don't want to extend starof further than that, like accepting arbitrary non-recursive unions, due to computational complexity concerns: verifying if two arbitrary regular expressions* are equivalent can be done in polynomial space or polynomial time (convert both to minimal DFA and compare - the usual way, but very slow in practice), but both ways are very slow in practice and AFAICT you can't have it both ways. Add support for squaring (like a{2}), and it's basically infeasible (exponential complexity). This is only for equivalence, and checking if a regexp matches a subset of the strings another regexp matches, required for checking assignability, is obviously going to be even more complicated.

* Regular expressions in the math sense: I'm only including single characters, (), (ab), (a|b), and (a*), where a and b are (potentially different) each members of this list.
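To make the assignability side concrete: once both patterns are available as complete DFAs over the same alphabet, checking that one matches a subset of the other is a reachability search over pairs of states. The sketch below assumes exactly that starting point and says nothing about the cost of building the DFAs, which can be exponential in the size of the expression. The `Dfa` shape and `isSubsetLanguage` are hypothetical names.

```typescript
// Hypothetical complete DFA; symbols are numbered 0..alphabetSize-1.
interface Dfa {
  alphabetSize: number;
  start: number;
  accepting: readonly boolean[]; // accepting[state]
  transitions: readonly (readonly number[])[]; // transitions[state][symbol]
}

// L(a) is a subset of L(b) iff no reachable pair (p, q) has
// p accepting in `a` while q is rejecting in `b`.
function isSubsetLanguage(a: Dfa, b: Dfa): boolean {
  const nB = b.transitions.length;
  const seen: boolean[] = [];
  const queue: Array<[number, number]> = [[a.start, b.start]];
  seen[a.start * nB + b.start] = true;
  for (let i = 0; i < queue.length; i++) {
    const [p, q] = queue[i];
    if (a.accepting[p] && !b.accepting[q]) return false; // counterexample found
    for (let s = 0; s < a.alphabetSize; s++) {
      const p2 = a.transitions[p][s];
      const q2 = b.transitions[q][s];
      const id = p2 * nB + q2;
      if (!seen[id]) {
        seen[id] = true;
        queue.push([p2, q2]);
      }
    }
  }
  return true;
}

// Two symbol classes: 0 = the digit "0", 1 = any digit "1".."9".
// /^[1-9]\d*$/ : state 0 = start, 1 = accepted, 2 = dead
const positiveInteger: Dfa = {
  alphabetSize: 2,
  start: 0,
  accepting: [false, true, false],
  transitions: [[2, 1], [1, 1], [2, 2]],
};

// /^\d*$/ : a single accepting state
const digits: Dfa = {
  alphabetSize: 2,
  start: 0,
  accepting: [true],
  transitions: [[0, 0]],
};

const assignable = isSubsetLanguage(positiveInteger, digits); // true
const reverse = isSubsetLanguage(digits, positiveInteger); // false: "" and "0" are counterexamples
```

Equivalence is then two subset checks, one in each direction. The search visits at most one entry per pair of states, so it is polynomial in the sizes of the two automata.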

mewalig commented 5 years ago

This is probably a dumb question, but... why isn't it fairly easy, if adequately limited, to support a validation function (either lambda or named)?

For example, suppose we use ":" to indicate that the next element is a validator (substitute whatever you want for ":" if you have an opinion on this):

type email = string : (s) => { return !!s.match(...) }
type phone_number = string : (n) => { return !!String(n).match(...) }
type excel_worksheet_name = string : (s) => { return (s != "History") && s.length <= 31 && ... }

As an initial start, typescript could only accept validation functions that:

have a single argument, which is required/assumed to be of the "base" type
only reference variables that are defined in the validator function
return a value (which will be coerced to bool in the validation process)

The above conditions seem easy for the typescript compiler to verify, and once those conditions are assumed, much of the implementation complexity would go away.

In addition, if necessary to restrict initial scope to a manageable size:

validation functions can only be added to a subset of native types (string, number)

I don't think this last restriction would be all that necessary, but if there is any question as to whether it would be, I also don't think it would be worth spending much time debating it, because a solution with the above limitation would still solve a huge range of real-world use cases. In addition, I see little downside of the above limitations because relaxing them later would be a simple and natural extension that would require no change in the basic syntax and would merely expand the breadth of language support by the compiler.

m93a commented 5 years ago

@mewalig That would mean that something that looks like a runtime function would actually execute not at runtime but at compile time (and every time you want to check assignability). These functions couldn't access anything from the runtime (variables, functions) which would feel pretty awkward.

Plus you generally don't want the compiler to run anything you throw at it, especially badly optimized functions or outright malicious while(true){}. If you want meta-programming, you have to design it smartly. Just randomly allowing runtime code to run at compile time would be the "PHP way" to do it.

Finally, the syntax you propose switches the usual pattern

let runtime: types = runtime;

(ie. types after colon) inside out, effectively being

type types = types: runtime;

which is horrible. So thank you for your proposal, but it's definitely a bad idea.

simonbuchan commented 5 years ago

These functions couldn't access anything from the runtime (variables, functions) which would feel pretty awkward.

Of course they could, if the compiler has an ECMAScript runtime available to it (tsc does, BTW!). You obviously have an ambiguity issue with the compile-time semantics of e.g. fetch() vs. runtime semantics, but that's what iteration is about.

Just randomly allowing runtime code to run at compile time would be the "PHP way" to do it.

It's pretty similar to C++ constexpr functions, which are fine. The solution there is to say that constexpr can only use constexpr, but everything can use constexpr. Then you could have constexpr-equivalent versions of the filesystem for the compile-time filesystem which could be quite powerful.

The syntax also looks roughly fine to me: the LHS is a type, of course the RHS is a type of some sort too. My issue is more about how you would compose types past the "base" type, but that's all solvable too.

So thank you for your proposal, but it's definitely a bad idea.

It may end up being a bad idea, but for now I'm just seeing a very underspecified idea that will likely require straying too far from the goals of typescript. It doesn't mean that there might not be a good idea that is similar to it!

mpawelski commented 5 years ago

The discussion about this feature seems to have stopped for now (the PR is closed, and according to the design notes the team doesn't want to commit to this until we have nominal types and generalized index signatures and know what those look like).

Anyway, I want to propose another hypothetical extension to current PR that would support regex pattern extraction (@isiahmeadows presented his own proposal, but to be honest I cannot wrap my head around it now...).

I really like current PR and would base my proposal on that. I would like to propose the syntax based on generic type arguments inference that we have for functions (and conditional types with infer keyword). Simply because people already have some intuition that in generic function you can "extract" types from passed literal objects.

For example we have this type.

type Prop1 = /(\w)\.(\w)/

and we can use this type to test literal types

const goodLiteral = "foo.bar";
const badLiteral = "foo";
const regexTest1: Prop1 = goodLiteral; //no error
const regexTest2: Prop1 = badLiteral; //compiler error

function funProp1(prop: Prop1) { } 

funProp1(goodLiteral); //no error
funProp1(badLiteral); //error

However, when we use Regex type in function parameter we can use angle brackets syntax to mean that we want to infer matched strings. For example

type Prop1 = /(\w)\.(\w)/
const Prop1 = /(\w)\.(\w)/

const goodLiteral = "foo.bar";
const badLiteral = "foo";

function funProp1<M1 extends string, M2 extends string>(prop: Prop1<M1, M2>) : [M1, M2] 
{
    const m = prop.match(Prop1);
    return [m[1], m[2]];
} 

const res1 = funProp1(goodLiteral); //no error. Function signature inferred to be funProp<"foo", "bar">(prop: Prop1<"foo", "bar">) : ["foo", "bar"]
const res2 = funProp1(badLiteral); //compiler error

notice that inferred type of res1 is ["foo", "bar"]

Is it any useful?

Ember.js/lodash get function

You could implement type-safe "string path" getter so this code would work:

const deep = get(objNested, "nested.very.deep")

But probably it would require to solve this if we want to avoid many overloads for fixed maximum number of possible get's "depth".
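As a point of comparison, a getter of this kind can be approximated without regex types by using recursive conditional types over template literal types, which requires TypeScript 4.1 or newer. This is a minimal sketch, not Lodash's or Ember's actual typings; `PathValue` and this `get` are hypothetical and only handle plain dot-separated property names.

```typescript
// Walk the path one segment at a time at the type level.
type PathValue<T, P extends string> =
  P extends `${infer Head}.${infer Rest}`
    ? Head extends keyof T
      ? PathValue<T[Head], Rest>
      : never
    : P extends keyof T
      ? T[P]
      : never;

function get<T, P extends string>(obj: T, path: P): PathValue<T, P> {
  let current: unknown = obj;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null) {
      current = undefined; // the path ran past a non-object value
      break;
    }
    current = (current as Record<string, unknown>)[key];
  }
  // The cast is unchecked: the type-level walk and the runtime walk
  // are kept in sync by hand.
  return current as PathValue<T, P>;
}

const objNested = { nested: { very: { deep: 42 } } };
const deep: number = get(objNested, "nested.very.deep");
```

In this sketch a misspelled path does not produce an error at the call site. The result type simply becomes never, which surfaces as an error only when the value is used.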

Use extracted parameters in mapped types.

For example, if we were able to do something like https://github.com/Microsoft/TypeScript/issues/12754, we could then write the reverse function (strip some prefix/suffix from all properties of a given type). This one would probably need to introduce some more generalized form of mapped type syntax to choose a new key for the property (for example, syntax like { [ StripAsyncSuffix<P> for P in K ] : T[P] }; someone already proposed something like that).

There would probably be other use cases too, but I guess most would fit into these two kinds (1. figuring out the proper type based on a provided string literal, 2. transforming the property names of an input type into the property names of a newly defined type).
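For reference, TypeScript 4.1 added key remapping in mapped types (an `as` clause), which can express this kind of renaming together with template literal inference. The sketch below assumes TypeScript 4.1 or newer; `StripAsyncSuffix`, `WithoutAsyncSuffix`, and the runtime helper are hypothetical names chosen to mirror the example above.

```typescript
// Type-level rename: drop a trailing "Async" from a key if present.
type StripAsyncSuffix<K extends PropertyKey> =
  K extends `${infer Base}Async` ? Base : K;

// Key remapping: the `as` clause picks the new key for each property.
type WithoutAsyncSuffix<T> = {
  [P in keyof T as StripAsyncSuffix<P>]: T[P];
};

// Runtime counterpart that performs the same renaming on an object.
function stripAsyncSuffix<T extends Record<string, unknown>>(
  source: T
): WithoutAsyncSuffix<T> {
  const suffix = "Async";
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(source)) {
    const renamed =
      key.slice(-suffix.length) === suffix ? key.slice(0, -suffix.length) : key;
    out[renamed] = source[key];
  }
  // Unchecked cast: the type-level and runtime renaming are kept in sync by hand.
  return out as unknown as WithoutAsyncSuffix<T>;
}

const api = { loadAsync: () => 1, name: "demo" };
const renamed = stripAsyncSuffix(api);
const loaded: number = renamed.load(); // `load` exists on the remapped type
```

If two source keys collapse to the same name (for example `load` and `loadAsync`), the last one wins at runtime while the type becomes a union of both property types, so this sketch is best limited to inputs without such collisions.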

thomasparsons commented 5 years ago

This is something we could do with.

I am currently building custom lint rules in order to be able to validate urls - though, this would be much easier if we could define the optional params - which requires a regex in order to be able to validate our ids

In general, this would provide us with much more power to assert the validity of props across our code base

lf94 commented 5 years ago

Is there any movement on the type providers, template string literals, or other suggestions? This would be such a great tool.

omidkrad commented 4 years ago

My workaround for this currently is to use a marker interface like this.

interface TickerSymbol extends String {}

The only problem is that when I want to use it as a index key, I have to cast it to string.

interface TickerSymbol extends String {}
var symbol: TickerSymbol = 'MSFT';
// declare var tickers: {[symbol: TickerSymbol]: any}; // Error: index key must be string or number
declare var tickers: {[symbol: string]: any};
// tickers[symbol]; // Type 'TickerSymbol' cannot be used as an index type
tickers[symbol as string]; // OK

However, JavaScript seems to be fine with index type of String (with capital S).

var obj = { one: 1 }
var key = new String('one');
obj[key]; // TypeScript Error: Type 'String' cannot be used as an index type.
// but JS gives expected output:
// 1

dead-claudia commented 4 years ago

@DanielRosenwasser I have a proposal here, and a separate proposal was created in late 2016, so could the labels for this be updated?

RyanCavanaugh commented 4 years ago

We've reviewed the above proposals and have some questions and comments.

Problematic Aspects of Proposals so far

Types Creating Emit

We're committed to keeping the type system fully-erased, so proposals that require type aliases to produce emitted code are out of scope. I'll highlight some examples in this thread where this has happened perhaps in a way that isn't obvious:

https://github.com/microsoft/TypeScript/issues/6579#issuecomment-220180091 - creates a function and a type at the same time

type Integer(n:number) => String(n).match(/^[0-9]+$/)

https://github.com/microsoft/TypeScript/issues/6579#issuecomment-261519733 - also does this

type CssColor = /^#([0-9a-f]{3}|[0-9a-f]{6})$/i;
// ... later
setFontColorFromString(color: string) {
    fontColor = color;// compile time error
    if (CssColor.test(color)) {
    //  ^^^^^^^^ no value declaration of 'CssColor' !
        fontColor = color;// correct
    }
}

I'll reiterate: this is a non-starter. Types in TypeScript are composable and emitting JS from types is not possible in this world. The longest proposal to date has extensive emit-from-types; this isn't workable. For example, this would require extensive type-directed emit:

type Matcher<T extends number | boolean> = T extends number ? /\d+/ : /true|false/;
function fn<T extends number | boolean>(arg: T, s: Matcher<T>) {
  type R = Matcher<T>
  if (R.test(arg)) {
      // ...
  }
}
fn(10, "10");
fn(false, "false");

Bans on Intersections

Actually common types and regex-validated types are really different, so we need rules for how to correctly handle their unions and intersections.

type Regex_1 = / ... /;
type Regex_2 = / ... /;
type NonRegex = { ... };
type test_4 = Regex_1 & NonRegex;// compile time error

TypeScript can't error on instantiations of intersections, so this wouldn't be part of any final design.

Ergonomics

Overall our most salient takeaway is that we want something where you're not writing the same RegExp twice (once in value space, once in type space).

Given the above concerns about type emit, the most realistic solution is that you would write the expression in value space:

// Probably put this in lib.d.ts
type PatternOf<T extends RegExp> = T extends { test(s: unknown): s is infer P } ? P : never;

const ZipCode = /^\d\d\d\d\d$/;
function map(z: PatternOf<typeof ZipCode>) {
}

map('98052'); // OK
map('Redmond'); // Error

You could still write the RegExp in type space, of course, but there'd be no runtime validation available and any nonliteral use would require a re-testing or assertion:

function map(z: /^\d\d\d\d\d$/) { }
map('98052'); // OK
map('Redmond'); // Error

function fn(s: string) {
    map(s); // Error
    // typo
    if (/^\d\d\d\d$/.test(s)) {
        // Error, /^\d\d\d\d$/ is not assignable to /^\d\d\d\d\d$/
        map(s);
    }

    if (/^\d\d\d\d\d$/.test(s)) {
        // OK
        map(s);
    }
}

Collection and Clarification of Use Cases

For a new kind of type, we'd ideally like to see several examples where:

The problem being solved has no better alternative (including plausible alternatives which aren't yet in the language)
The problem occurs with meaningful frequency in real codebases
The proposed solution solves that problem well

Compile-Time Validation of Literals

This thread implies a wide variety of use cases; concrete examples have been more rare. Troublingly, many of these examples don't seem to be complete - they use a RegExp that would reject valid inputs.

- Font color - AFAIK anything that accepts hex colors also accepts e.g. "white" or "skyblue". This also incorrectly rejects rgb(255, 0, 0) syntax.
- SSN, Zip, etc - OK, but why are there literal SSNs or Zip Codes in your code? Is this actually a need for nominal types? What happens if you have a subclass of strings that can't be accurately described by a RegExp? See "Competing proposals"
- Integer - incorrectly rejects "3e5"
- Email - This is usually considered a bad idea. Again though, there are email address string literals in your code?
- CSS Border specs - I could believe that a standalone library could provide an accurate RegEx to describe the DSL it itself supports
- Writing tests - this is where hardcoded inputs make some sense, though this is almost a counterpoint because your test code should probably be providing lots of invalid inputs
- Date formats - how/why? Date has a constructor for this; if the input comes from outside the runtime then you just want a nominal type
- URI - you could imagine that fetch would specify host to not begin with http(s?):

TODO: Please help us by identifying real library functions that could benefit from RegExp types, and the actual expression you'd use.

One concern is "precisionitis" - what happens when someone helpfully shows up to DefinitelyTyped and adds RegExp types to every function in a library, thus breaking every nonliteral invocation? Worse, the definition file authors will have to agree exactly with the consumers of it what the "right spelling" of a validation RegExp is. It seems like this quickly puts us on the road to a Tower of Babel situation where every library has their own version of what qualifies as a URL, what qualifies as a host name, what qualifies as an email, etc, and anyone connecting two libraries has to insert type assertions or copy regexes around to satisfy the compiler.

Enforcement of Runtime Checks

There has been some discussion of checks where we want to ensure that a function's arguments have been validated by a prior regex, like fn in the earlier Ergonomics section. This seems straightforward and valuable, if the RegEx that needs testing against is well-known. That's a big "if", however -- in my recollection, I can't remember a single library that provides validation regexes. It may provide validation functions - but this implies that the feature to be provided is nominal or tagged types, not regex types.

Counter-evidence to this assessment is welcomed.

Property Keys / Regex String Indexers

Some libraries treat objects according to the property names. For example, in React we want to apply types to any prop whose name starts with aria-:

interface IntrinsicElements {
    // ....
    [attributeName: /aria-\w+/]: number | string | boolean;
}

This is effectively an orthogonal concept (we could add Regex types without adding Regex property keys, and vice versa).
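For the simple prefix case, something close to this can be sketched with a template-literal pattern index signature, assuming a TypeScript version that supports them (4.4 or later, as far as I know). The names `AriaProps`, `dialogProps`, and `countAriaProps` are made up for illustration. The pattern `aria-${string}` is also looser than /aria-\w+/: it accepts any characters (or none) after the prefix.

```ts
// Hypothetical props type: any key starting with "aria-" gets the listed value type.
interface AriaProps {
    [attributeName: `aria-${string}`]: number | string | boolean;
}

const dialogProps: AriaProps = {
    "aria-label": "Close dialog",
    "aria-hidden": false,
    "aria-level": 2,
};

// Counts the aria-* keys that are actually present at runtime.
function countAriaProps(p: AriaProps): number {
    return Object.keys(p).filter((key) => key.indexOf("aria-") === 0).length;
}
```

This only covers prefix/suffix style patterns, not arbitrary regular expressions.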

TODO (me or anyone): Open a separate issue for this.

Competing Proposals

Nominal or Tagged types

Let's say we had nominal/tagged types of some sort:

type ZipCode = make_unique_type string;

You could then write a function

function asZipCode(s: string): ZipCode | undefined {
    return /^\d\d\d\d\d$/.test(s) ? (s as ZipCode) : undefined;
}

At this point, would you really even need RegExp types? Refer to "compile-time" checking section for more thoughts.
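Since `make_unique_type` is hypothetical syntax, here is a minimal sketch of the same idea in today's TypeScript, using a branded string as a stand-in. The brand symbol `zipCodeBrand` and the helper `formatZipLabel` are assumptions for illustration, not part of any proposal.

```ts
// Stand-in for the hypothetical `make_unique_type string`: a string branded with a
// phantom property that exists only at the type level.
declare const zipCodeBrand: unique symbol;
type ZipCode = string & { readonly [zipCodeBrand]: true };

// The only sanctioned way to obtain a ZipCode from an arbitrary string.
function asZipCode(s: string): ZipCode | undefined {
    return /^\d{5}$/.test(s) ? (s as ZipCode) : undefined;
}

// A consumer that refuses plain strings at compile time.
function formatZipLabel(zip: ZipCode): string {
    return "ZIP " + zip;
}

const parsedZip = asZipCode("98052");
const zipLabel = parsedZip !== undefined ? formatZipLabel(parsedZip) : "invalid";
// formatZipLabel("98052"); // compile-time error: a plain string is not a ZipCode
```

The validation still happens at runtime; the brand only records that it happened.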

Conversely, let's say we had RegExp types and not nominal types. It becomes pretty tempting to start (ab)using them for non-validation scenarios:

type Password = /(IsPassword)?.*/;
type UnescapedString = /(Unescaped)?.*/;
declare function hash(p: Password): string;

const p: Password = "My security is g00d"; // OK
const e: UnescapedString = "<div>''</div>"; // OK
hash(p); // OK
hash(e); // Error
hash("correct horse battery staple"); // OK

A common thing in the thread is that these regexes would help validate test code, because even though in production scenarios the code would be running against runtime-provided strings rather than hardcoded literals, you'd still want some validation that your test strings were "correct". This would seem to be an argument for nominal/tagged/branded strings instead, though, since you'd be writing the validation function either way, and the benefit of tests is that you know they run exhaustively (thus any errors in test inputs would be flagged early in the development cycle).

Non-Issues

We discussed the following aspects and consider them to not be blockers

Host Capabilities

Newer runtimes support more RegExp syntax than older runtimes. Depending on where the TypeScript compiler runs, certain code might be valid or invalid according to the runtime's capabilities of parsing newer RegExp features. In practice, most of the new RegExp features are fairly esoteric or relate to group matching, which don't seem to align with most of the use cases here.

Performance

RegExes can do an unbounded amount of work and matching against a large string can do an arbitrarily large amount of work. Users can already DOS themselves through other means, and are unlikely to write a maliciously inefficient RegExp.

Subtyping (`/\d/` `->` `/./` ?), Union, Intersection, and Uninhabitability

In theory /\d+/ is a knowable subtype of /.+/. Supposedly algorithms exist to determine if one RegExp matches a pure subset of another one (under certain constraints), but obviously would require parsing the expression. In practice we're 100% OK with RegExpes not forming implicit subtype relationships based on what they match; this is probably even preferable.

Union and Intersection operations would work "out of the box" as long as the assignability relationships were defined correctly.

In TypeScript, when two primitive types "collide" in an intersection, they reduce to never. When two RegExpes are intersected, we'd just keep that as /a/ & /b/ rather than try to produce a new RegExp matching the intersection of the two expressions. There wouldn't be any reduction to never; for that, we'd need an algorithm to prove that no string could satisfy both sides (this is a parallel problem to the one described earlier re: subtyping).

Next Steps

To summarize, the next steps are:

- File a separate issue for Regex-named property keys AKA regex string indexers
- Get concrete and plausible use cases for compile-time validation of string literals
  - Example: Identify functions in DefinitelyTyped or other libraries that would highly benefit from this
- Understand if nominal/tagged/branded types are a more flexible and broadly-applicable solution for non-literal validation
- Identify libraries that are providing validation RegExes already

katywings commented 4 years ago

Use case: Hyperscript (https://github.com/hyperhype/hyperscript) like functions. A hyperscript function is usually called like h('div#some-id'). A regex-ish pattern matcher would make it possible to determine the return type of h, which would be HTMLDivElement in the example case.

Akxe commented 4 years ago

If the type system were able to concatenate string literal types, then basically any CSS property could be type-safe

declare let width: number;
declare let element: HTMLElement;

element.style.height = `${width}px`;
// ...or
element.style.height = `${width}%`;
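A rough sketch of what this could look like with template literal types, assuming TypeScript 4.3 or later (where a template expression is checked against a template literal return type). `CssLength`, `px`, and `percent` are hypothetical names, and the real `style.height` property is still typed as a plain string.

```ts
// Hypothetical type: a number immediately followed by one of two CSS units.
type CssLength = `${number}px` | `${number}%`;

function px(n: number): CssLength {
    return `${n}px`;
}

function percent(n: number): CssLength {
    return `${n}%`;
}

const fixedHeight: CssLength = "42px";          // literal is checked against the pattern
const relativeHeight: CssLength = percent(50);  // built from a number
// const wrongHeight: CssLength = "tall";       // rejected at compile time
```

Note that `${number}` is fairly permissive about what counts as a number, so this is a loose check rather than a CSS grammar.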

CSS selectors could be validated too (element.class#id - valid, div#.name - invalid)

If capturing groups would work (somehow) then Lodash's get method could be type-safe

var object = { 'a': [{ 'b': { 'c': 3 } }] };

_.get(object, 'a[0].b.c');
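As a sketch of the idea (not Lodash's real typings), a dot-separated path can be resolved at the type level with template literal inference. `PathValue` and `getPath` are hypothetical names, and this simplified version handles only `a.b.c` style paths, not the `a[0]` index syntax shown above.

```ts
// Resolves the value type at a dot-separated path, or undefined when a segment is unknown.
type PathValue<T, P extends string> =
    P extends `${infer Key}.${infer Rest}`
        ? Key extends keyof T ? PathValue<T[Key], Rest> : undefined
        : P extends keyof T ? T[P] : undefined;

// Hypothetical typed variant of a `get` helper.
function getPath<T, P extends string>(obj: T, path: P): PathValue<T, P> {
    let current: unknown = obj;
    for (const key of path.split(".")) {
        // Step into the next segment only when the current value is an object.
        current = current !== null && typeof current === "object"
            ? (current as Record<string, unknown>)[key]
            : undefined;
    }
    return current as PathValue<T, P>;
}

const nestedConfig = { a: { b: { c: 3 } } };
const leafValue = getPath(nestedConfig, "a.b.c"); // typed as number
```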

This could be a thing too:

interface IOnEvents {
  [key: PatternOf</on[a-z]+/>]: (event: Event) => void;
}

interface IObservablesEndsOn$ {
  [key: PatternOf</\$$/>]: Observable<any>;
}

RyanCavanaugh commented 4 years ago

Use case: Hyperscript (hyperhype/hyperscript) like functions

What would that regex look like, or what validation would it provide? Is this for regex-based function overloading?

FWIW The library accepts namespaced tag names and also functions on arbitrary tag names

> require("hyperscript")("qjz").outerHTML
'<qjz></qjz>'

It also accepts an unbounded mixing of class and id values

> require("hyperscript")("baz.foo#bar.qua").outerHTML
'<baz class="foo qua" id="bar"></baz>'

RyanCavanaugh commented 4 years ago

CSS selectors could be validated too

CSS selectors cannot be validated by a regular expression

simonbuchan commented 4 years ago

What would that regex look like, or what validation would it provide? Is this for regex-based function overloading?

Not the OP, but I presume, yes, something like the HTMLDocument#createElement() overloads, e.g.:

// ...
export declare function h(query: /^canvas([\.#]\w+)*$/): HTMLCanvasElement;
// ...
export declare function h(query: /^div([\.#]\w+)*$/): HTMLDivElement;
// ...

I'm sure the REs are incomplete. Note that this is a special case of validating CSS selectors, which are used in many contexts in a regular way. For example, it's perfectly OK for HTMLDocument.querySelector() to return HTMLElement as a fallback if you're using a complex selector.

I am curious if there are non-overloading examples that are both feasible and useful, though.

omidkrad commented 4 years ago

TODO: Please help us by identifying real library functions that could benefit from RegExp types, and the actual expression you'd use.

My use case is the one I explained in this comment in the CCXT library where I have strings that represent TickerSymbols. I don't really care if they are checked for a regex pattern, but I want them to be treated as sub-types of string so I get more strict assignments, parameter type checking, etc. I found it to be very useful when I'm doing functional programming, with that I can easily track TickerSymbols, Currencies, Assets, etc at compile-time where at run-time they are just normal strings.

m93a commented 4 years ago

@omidkrad This sounds like you need nominal types, not regex-validated types.

omidkrad commented 4 years ago

@m93a In my case I will be fine with nominal types, but for the same use case you could use regex-validated types for stricter type checking and self-documenting the string types.

Akxe commented 4 years ago

CSS selectors could be validated too

CSS selectors cannot be validated by a regular expression

Well, if the regexp would enable us to stitch them together we could copy CSS regexes..., right?

acutmore commented 4 years ago

The (draft) CSS Typed Object Model

https://drafts.css-houdini.org/css-typed-om/

https://developers.google.com/web/updates/2018/03/cssom

Potentially alleviates the desire to use the stringly-typed CSS model.

el.attributeStyleMap.set('padding', CSS.px(42));
const padding = el.attributeStyleMap.get('padding');
console.log(padding.value, padding.unit); // 42, 'px'

dead-claudia commented 4 years ago

@RyanCavanaugh For Mithril in particular, the tag name is extracted via the capture group in ^([^#\.\[\]]+) (defaults to "div"), but matching ^(${htmlTagNames.join("|")}) would be sufficient for our purposes. And so using my proposal, this would be sufficient for my purposes:

type SelectorAttrs = "" | `#${string}` | `.${string}`;

type GetTagName<T extends string> =
    T extends SelectorAttrs ? "div" :
    T extends `${keyof HTMLElementTagNameMap & (infer Tag)}${SelectorAttrs}` ? T :
    string;

As for events and attributes, we could switch to this once negated types land:

type EventsForElement<T extends Element> =
    T extends {addEventListener(name: infer N, ...args: any[]): any} ? N : never;

type MithrilEvent<E extends string> =
    (E extends EventsForElement<T> ? HTMLElementEventMap[E] : Event) &
    {redraw?: boolean};

type Attributes<T extends Element> =
    LifecycleAttrs<T> &
    {[K in `on${string}` & not LifecycleAttrs<T>](
        ev: K extends `on${infer E}` ? MithrilEvent<E> : never
    ): void | boolean} &
    {[K in keyof T & not `on${string}`]: T[K]} &
    {[K in string & not keyof T & not `on${string}`]: string};

BTW, this seamless integration and avoidance of complexity is why I still prefer my proposal over literal regexps.

I know of no way to do this with pure regexp types, though. I do want to point that out.
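For reference, the tag-name extraction part can be approximated with the template literal types that shipped in TypeScript 4.1. This is only a sketch under that assumption: `Before`, `OrDiv`, `TagName`, and `tagNameOf` are made-up names, it returns the tag name as a string type rather than an element type, and it does none of the validation discussed in this thread.

```ts
// Everything in S before the first occurrence of delimiter D (or S itself if D is absent).
type Before<S, D extends string> =
    S extends `${infer Head}${D}${string}` ? Head : S;

// An empty tag name defaults to "div".
type OrDiv<T> = T extends "" ? "div" : T;

// Strip attributes first, so "#" or "." inside [attr=...] values are ignored.
type TagName<S extends string> = OrDiv<Before<Before<Before<S, "[">, "#">, ".">>;

function tagNameOf<S extends string>(selector: S): TagName<S> {
    // Same character class as the capture group quoted above.
    const match = /^[^#.\[\]]+/.exec(selector);
    const tag: unknown = match ? match[0] : "div";
    return tag as TagName<S>;
}

const anchorTag: "a" = tagNameOf("a#exit.external[href='https://example.com']");
const boxTag: "div" = tagNameOf(".box.box-bordered");
```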

Ovyerus commented 4 years ago

TODO: Please help us by identifying real library functions that could benefit from RegExp types, and the actual expression you'd use.

bent has a different return type based on what is given as a string that describes the expected response type, e.g.

bent('json')('https://google.com') // => Promise<JSON>
bent('buffer')('https://google.com') // => Promise<Buffer | ArrayBuffer>
bent('string')('https://google.com') // => Promise<String>

It also accepts some other arguments, such as method and url as strings, but these can appear in any position, so if we try to use unions to describe all the return type ('json' | 'buffer' | 'string'), this would instead dumb down to just string when combined with the url and method types in the union, meaning we can't automatically infer the return type based on the type given in first call.

DanielRosenwasser commented 4 years ago

@Ovyerus how would regex types help you there? What would you expect to write? You can model something similar to bent's behavior with overloads or conditional types.

type BentResponse<Encoding> = Promise<
    Encoding extends "json" ? MyJsonType :
    Encoding extends "buffer" ? Buffer | ArrayBuffer :
    Encoding extends "string" ? string :
    Response
>;

declare function bent<T extends string>(urlOrEncoding: T): (url: string) => BentResponse<T>;

http://www.typescriptlang.org/play/index.html#code/C4TwDgpgBAQhB2wBKEDOYD29UQDwFF4BjDAEwEt4BzAPigF4oAFAJwwFtydcBYAKCiCohEhWpQIAD2AJSqKACIAVqiwKoAfigBZEAClV8ACrhoALn5DhxMpSoTps+QoBGAVwBmHiC3VaYnt4sUAA+UACCLCwAhiABXj5QFgJCIrbiUjLwcoqowCx2flB5BeLJVijoWDj8NADc-PykEEQANtEs0B5uxMDkWFAuCMC4Rg5ZOSV2NAAUbiytAPIsaWJUZlBGAJQbcwsbU9RbDHRwiJWY2HhG9Y18lDIsHtFE0PFBUADeUAD67gksDbReAgKAAX34Dx8z1eOn0hhMkC+vxUWCBIPBdxm0VQIGIUBmx3odE+liErQgwCgkg2ugMWER0EY0QA7tFyFShogZspDAotjyABbAYBgVBmAD0Eqk0XYYApADoSOx+Q0+GCBVsgA

Ovyerus commented 4 years ago

Oh I was unclear sorry, I believe my issue was more along the lines of matching http(s): at the start of a string to detect base URL.

Bent's signature is more along the lines of

type HttpMethods = 'GET' | 'PATCH' | ...
type StatusCode = number;
type BaseUrl = string; // This is where I would ideally need to see if a string matches http(s):
type Headers = { [x: string]: any; };

type Options = HttpMethods | StatusCode | BaseUrl | Headers;

function bent(...args: Options[]): RequestFunction<RawResponse>
function bent(...args: (Options | 'json')[]): RequestFunction<JSON>
// and so on

However having BaseUrl as a string absorbs the HttpMethods and return type unions, which ends up as just string. Having it just as a string also "improperly" matches how bent works, as it does check for the presence of ^http: or ^https: in order to determine what it should use as the base url.

If we had regex types, I could define BaseUrl as type BaseUrl = /^https?:/, and this ideally would properly verify strings that aren't a HTTP method or response encoding, as well as not absorbing them into the string type.
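A minimal sketch of how that could look with template literal types standing in for /^https?:/ (an approximation, not a regex type). All names here (`HttpMethod`, `Encoding`, `BentOption`, `classifyOption`) are hypothetical and are not bent's actual typings; the method list is shortened for illustration.

```ts
type HttpMethod = "GET" | "POST" | "PATCH";
type Encoding = "json" | "buffer" | "string";
// Approximates /^https?:/ at the type level.
type BaseUrl = `http:${string}` | `https:${string}`;
// This union is not absorbed into plain `string`, so the literal members survive.
type BentOption = HttpMethod | Encoding | BaseUrl;

type OptionKind = "method" | "encoding" | "baseUrl";

// Mirrors the runtime check described above: a leading http: or https: marks a base URL.
function classifyOption(option: BentOption): OptionKind {
    if (/^https?:/.test(option)) return "baseUrl";
    if (option === "json" || option === "buffer" || option === "string") return "encoding";
    return "method";
}

// classifyOption("ftp://example.com"); // compile-time error: not a BentOption
```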

prokopsimek commented 4 years ago

Exactly, I am the same.

-- Prokop Simek


distractdiverge commented 4 years ago

The thought I had of a use case was to detect parameter types to a function.

Basically I have a well defined regex format of a string representing an identifier. I could use decorators, but an extended string type would let me use a type to represent the identifier passed to the function.

DanielRosenwasser commented 4 years ago

To reiterate, we need examples of JavaScript code you want to write in a typed way - otherwise we can only guess what you're trying to model (and whether there's already a way to model it).

yannickglt commented 4 years ago

@DanielRosenwasser Below's an example of code we would like to enforce typing about. http://www.typescriptlang.org/play/index.html#code/C4TwDgpgBAqjCSARKBeKBnYAnAlgOwHMBYAKFIGMAbAQ3XVnQiygG9SoOoxcA3a4aJn45yULBGoATAPZ5KIKAFtq+GIywAuWAmRoA5NQCcEAAwBGQwCMArAFoAZtYBMAdlsAWcvYActw2ZM7JzNJF28TCCcANgtLPXZOAHpErl5+QWBhUXEpWXklFTw1Ji04JFQoMycAZihkqAA5AHkAFSgAQQAZTqaAdQBRRASOeu4cPgEMTOARMQkZOQVlVXVSnQq9PGlgW2pbAFd9nEk9OpSAZQAJJphO5Ga2gCF+ju6+wah++BbL-oAlYZQciyTBYfbkYDSLAACkBnC4+0slFmxzWSAANHDOGBEcjRJYsNQ8JItKD8ARMSR4fCcUjZuocNRKFo8PtFJYmJTqdjcbNyDkBJJHiA0boGEwAHTLIrqACUrFICQAvqQVSQgA

RyanCavanaugh commented 4 years ago

@yannickglt it seems like you want a nominal type, not a RegExp type? You're not expecting callers to show up with random site-validated invocations like this:

// OK
someFunc('a9e019b5-f527-4cf8-9105-21d780e2619b');
// Also OK, but probably really bad
someFunc('a9e019b5-f527-4cf8-9106-21d780e2619b');
// Error
someFunc('bfe91246-8371-b3fa-3m83-82032713adef');

IOW the fact that you are able to describe a UUID with a regular expression is an artifact of the format of the string itself, whereas what you are trying to express is that UUIDs are a special kind of type whose backing format happens to be a string.

Shinigami92 commented 4 years ago

So the combination of 3.7's Assertion Functions and the nominal Feature can do this (?)

nominal UUID = string

function someFunc(uuid: any): asserts uuid is UUID {
  if (!UUID_REGEX.test(uuid)) {
    throw new AssertionError("Not UUID!")
  }
}

class User {
  private static readonly mainUser: UUID = someFunc('a9e019b5-f527-4cf8-9105-21d780e2619b')
  // private static readonly mainUser: UUID = someFunc(123) // assertion fails
  // private static readonly mainUser: UUID = someFunc('not-a-uuid') // assertion fails
  constructor(
    public id: UUID,
    public brand: string,
    public serial: number,
    public createdBy: UUID = User.mainUser) {

  }
}

Will this fail also?

new User('invalid-uuid', 'brand', 1) // should fail
new User('invalid-uuid' as UUID, 'brand', 1) // 🤔

Shinigami92 commented 4 years ago

After thinking for a while, I see a problem with my proposed solution 🤔
The asserts only trigger an error at runtime -> 👎
The Regex-Validation could trigger a compile-time error -> 👍
Otherwise, this proposal makes no sense

Edit: Another issue: someFunc(uuid: any): asserts uuid is UUID doesn't return a UUID; it either throws or returns with uuid asserted as UUID. So I can't use this function to assign a UUID to mainUser in this way
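One way to get around the "doesn't return a UUID" problem, sketched with a branded string in place of the hypothetical `nominal` keyword and TypeScript 3.7 assertion functions. `uuidBrand`, `assertUUID`, and `toUUID` are names invented here, and the regex is a loose format check for illustration only. The check is still runtime-only for literals, which is the drawback noted above.

```ts
// Branded string standing in for the hypothetical `nominal UUID = string`.
declare const uuidBrand: unique symbol;
type UUID = string & { readonly [uuidBrand]: true };

// Loose 8-4-4-4-12 hex format check, for illustration only.
const UUID_REGEX = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function assertUUID(value: unknown): asserts value is UUID {
    if (typeof value !== "string" || !UUID_REGEX.test(value)) {
        throw new Error("Not a UUID: " + String(value));
    }
}

// Wraps the assertion so the validated value can be used in an initializer.
function toUUID(value: string): UUID {
    assertUUID(value);
    return value; // narrowed to UUID by the assertion above
}

class User {
    static readonly mainUser: UUID = toUUID("a9e019b5-f527-4cf8-9105-21d780e2619b");
    constructor(
        public id: UUID,
        public brand: string,
        public serial: number,
        public createdBy: UUID = User.mainUser,
    ) {}
}

// new User("invalid-uuid", "brand", 1); // compile-time error: string is not UUID
```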

dead-claudia commented 4 years ago

@RyanCavanaugh We want these to be correctly typed for Mithril:

// <div id="hello"></div>
m("div#hello", {
    oncreate(vnode) { const dom: HTMLDivElement = vnode.dom },
})

// <section class="container"></section>
m("section.container", {
    oncreate(vnode) { const dom: HTMLElement = vnode.dom },
})

// <input type="text" placeholder="Name">
m("input[type=text][placeholder=Name]", {
    oncreate(vnode) { const dom: HTMLInputElement = vnode.dom },
})

// <a id="exit" class="external" href="https://example.com">Leave</a>
m("a#exit.external[href='https://example.com']", {
    oncreate(vnode) { const dom: HTMLAnchorElement = vnode.dom },
}, "Leave")

// <div class="box box-bordered"></div>
m(".box.box-bordered", {
    oncreate(vnode) { const dom: HTMLDivElement = vnode.dom },
})

// <details></details> with `.open = true`
m("details[open]", {
    oncreate(vnode) { const dom: HTMLDetailsElement = vnode.dom },
})

// alias for `m.fragment(attrs, ...children)`
m("[", {
    oncreate(vnode) { const dom: HTMLElement | SVGElement = vnode.dom },
}, ...children)

We want to statically reject these:

// selector must be non-empty
m("")

// incomplete class
m("div.")

// incomplete ID
m("div#")

// incomplete attribute
m("div[attr=")

// not special and doesn't start /[a-z]/i
m("@foo")

Ideally, we'd also want to statically reject these, but it's not as high priority and we can survive without them:

// event handlers must be functions
m("div[onclick='return false']")

// `select.selectedIndex` is a number
m("select[selectedIndex='not a number']")

// `input.form` is read-only
m("input[type=text][form='anything']")

// `input.spellcheck` is a boolean, this evaluates to a string
// (This is a common mistake, actually.)
m("input[type=text][spellcheck=false]")

// invalid tag name for non-custom element
m("sv")

This would require a much more complicated type definition, one where we'd need a custom type check failure message to help users figure out why it failed to type check.

Other hyperscript libraries and hyperscript-based frameworks like react-hyperscript have similar concerns, too.

Hope this helps!

anion155 commented 4 years ago

@isiahmeadows a better way would be for you to use some form of selector string builder, which would return a branded string with correct typings. Like:

m(mt.div({ attr1: 'val1' }))

dead-claudia commented 4 years ago

@anion155 There's other ways of getting there, too, but this is about typing a library whose API was designed by its original author back in 2014. If I were designing its API now, I'd likely use m("div", {...attrs}, ...children) with none of the hyperscript sugar (easier to type, much simpler to process), but it's far too late now to do much about it.

AnyhowStep commented 4 years ago

I have A LOT to say. However, I'm impatient. So, I'll be releasing my thoughts a bit at a time.

https://github.com/microsoft/TypeScript/issues/6579#issuecomment-542405537

Regarding "precisionitis" (man, I love that word), I don't think we should be worrying about it too much.

The type system is already turing complete. This basically means we can be super-precise about a lot of things. (Like, modeling all of SQL? Shameless plug =P)

But you don't see (too many) people going all-out, and using all the type operators in crazy ways that block libraries from being compatible with each other. I like to think that library authors tend to be level-headed enough... Right?

It's not often that I've wished for string-pattern types/regex-validated string types but they definitely would have helped increase the type safety of my code base.

Use Case

Off the top of my head, I can think of one recent example. (There are a bunch more but I'm a forgetful being)

When integrating with Stripe's API (a payment processing platform), they use ch_ for charge-related identifiers, re_ for refund-related identifiers, etc.

It would have been nice to encode them with PatternOf</^ch_.+/> and PatternOf</^re_.+/>.

This way, when making typos like,

charge.insertOne({ stripeChargeId : someObj.refundId });

I would get an error,

Cannot assign `PatternOf</^re_.+/>` to `PatternOf</^ch_.+/>`

As much as I love nominal/tagged types, they are far more unergonomic and error-prone. I always see nominal/tagged types as a last resort, because it means that there's something that the TS type system just cannot model.

Also, tagged types are great for phantom types. Nominal types are basically never useful. (Okay, I may be biased. They're useful only because of unique symbol. But I like to think I'm not completely wrong.)

The "ValueObject" pattern for validation is even worse and I will not bother talking about it.

Comparison

Below, I will compare the following,

- String-pattern types/regex-validated string types
- Nominal types
- Structural tag types

We can all agree that the "ValueObject" pattern is the worst solution, and not bother with it in the comparisons, right?

String-pattern types

const stripeChargeIdRegex = /^ch_.+/;
const stripeRefundIdRegex = /^re_.+/;

type StripeChargeId = PatternOf<typeof stripeChargeIdRegex>;
type StripeRefundId = PatternOf<typeof stripeRefundIdRegex>;

declare function takesStripeChargeId (stripeChargeId : StripeChargeId) : void;

declare const str : string;
takesStripeChargeId(str); //Error
if (stripeChargeIdRegex.test(str)) {
  takesStripeChargeId(str); //OK
}
if (stripeRefundIdRegex.test(str)) {
  takesStripeChargeId(str); //Error
}

declare const stripeChargeId : StripeChargeId;
declare const stripeRefundId : StripeRefundId;
takesStripeChargeId(stripeChargeId); //OK
takesStripeChargeId(stripeRefundId); //Error

takesStripeChargeId("ch_hello"); //OK
takesStripeChargeId("re_hello"); //Error

Look at that.

Perfect for string literals.
Not too bad for string non-literals.
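`PatternOf` does not exist, but the prefix part of this example can be approximated with template literal types (TypeScript 4.1 or later). This is a sketch under that assumption: `ch_${string}` also accepts the bare string "ch_", unlike /^ch_.+/, so the runtime guard below uses /^ch_/ to stay consistent with the type. `routeStripeId` is a made-up helper.

```ts
type StripeChargeId = `ch_${string}`;
type StripeRefundId = `re_${string}`;

function isStripeChargeId(str: string): str is StripeChargeId {
    return /^ch_/.test(str);
}

function takesStripeChargeId(stripeChargeId: StripeChargeId): string {
    return "charge:" + stripeChargeId;
}

const literalCharge = takesStripeChargeId("ch_hello"); // OK, literal matches the pattern
// takesStripeChargeId("re_hello");                    // compile-time error

const refundId: StripeRefundId = "re_hello";
// takesStripeChargeId(refundId);                      // compile-time error

// Non-literal strings still need a runtime check before they are accepted.
function routeStripeId(str: string): string {
    return isStripeChargeId(str) ? takesStripeChargeId(str) : "rejected";
}
```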

Nominal types...

const stripeChargeIdRegex = /^ch_.+/;
const stripeRefundIdRegex = /^re_.+/;

type StripeChargeId = make_unique_type string;
type StripeRefundId = make_unique_type string;

function isStripeChargeId (str : string) : str is StripeChargeId {
  return stripeChargeIdRegex.test(str);
}
function isStripeRefundId (str : string) : str is StripeRefundId {
  return stripeRefundIdRegex.test(str);
}

declare function takesStripeChargeId (stripeChargeId : StripeChargeId) : void;

declare const str : string;
takesStripeChargeId(str); //Error
if (isStripeChargeId(str)) {
  takesStripeChargeId(str); //OK
}
if (isStripeRefundId(str)) {
  takesStripeChargeId(str); //Error
}

declare const stripeChargeId : StripeChargeId;
declare const stripeRefundId : StripeRefundId;
takesStripeChargeId(stripeChargeId); //OK
takesStripeChargeId(stripeRefundId); //Error

takesStripeChargeId("ch_hello"); //Error? Ughhhh
takesStripeChargeId("re_hello"); //Error

takesStripeChargeId("ch_hello" as StripeChargeId); //OK, BUT UNSAFE
takesStripeChargeId("re_hello" as StripeChargeId); //OK, BUT WAIT! I MESSED UP

const iKnowThisIsValid = "ch_hello";
if (isStripeChargeId(iKnowThisIsValid)) {
  takesStripeChargeId(iKnowThisIsValid); //OK
} else {
  throw new Error(`Wat? This should be valid`);
}

function assertsStripeChargeId (str : string) : asserts str is StripeChargeId {
  if (!isStripeChargeId(str)) {
    throw new Error(`Expected StripeChargeId`);
  }
}
assertsStripeChargeId(iKnowThisIsValid);
takesStripeChargeId(iKnowThisIsValid); //OK

function makeStripeChargeIdOrError (str : string) : StripeChargeId {
  assertsStripeChargeId(str);
  return str;
}
takesStripeChargeId(makeStripeChargeIdOrError("ch_hello")); //OK
takesStripeChargeId(makeStripeChargeIdOrError("re_hello")); //OK, compiles, throws during run-time... Not good

Look at that.

TERRIBLE for string literals.
After overcoming the string literal hurdle, it's not too bad... Right?

But the main use-case for this proposal is string literals. So, this is a terrible alternative.

Structural tag types...

Structural tag types are not much different from nominal types...

const stripeChargeIdRegex = /^ch_.+/;
const stripeRefundIdRegex = /^re_.+/;

type StripeChargeId = string & tag { stripeChargeId : void };
type StripeRefundId = string & tag { stripeRefundId : void };

function isStripeChargeId (str : string) : str is StripeChargeId {
  return stripeChargeIdRegex.test(str);
}
function isStripeRefundId (str : string) : str is StripeRefundId {
  return stripeRefundIdRegex.test(str);
}

declare function takesStripeChargeId (stripeChargeId : StripeChargeId) : void;

declare const str : string;
takesStripeChargeId(str); //Error
if (isStripeChargeId(str)) {
  takesStripeChargeId(str); //OK
}
if (isStripeRefundId(str)) {
  takesStripeChargeId(str); //Error
}

declare const stripeChargeId : StripeChargeId;
declare const stripeRefundId : StripeRefundId;
takesStripeChargeId(stripeChargeId); //OK
takesStripeChargeId(stripeRefundId); //Error

takesStripeChargeId("ch_hello"); //Error? Ughhhh
takesStripeChargeId("re_hello"); //Error

takesStripeChargeId("ch_hello" as StripeChargeId); //OK, BUT UNSAFE
takesStripeChargeId("re_hello" as StripeChargeId); //OK, BUT WAIT! I MESSED UP

const iKnowThisIsValid = "ch_hello";
if (isStripeChargeId(iKnowThisIsValid)) {
  takesStripeChargeId(iKnowThisIsValid); //OK
} else {
  throw new Error(`Wat? This should be valid`);
}

function assertsStripeChargeId (str : string) : asserts str is StripeChargeId {
  if (!isStripeChargeId(str)) {
    throw new Error(`Expected StripeChargeId`);
  }
}
assertsStripeChargeId(iKnowThisIsValid);
takesStripeChargeId(iKnowThisIsValid); //OK

function makeStripeChargeIdOrError (str : string) : StripeChargeId {
  assertsStripeChargeId(str);
  return str;
}
takesStripeChargeId(makeStripeChargeIdOrError("ch_hello")); //OK
takesStripeChargeId(makeStripeChargeIdOrError("re_hello")); //OK, compiles, throws during run-time... Not good

Look at that.

TERRIBLE for string literals.
After overcoming the string literal hurdle, it's not too bad... Right?

But the main use-case for this proposal is string literals. So, this is a terrible alternative.

Also, this structural tag type example is a literal (ha, pun) copy-paste of the nominal type example.

The only difference is in how the types StripeChargeId and StripeRefundId are declared.

Even though the code is basically the same, structural types are better than nominal types. (I'll clarify this in the next post, I swear).

Conclusion

This is just a conclusion for this comment! Not a conclusion to my overall thoughts!

String-pattern types/regex-validated string types are more ergonomic than nominal/structural tag types. Hopefully, my simple examples were not too contrived and have demonstrated that sufficiently.

Conclusion (Extra)

As much as possible, ways to take the subset of a primitive type should always be preferred over nominal/structural tag/value-object types.

Examples of taking the subset of primitive types,

- string literals
- number literals (excluding NaN, Infinity, -Infinity)
- boolean literals
- bigint literals
- Even unique symbol is just taking a subset of symbol

Out of the above examples, only boolean is "finite enough". It only has two values. Developers are satisfied with having true and false literals because there's not much else to ask for.

The number type is finite-ish but it has so many values, we might as well consider it infinite. There are also holes in what literals we can specify.

This is why the range number type, and NaN, Infinity, -Infinity issues are so popular, and keep popping up. Being able to specify a small finite set of values, from an infinite set is not good enough.

Specifying a range is one of the most common/natural ideas to come to someone when they need to specify a large finite/infinite subset of an infinite set.

The bigint type is basically infinite, limited only by memory.

It also contributes to the popularity of the range number type issue.

The string type is basically infinite, limited only by memory.

And this is why this string-pattern type/regex-validated string type issue is so popular.

Specifying a regex is one of the most common/natural ideas to come to someone when they need to specify a large finite/infinite subset of an infinite set.

The symbol type... It's also infinite. And also unbounded, pretty much.

But the elements of the symbol type are all pretty much unrelated to each other, in almost every way. And, so, no one has made an issue to ask, "Can I have a way to specify a large finite/infinite subset of symbol?".

To most people, that question doesn't even make sense. There isn't a meaningful way to do this (right?)

However, just being able to declare subsets of primitives isn't very useful. We also need,

- Literals of the right type must be assignable without further work

  Thankfully, TS is sane enough to allow this.

  Imagine being unable to pass false to (arg : false) => void!

- Builtin ways of narrowing

  At the moment, for these literals, we have == & === as builtin ways of narrowing.

  Imagine needing to write a new type guard for each literal!
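Both criteria can be seen with ordinary string literal types today. A small sketch, where `Direction`, `opposite`, and `parseDirection` are illustrative names:

```ts
type Direction = "north" | "south";

function opposite(d: Direction): Direction {
    return d === "north" ? "south" : "north";
}

function parseDirection(input: string): Direction | undefined {
    if (input === "north" || input === "south") {
        return input; // narrowed from string to the literal union by ===
    }
    return undefined;
}

const flipped = opposite("north"); // the literal is accepted directly, no cast or guard
```

No custom type guard and no `as` is needed at any point, which is the ergonomic bar the other approaches are being compared against.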

The problem with nominal/structural tag/value-object types is that they basically fail to fulfill the above two criteria. They turn primitive types into clunky types that aren't quite object types, but must be handled like object types, anyway.

AnyhowStep commented 4 years ago

Ergonomics

Okay, here's more elaboration on string-pattern vs nominal vs structural tag types.

These arguments apply to https://github.com/microsoft/TypeScript/issues/15480 as well.

Cross-Library Compatibility

Nominal types are the worst at cross-library compatibility. It's like using unique symbol in two libraries and trying to get them to interoperate. It simply cannot be done. You need to use a boilerplate type guard, or the trust-me-operator (as).

You'll need more boilerplate for an assertion guard, too.

If the type does not require cross-library compatibility, then using nominal types is fine... Even if very unergonomic (see above example).

For structural types, if library A has,

//Lowercase 'S'
type StripeChargeId = string & tag { stripeChargeId : void };

And library B has,

//Uppercase 'S'
type StripeChargeId = string & tag { StripeChargeId : void };

//Or
type StripeChargeId = string & tag { isStripeChargeId : true };

//Or
type StripeChargeId = string & tag { stripe_charge_id : void };

Then you'll need a boilerplate type guard, or the trust-me-operator (as).

You'll need more boilerplate for an assertion guard, too.

If the type does not require cross-library compatibility, then using structural types is fine... Even if very unergonomic (see above example).
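As a rough sketch of that mismatch in today's TypeScript: the `tag` syntax above is proposed, not real, so the tags here are emulated with plain intersections. The functions createChargeA, refundChargeB and isChargeIdB are hypothetical stand-ins for each library's API and for the consumer's glue code.

```typescript
// Library A's tag (lowercase 's'), emulated with a plain intersection.
type StripeChargeIdA = string & { stripeChargeId: void };
// Library B's tag (uppercase 'S').
type StripeChargeIdB = string & { StripeChargeId: void };

// Hypothetical functions standing in for each library's API.
function createChargeA(): StripeChargeIdA {
    return "ch_123" as StripeChargeIdA;
}
function refundChargeB(id: StripeChargeIdB): string {
    return `refunded ${id}`;
}

const idA = createChargeA();
// refundChargeB(idA); // Error: property 'StripeChargeId' is missing

// Option 1: a boilerplate type guard written by the consumer.
function isChargeIdB(s: string): s is StripeChargeIdB {
    return /^ch_.+/.test(s);
}
const viaGuard = isChargeIdB(idA) ? refundChargeB(idA) : undefined;

// Option 2: the trust-me-operator.
const viaCast = refundChargeB(idA as string as StripeChargeIdB);
```

Both options work, but neither is checked against what library B actually considers a valid id.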

For string-pattern types, if library A has,

type stripeChargeIdRegex = /^ch_.+/;
type StripeChargeId = PatternOf<typeof stripeChargeIdRegex>;

And library B has,

//Extra dollar sign at the end
type stripeChargeIdRegex = /^ch_.+$/;
type StripeChargeId = PatternOf<typeof stripeChargeIdRegex>;

//Or,
type stripeChargeIdRegex = /^ch_[a-zA-Z0-9]+$/;
type StripeChargeId = PatternOf<typeof stripeChargeIdRegex>;

//Or,
type stripeChargeIdRegex = /^ch_[A-Za-z0-9]+$/;
type StripeChargeId = PatternOf<typeof stripeChargeIdRegex>;

Assume both libraries always produce strings for StripeChargeId that will satisfy the requirements of both libraries. Library A is just "lazier" with its validation. And library B is "stricter" with its validation.

Then, it's kind of annoying. But not too bad. Because you can just use libraryB.stripeChargeIdRegex.test(libraryA_stripeChargeId) as the typeguard. No need to use the trust-me-operator (as).

You'll still need boilerplate for assertion guards, though.
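Here is a sketch of what that looks like in practice. It assumes each library exports its regex as a run-time value, and it emulates the pattern type with a brand, since PatternOf does not exist today. The libraryA/libraryB objects, the brand name, and the `+` quantifier in library B's regex are all assumptions for illustration.

```typescript
// Stand-ins for the two libraries; names and regexes are assumptions.
const libraryA = { stripeChargeIdRegex: /^ch_.+/ };
const libraryB = { stripeChargeIdRegex: /^ch_[a-zA-Z0-9]+$/ };

// Today the pattern type has to be emulated with a brand.
type StripeChargeIdB = string & { readonly __libraryB_chargeId: void };

// Type guard: just delegates to library B's own regex.
function isChargeIdB(s: string): s is StripeChargeIdB {
    return libraryB.stripeChargeIdRegex.test(s);
}

// Assertion guard: the extra boilerplate mentioned above.
function assertChargeIdB(s: string): asserts s is StripeChargeIdB {
    if (!isChargeIdB(s)) {
        throw new Error(`Not a library B charge id: ${s}`);
    }
}

function refund(id: StripeChargeIdB): string {
    return `refunded ${id}`;
}

const fromA: string = "ch_1A2b3C";
const acceptedByA = libraryA.stripeChargeIdRegex.test(fromA);
assertChargeIdB(fromA);
const receipt = refund(fromA); // OK after the assertion, no `as` needed
```

The difference from the tag-based version is that the check is library B's real validation rather than something the consumer made up.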

If the type does not require cross-library compatibility, then using string-pattern types is perfect, and also very ergonomic.

If you need cross-library compatibility, string-pattern types are still better than structural tag types! Hear me out.

If the domain being modeled is well-understood, then it is very likely that multiple, isolated library authors will end up writing the same regex. With structural tag types, they could all just write whatever properties and types they want in the tags.

If there's a standard specifying string formats for whatever is being modeled, then it is basically guaranteed that all library authors will write the same regex! If they write a different regex, they're not really following the standard. Do you want to use their library? With structural tag types, they could still all just write whatever. (Unless someone starts a structural tag type standard that everyone will care about? Lol)

Cross-Version Compatibility

As usual, nominal types are the worst at cross-version compatibility. Oh, you bumped your library a patch, or minor version? The type declaration is still the same? The code is still the same? Nope. They're different types.

Structural tag types are still assignable, across versions (even major versions), as long as the tag type is structurally the same.

String-pattern types are still assignable, across versions (even major versions), as long as the regex is the same.

Or we could just run a PSPACE-complete algorithm to determine if the regexes are the same? We can also determine which subclasses of regexes are the most common and run optimized equivalence algorithms for those... But that sounds like a lot of effort.

Regex subtype checks would be cool to have, and would definitely make using string-pattern types more ergonomic. Just like how range subtype checks would benefit the number range type proposal.

[Edit] In this comment, https://github.com/microsoft/TypeScript/issues/6579#issuecomment-243338433

Someone linked to, https://bora.uib.no/handle/1956/3956

Titled, "The Inclusion Problem for Regular Expressions" [/Edit]
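A real inclusion check would work on automata, as in the linked thesis. Purely as an illustration of what "regex A is a subtype of regex B" means, the sketch below approximates it by brute force over a tiny alphabet and a bounded length. The function names, the alphabet and the bound are all made up, and a clean result proves nothing beyond the bound.

```typescript
// Build every string over `alphabet` with length <= maxLen.
function allStrings(alphabet: readonly string[], maxLen: number): string[] {
    const result: string[] = [""];
    let level: string[] = [""];
    for (let len = 1; len <= maxLen; len++) {
        const next: string[] = [];
        for (const prefix of level) {
            for (const ch of alphabet) {
                next.push(prefix + ch);
            }
        }
        result.push(...next);
        level = next;
    }
    return result;
}

// Returns a counterexample (matched by `a` but not by `b`), or undefined
// if none exists within the bound. This is NOT a decision procedure.
function boundedSubset(
    a: RegExp,
    b: RegExp,
    alphabet: readonly string[],
    maxLen: number
): string | undefined {
    for (const s of allStrings(alphabet, maxLen)) {
        if (a.test(s) && !b.test(s)) {
            return s;
        }
    }
    return undefined;
}

const lazy = /^ch_.+/;
const strict = /^ch_[a-zA-Z0-9]+$/;
const alphabet = ["c", "h", "_", "a", "-"];

// Within the bound, every "strict" id is also a "lazy" id...
const strictInLazy = boundedSubset(strict, lazy, alphabet, 5);
// ...but not the other way around.
const lazyInStrict = boundedSubset(lazy, strict, alphabet, 5);
```

With a real subtype check, a value typed with the strict pattern could be passed where the lazy pattern is expected, with no guard at all.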

Boilerplate

TODO (But we can see that string-pattern types have the least amount of boilerplate)

Literal Invocation

TODO (But we can see that string-pattern types support literal invocation the best)

Non-Literal Invocation

TODO (But we can see that string-pattern types support non-literal invocation the best)

AnyhowStep commented 4 years ago

More regarding https://github.com/microsoft/TypeScript/issues/6579#issuecomment-542405537

TypeScript can't error on instantiations of intersections, so this wouldn't be part of any final design.

I don't know why people wanted to ban intersections, but you're absolutely right that banning it doesn't make sense.

thus breaking every nonliteral invocation?

Well, not every non-literal invocation.

declare function foo (arg : PatternOf</a+/>) : void;
function bar (arg : PatternOf</a+/>) : void {
  //non-literal and does not break.
  foo(arg);
}
bar("aa"); //OK
bar("bb"); //Error
bar("" as string); //Error, I know this is what you meant by non-literal invocation

function baz (arg : "car"|"bar"|"tar") : void {
  bar(arg); //OK
}

Breaking on a non-literal invocation, where it cannot prove that it matches the regex, isn't necessarily a bad thing. It's just a type-safety thing.

That's kind of like saying that string literals are bad because now non-literal invocations fail. String-pattern types/regex-validated string types just let you define unions of an infinite number of string literals.

any nonliteral use would require a re-testing or assertion:

I don't see that as an issue at all. It's the same with nominal/tagged types right now. Or trying to pass a string to a function expecting string literals. Or trying to pass a wider type to a narrower type.

In this particular case, you've shown that const ZipCode = /^\d\d\d\d\d$/; and ZipCode.test(s) can act as a type guard. This will certainly help with the ergonomics.

The problem being solved has no better alternative (including plausible alternatives which aren't yet in the language)

Well, hopefully I've shown that nominal/structural tag types are not the better alternative. They're actually pretty bad.

The problem occurs with meaningful frequency in real codebases

Uhh... Let me get back to you on that one...

The proposed solution solves that problem well

The proposed string-pattern type seems to be pretty good.

TODO: Please help us by identifying real library functions that could benefit from RegExp types, and the actual expression you'd use.

Your view is that nominal/tagged types are good enough for non-literal use. So, any use case brought up that shows non-literal usage is not good enough, because nominal/tagged types cover it.

However, we've seen that, even for non-literal use,

Nominal/structural tag types suffer from cross-library/version compatibility issues
Amount of boilerplate for nominal/structural tag types is significantly more than boilerplate for string-pattern types

Also, it seems that the literal use cases brought up have been unsatisfactory to you, because they try and do silly things like email validation, or use regexes that aren't accurate enough.

Writing tests - this is where hardcoded inputs make some sense, though this is almost a counterpoint because your test code should probably be providing lots of invalid inputs

A good use case brought up was writing run-time tests. And you are right, that they should be throwing a lot of invalid inputs at it for run-time tests, too.

But that's no reason to not support string-pattern types. It might be the case that they want to test valid inputs in a certain file and accidentally give invalid input.

But, because they have to use a type guard or trust-me-operator (as) or value object, now they'll get a run-time error, instead of knowing that the test will fail ahead of time.

Using the trust-me-operator (as) for run-time tests should only be reserved for testing invalid inputs. When wanting to test valid inputs, it is more clear to not need hacks to assign literals to a nominal/structural tag type.

If they ever change the regex in future, it would be nice if their tests now fail to even run, because of assignability issues. If they just use as everywhere in their tests, they won't know until they run the tests.

And if the library author just uses as everywhere when dogfooding their own library... What of downstream consumers? Won't they also be tempted to use as everywhere and run into run-time problems when upgrading to a new version?

With string-pattern types, there's less reason to use as everywhere and both library author and downstream consumers will know of breaking changes more easily.

(Kind of long winded but I hope some of my points got through).

Also, I write a lot of compile-time tests (And I know the TS team does so, too).

It would be nice if I can test that a certain string literal will fail/pass a regex check in my compile-time tests. At the moment, I can't have compile-time tests for these things and need to use a run-time test, instead.

And if it fails/passes my compile-time tests, then I'll have confidence that downstream consumers can use those string literals (or similar ones) and expect them to behave the right way.
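For anyone unfamiliar with compile-time tests, this is roughly the shape I mean. The helper types are hypothetical, and a literal union stands in for the pattern type, because a regex-backed type can't be written today.

```typescript
// Compile-time test helpers (hypothetical names).
type IsAssignable<From, To> = [From] extends [To] ? true : false;
type Expect<T extends true> = T;
type ExpectNot<T extends false> = T;

// Stand-in for a pattern type; with the proposal this might be
// something like PatternOf</^(normal|italic)$/>.
type FontStyle = "normal" | "italic";

// Each alias fails to compile if the expectation is wrong, so the
// "test" runs whenever the project is type-checked.
type ItalicPasses = Expect<IsAssignable<"italic", FontStyle>>;
type BoldFails = ExpectNot<IsAssignable<"bold", FontStyle>>;
type StringFails = ExpectNot<IsAssignable<string, FontStyle>>;

// Referencing the aliases in a value keeps them from being reported as unused.
const checks: [ItalicPasses, BoldFails, StringFails] = [true, false, false];
```

If the pattern changes so that "bold" becomes valid, the BoldFails line stops compiling, which is exactly the early warning described above.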

It seems like this quickly puts us on the road to a Tower of Babel situation...

This is even more true of using nominal/structural tag types, actually. As the above examples have shown, they do terribly for cross-library/version compatibility...

However, regexes/string-pattern types have a decent chance at not falling into that problem (hopefully, thanks to standardization, and sane library authors).

EDIT

A common thing in the thread is that these regexes would help validate test code, because even though in production scenarios the code would be running against runtime-provided strings rather than hardcoded literals, you'd still want some validation that your test strings were "correct". This would seem to be an argument for nominal/tagged/branded strings instead, though, since you'd be writing the validation function either way, and the benefit of tests is that you know they run exhaustively (thus any errors in test inputs would be flagged early in the development cycle).

Ah... I should have read everything first before writing this...

Anyway, I do have some examples with me, where string-pattern types are useful.

HTTP Route Declaration Library

With this library, you can build HTTP route declaration objects. This declaration is used by both client and server.

/*snip*/
createTestCard : f.route()
    .append("/platform")
    .appendParam(s.platform.platformId, /\d+/)
    .append("/stripe")
    .append("/test-card")
/*snip*/

These are the constraints for .append(),

String literals only (Can't enforce this at the moment, but if you use non-literals, the route declaration builder becomes garbage)
Must start with leading forward slash (/)
Must not end with trailing forward slash (/)
Must not contain colon character (:); it is reserved for parameters
Must not contain two, or more, forward slashes consecutively (//)

Right now, I only have run-time checks for these, that throw errors. I would like downstream consumers to have to follow these constraints without needing to read some Github README or JSDoc comment. Just write the path and see red squiggly lines.
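The run-time checks could look something like the sketch below. This is not the library's actual code; the function name and the error messages are invented, and only the four string constraints are covered (the "literals only" rule can't be checked at run time).

```typescript
// Hypothetical run-time validator for the .append() constraints listed above.
function assertValidAppendPath(path: string): void {
    if (!/^\//.test(path)) {
        throw new Error(`Path must start with "/": ${path}`);
    }
    if (/\/$/.test(path)) {
        throw new Error(`Path must not end with "/": ${path}`);
    }
    if (/:/.test(path)) {
        throw new Error(`Path must not contain ":": ${path}`);
    }
    if (/\/\//.test(path)) {
        throw new Error(`Path must not contain "//": ${path}`);
    }
}

// Helper used only to demonstrate the behaviour.
function isValidAppendPath(path: string): boolean {
    try {
        assertValidAppendPath(path);
        return true;
    } catch {
        return false;
    }
}
```

Every one of these checks is a plain regex test on a value that is almost always a hardcoded literal, which is why moving them to compile time is attractive.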

Other stuff

I also have regexes for hexadecimal strings, alphanumeric strings.

I also have this,

const floatingPointRegex = /^([-+])?([0-9]*\.?[0-9]+)([eE]([-+])?([0-9]+))?$/;

I see this,

Integer - incorrectly rejects "3e5"

I also have this, which isn't an integer regex but uses the floatingPointRegex,

function parseFloatingPointString (str : string) {
    const m = floatingPointRegex.exec(str);
    if (m == undefined) {
        return undefined;
    }
    const rawCoefficientSign : string|undefined = m[1];
    const rawCoefficientValue : string = m[2];
    const rawExponentSign : string|undefined = m[4];
    const rawExponentValue : string|undefined = m[5];

    const decimalPlaceIndex = rawCoefficientValue.indexOf(".");
    const fractionalLength = (decimalPlaceIndex < 0) ?
        0 :
        rawCoefficientValue.length - decimalPlaceIndex - 1;

    const exponentValue = (rawExponentValue == undefined) ?
        0 :
        parseInt(rawExponentValue) * ((rawExponentSign === "-") ? -1 : 1);

    const normalizedFractionalLength = (fractionalLength - exponentValue);
    const isInteger = (normalizedFractionalLength <= 0) ?
        true :
        /^0+$/.test(rawCoefficientValue.substring(
            rawCoefficientValue.length-normalizedFractionalLength,
            rawCoefficientValue.length
        ));
    const isNeg = (rawCoefficientSign === "-");

    return {
        isInteger,
        isNeg,
    };
}

I also have this comment, though,


/**
    Just because a string is in integer format does not mean
    it is a finite number.

    ```ts
    const nines_80 = "99999999999999999999999999999999999999999999999999999999999999999999999999999999";
    const nines_320 = nines_80.repeat(4);
    //This will pass, 320 nines in a row is a valid integer format
    integerFormatString()("", nines_320);
    //Infinity
    parseFloat(nines_320);
    ```
*/

RegExp Constructor

Funnily enough, the RegExp constructor will benefit from regex-validated string types!

Right now, it is,

new(pattern: string, flags?: string): RegExp

However, we could have,

new(pattern: string, flags?: PatternOf</^[gimsuy]*$/>): RegExp
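For comparison, here is a sketch of the same check done at run time today. The helper names are made up. It also shows that the pattern is only a superset check: "gg" passes the regex, but the constructor still rejects duplicate flags.

```typescript
// Same pattern as in the proposed signature above.
const flagsRegex = /^[gimsuy]*$/;

// Hypothetical helper: what the pattern type would reject at compile time.
function looksLikeFlags(flags: string): boolean {
    return flagsRegex.test(flags);
}

// Hypothetical wrapper that never throws.
function safeRegExp(pattern: string, flags: string): RegExp | undefined {
    if (!looksLikeFlags(flags)) {
        return undefined; // e.g. "x" is rejected before reaching the constructor
    }
    try {
        return new RegExp(pattern, flags);
    } catch {
        // Duplicate flags such as "gg" pass the pattern but are still
        // rejected by the constructor with a SyntaxError.
        return undefined;
    }
}
```

So the type would catch typos like "x" early, while leaving the remaining cases to the constructor.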

AnyhowStep commented 4 years ago

TL;DR (Please read it, though, I put in a lot of effort into this :cry: )

String-pattern types are more ergonomic than nominal/structural tag types
- Less boilerplate
String-pattern types are less likely than nominal/structural tag types to become a Tower of Babel situation
- Especially with regex subtype checks
String-pattern types are the most natural way of defining large finite/infinite subsets of the string type
- Introducing this feature might even make people think about valid string formats for their libraries more closely!
String-pattern types enable stronger compile-time safety for some libraries (Let me get back to you on prevalence... runs away)
- RegExp constructor, hex/alphanumeric strings, route path declarations, string identifiers for databases, etc.

Why are your regexes so bad?

A bunch of the use cases brought up by others wanted to introduce string-pattern types to fit existing libraries; and it doesn't seem to be convincing the TS team.

Oftentimes, I feel like these existing libraries don't even use regexes that much to validate their input. Or, they use a regex to perform a simple validation. Then, they use a more complicated parser to perform the actual validation.

But this is an actual valid use case for string-pattern types!

String-pattern types to validate supersets of valid string values

Sure, a string that starts with /, does not end with /, does not contain consecutive /, and does not contain : will pass the "HTTP path regex". But this just means that the set of values that pass this regex is a superset of valid HTTP paths.

Further down, we have an actual URL path parser that checks that ? is not used, # is not used, some characters are escaped, etc.

But with this simple string-pattern type, we've already eliminated a large class of the common problems that a user of the library may encounter! And we eliminated it during compile-time, too!

It's not often that a user will use ? in their HTTP paths, because most are experienced enough to know that ? is the start of a query string.

I just realized you already know of this use case.

This thread implies a wide variety of use cases; concrete examples have been more rare. Troublingly, many of these examples don't seem to be complete - they use a RegExp that would reject valid inputs.

So, sure, a lot of the regexes proposed aren't "complete". But as long as they don't reject valid input, it should be okay, right?

It's okay if they allow invalid input, right? Since we could have a "real" parser during run-time handle the full validation. And a compile-time check can eliminate a lot of common problems for downstream users, increasing productivity.

Those examples that reject valid input should be easy enough to modify, so that they don't reject valid input, but allow invalid input.
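A sketch of that two-stage idea, with a coarse regex in front of a stricter parser. The combined regex, parsePath and validate are all hypothetical; parsePath is only a stand-in for a "real" parser.

```typescript
// Stage 1: cheap pattern. Accepts every valid path and some invalid ones.
// (Starts with "/", no "//", no ":", does not end with "/".)
const coarsePath = /^(?!.*\/\/)\/[^:]*[^:/]$/;

// Stage 2: stand-in for the "real" parser with stricter rules.
function parsePath(path: string): string[] {
    if (/[?#]/.test(path)) {
        throw new Error(`Path must not contain "?" or "#": ${path}`);
    }
    return path.split("/").slice(1);
}

function validate(path: string): string[] | undefined {
    if (!coarsePath.test(path)) {
        return undefined; // what a pattern type could reject at compile time
    }
    try {
        return parsePath(path);
    } catch {
        return undefined; // slipped through stage 1, caught by stage 2
    }
}
```

Stage 1 is the part a string-pattern type could take over; stage 2 stays a run-time concern either way.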

String-pattern types and intersections

Anyway, intersection types on string-pattern types would be super duper useful!

My .append() example could be written as,

append (str : (
  //Must start with forward slash
  & PatternOf</^\//>
  //Must not end with forward slash
  & PatternOf</[^/]$/>
  //Must not have consecutive forward slashes anywhere
  & not PatternOf</\/\//>
  //Must not contain colon
  & PatternOf</^[^:]+$/>
)) : SomeReturnType;

The not PatternOf</\/\//> could also be, PatternOf</^((([/])(?!\3))|[^/])+$/> but this is so much more complicated
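As a quick sanity check that the two formulations agree, the sketch below runs both regexes over a few sample strings. The sample list and variable names are made up.

```typescript
// "Contains two consecutive slashes" (to be negated).
const hasDoubleSlash = /\/\//;
// The lookahead/backreference version that avoids the `not`.
const noDoubleSlash = /^((([/])(?!\3))|[^/])+$/;

const samples = ["/a", "/a/b", "/a//b", "//", "/a/b/", "a///b", "abc"];

// Collect every sample where the two formulations disagree.
const disagreements = samples.filter(
    (s) => !hasDoubleSlash.test(s) !== noDoubleSlash.test(s)
);
```

One place they do differ is the empty string: the `not` version accepts it, while the second regex requires at least one character because of the `+`. That doesn't matter here, since the intersection already requires a leading forward slash.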

denis-sokolov commented 4 years ago

Thank you, @AnyhowStep, for the extensive demonstrations. I wanted to criticize you for making me read so much, but it turned out to be very helpful!

I often struggle with typing my internal apis full of string parameters, and I inevitably end up with a lot of conditionals that throw at run-time. Inevitably, my consumers need to duplicate these pattern checks, since they don’t want an exception, they want a special way to handle the failure.

// Today
function createServer(id: string, comment: string) {
  if (!id.match(/^[a-z]+-[0-9]+$/)) throw new Error("Server id does not match the format");
  // work
}

// Nicer
function createServer(id: PatternOf</^[a-z]+-[0-9]+$/>, comment: string) {
  // work immediately
}

In the world of strings and patterns, a generic string is pretty much the same as unknown, removing a lot of type safety in favor of runtime checks, and causing inconvenience for my consuming developers.
