microsoft / TypeScript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
https://www.typescriptlang.org
Apache License 2.0

Regex-validated string types (feedback reset) #41160

Open RyanCavanaugh opened 3 years ago

RyanCavanaugh commented 3 years ago

This is a pickup of #6579. With the addition of #40336, a large number of those use cases have been addressed, but possibly some still remain.

Update 2023-04-11: Reviewed use cases and posted a write-up of our current evaluation

Search Terms

regex string types

Suggestion

Open question: For people who had upvoted #6579, what use cases still need addressing?

Note: Please keep discussion on-topic; moderation will be a bit heavier to avoid off-topic tangents

Examples

(please help)

Checklist

My suggestion meets these guidelines:

SadiePi commented 1 year ago

I'm attempting to strictly type algebraic notation for chess. Here's how I attempted to type a move:

export type File = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h"
export type Rank = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
export type Square = `${File}${Rank}`
export type ShortPieceType = "P" | "N" | "B" | "R" | "Q" | "K"
type AMPiece = Exclude<ShortPieceType, "P">
type AMOrigin = `${File | ""}${Rank | ""}`
type AMDestination = `${"x" | ""}${Square}`
type AMPromotion = `=${Exclude<ShortPieceType, "P" | "K">}` | ""
type AMCheck = "+" | "#" | ""
export type AlgebraicMove = `${AMPiece}${AMOrigin}${AMDestination}${AMPromotion}${AMCheck}`

(This might have mistakes, I don't particularly care, this is just a use case)

Unsurprisingly, AlgebraicMove ends up too complex, and this doesn't even cover castling. A regex type would solve this problem for me. Of course, I could simply use a type guard that checks against a regex anyway, but IMO that defeats the purpose of strict typing because I could just forget to use it somewhere and then I have a plain string masquerading as an AlgebraicMove, exactly the type of bug that strict typing is supposed to prevent.

leonadler commented 1 year ago

@SadiePi while this is a hack, and it would be nicer to have proper TS support, for the time being you could use a typed string:

export type AlgebraicChessMove = string & { __kind__: 'AlgebraicChessMove' };

// const patternWithoutGroups = /^([NBRQK])(([a-h]?)([1-8]?))x?(([a-h])([1-8]))([NBRQ]?)([+#]?)$/;
const pattern =
  /^(?<piece>[NBRQK])(?<origin>(?<originFile>[a-h]?)(?<originRank>[1-8]?))x?(?<destination>(?<destinationFile>[a-h])(?<destinationRank>[1-8]))(?<promotion>[NBRQ]?)(?<check>[+#]?)$/;

export function isAlgebraicChessMove(v: unknown): v is AlgebraicChessMove {
  return typeof v === 'string' && pattern.test(v);
}

export function assertAlgebraicChessMove(
  v: unknown
): asserts v is AlgebraicChessMove {
  if (!isAlgebraicChessMove(v)) {
    throw new Error(`Not in algebraic chess move notation: ${v}`);
  }
}

export function asAlgebraicChessMove(v: string): AlgebraicChessMove {
  assertAlgebraicChessMove(v);
  return v;
}

export function saveMoveToDatabase(move: AlgebraicChessMove) {
  // You would store the move to the database, or whatever
}

// Now this is at least a bit safeguarded, although TypeScript cannot validate the string contents at compile time
const validMove = asAlgebraicChessMove('Kc3d4#');
const maybeInvalidMove = 'Ra1a5+';

saveMoveToDatabase(validMove);
// Compile error: a plain string literal is not assignable to AlgebraicChessMove
saveMoveToDatabase(maybeInvalidMove);

// After the assertion, maybeInvalidMove is narrowed and accepted as an AlgebraicChessMove
assertAlgebraicChessMove(maybeInvalidMove);
saveMoveToDatabase(maybeInvalidMove);

To have compile-time string safety you could "help" the compiler break down the complexity by using an infer-chained conditional type (the main problem in your example is multiple 'something' | '' chained together):

type IsAlgebraicChessMove<T extends string> =
  // match "${AMPiece}"
  T extends `${'N' | 'B' | 'R' | 'Q' | 'K'}${infer R1}`

    // match "${AMOrigin}${AMDestination}", but only WITH optional file & rank
    ? R1 extends `${File}${Rank}${'x' | ''}${File}${Rank}${infer R2}`
      // match "${AMPromotion}${AMCheck}"
      ? ('' extends R2 ? true : R2 extends `${'N'|'B'|'R'|'Q'|''}${'+'|'#'|''}` ? true : false)

    // match "${AMOrigin}${AMDestination}" only WITHOUT optional file & rank
    : (R1 extends `${'x' | ''}${File}${Rank}${infer R2}`
      // match "${AMPromotion}${AMCheck}"
      ? ('' extends R2 ? true : R2 extends `${'N'|'B'|'R'|'Q'|''}${'+'|'#'|''}` ? true : false)

      : false)
    : false;

function isInAlgebraicChessNotation<T extends string>(str: T): IsAlgebraicChessMove<T> {
  return pattern.test(str) as IsAlgebraicChessMove<T>;
}

This seems to work from my testing: image

zepumph commented 1 year ago

Reiterating on https://github.com/microsoft/TypeScript/issues/41160#issuecomment-831846373 today. My coworkers and I at PhET Interactive Simulations encountered this today while trying to use Template Literal Types to type check on keyboard support we are adding to our product. This would be much easier to accomplish with regex. While the immediate solution seems to be to allow a larger number of string literal unions in each template, that feels more like a workaround to me, thus a post to this issue!


type AllowedKeys = 'q' | 'w' | 'e' | 'r' | 't' | 'y' | 'u' | 'i' | 'o' | 'p' | 'a' | 's' | 'd' |
  'f' | 'g' | 'h' | 'j' | 'k' | 'l' | 'z' | 'x' | 'c' |
  'v' | 'b' | 'n' | 'm' | 'ctrl' | 'alt' | 'shift' | 'tab' | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' |
  'space' | 'arrowLeft' | 'arrowRight' | 'arrowUp' | 'arrowDown' | 'enter' | 'equals' | 'plus' | 'minus' | 'escape' |
  'delete' | 'backspace' | 'page_up' | 'page_down' | 'end' | 'home';

type OneKeyStroke = `${AllowedKeys}` |
  `${AllowedKeys}+${AllowedKeys}` |
  `${AllowedKeys}+${AllowedKeys}+${AllowedKeys}`; // <-- on this line: TS2590: Expression produces a union type that is too complex to represent.

Shdorsh commented 1 year ago

I would love to have something like that, especially for typing props with Vue so some people don't forget the unit behind sizes. Since regexps are seen as variables, one could have a keyword to prefix strings/regexes and try to apply them to the current variable, something like this:

Exporting that saved regex, which is probably reused with vee-validate:

export const size = /^\d+((pt)|(pn)|(pc)|%|(em)|(rem)|(vh))$/

Typing a prop of a component:

const props = defineProps<{ height: typedregex size; }>();

zm-cttae commented 1 year ago

Picking this up again: the reason why A. matchof keyword or B. Validated intrinsic was proposed in addition to https://github.com/microsoft/TypeScript/issues/6579#issuecomment-261519733 is because types and values must not intermix (see https://github.com/microsoft/TypeScript/issues/41160#issuecomment-997197109). This enables types to be compiled out.

String literal types are somewhat intuitive but regex literals as string matcher types seem less so.

Regex superclasses were mentioned above as a blocker to original #6579 proposal, but now we also have #38671 targeting regex literals. Possibly relevant goals that conflict with https://github.com/microsoft/TypeScript/issues/6579#issuecomment-261519733 :

Admittedly the number of proposals might be a symptom of bikeshedding here.

Moto42 commented 1 year ago

I just wanted to post in my support for this feature.

One of the first things I tried when I learned TypeScript was, "I wonder if...."

var mustInclude5: /5/ = "I am 5 today!";

I was genuinely surprised that...

var thisMadness: `${is}${how}${it}${is}${done!}`;

zm-cttae commented 1 year ago

Again that would allow this (by definition per TypeScript rules):

var mustInclude5: /5/ = /5/;

Alternatives include:

var mustInclude5: matchof /5/ = '159';
var mustInclude5: Validated<'5'> = '159';

ljharb commented 1 year ago

Another alternative would be string</5/> which might also allow number</1/> etc.

zm-cttae commented 1 year ago

Soo.. type casts? Interesting. Definitely an increase in language complexity compared to the other two though! But we can do other interesting things with this.. such as getting all the JS string representations of a number. </tangent>

ljharb commented 1 year ago

not a typecast - string</5/> would be "a string that matches the regex /5/", and number</1/> would be "a number that, when stringified, matches the regex /1/".

zm-cttae commented 1 year ago

It would work and make sense no doubt. That seems to be a significant change for primitive types admittedly. Just might be easier in compiler to pick up matchof keyword or Type intrinsic during tokenization.

I want this badly enough that I'm okay with any of the three syntaxes! .. this issue has had plenty of feedback, so it could do with a community PR or feedforward from @RyanCavanaugh

RyanCavanaugh commented 1 year ago

We went through another round of discussion on this as part of a review of highly-upvoted issues.

Just to start with a meta-point: Use cases are so much more valuable to us in terms of feature prioritization than general "yes I want this please" comments. If we're evaluating a use case, we can talk about whether or not other features possibly on the table (either more general or more targeted) can be used instead. If we're just looking at "yes please", we can't make any generalizations about broader scenarios or guess as to what the bounds of the feature on either end need to be.

For example, we have questions where we need to know the use cases to understand things like:

All of these answers influence the complexity and desirability of the feature and are just incredibly important. So, again, use cases please!

Anyway, back to the topic at hand

A key distinction to think about is static data (string literals which appear in your program's source code) vs dynamic data (strings of unknown exact content given to you from either a computation or an external data source).

We think the strongest argument in favor of the feature is as it relates to static data. A static string can obviously be checked against a regex; this is trivial. A basic example might be something like

// For parsing strings containing numbers
declare function parseInt2(s: /^\d+$/): number;

This is easy to reason about; parseInt2("123") is valid, parseInt2("abc") is not.

Dynamic data is more difficult to reason about. How do we call parseInt2 with dynamic data?

function foo(s: string) {
    // Not OK, obviously
    return parseInt2(s);
}
function foo(s: string) {
    // Maybe what you have to write?
    if (/^\d+$/.exec(s)) {
        return parseInt2(s);
    }
}

But this solution isn't great, because we had to write the regular expression twice: Once in type space, and once in value space. Maybe you could imagine an operator to produce a regex type from a regex value:

const digitsOnly = /^\d+$/;
declare function parseInt2(s: Regex<typeof digitsOnly>): number;
function foo(s: string) {
    if (digitsOnly.exec(s)) {
        return parseInt2(s);
    }
}

This is... fine? But it's very, very close to just nominal typing, and it'd be a real anti-pattern to have regex types just be a weird way that you write nominal types that are subtypes of string; the correct feature to implement to handle some of these dynamic cases probably is nominal or nominal-like typing (indeed this is how our own code handles things like "a path that has been normalized", which is technically something that you could use a regex for, but not a great fit).
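For readers who want to see the nominal-like alternative concretely, here is a minimal sketch using a branded string. It is not the TypeScript team's own code; the names DigitString, isDigitString and parseDigits are hypothetical, and the brand property exists only at the type level.

```typescript
// Hypothetical names throughout: DigitString, isDigitString, parseDigits.
// The brand exists only in the type system; at runtime the value is a plain string.
type DigitString = string & { readonly __brand: 'DigitString' };

const digitsOnly = /^\d+$/;

// The regex is written once, in value space; the guard carries the result into type space.
function isDigitString(s: string): s is DigitString {
  return digitsOnly.test(s);
}

function parseDigits(s: DigitString): number {
  return Number.parseInt(s, 10);
}

function foo(s: string): number | undefined {
  // Calling parseDigits(s) here would be a compile error: string is not a DigitString.
  if (isDigitString(s)) {
    return parseDigits(s); // s is narrowed to DigitString inside this branch
  }
  return undefined;
}
```

Note that the compiler never looks at the regex here: swapping digitsOnly for any other pattern changes runtime behavior only, which is the sense in which this is nominal typing rather than regex typing.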

And while this logic is fine for data and logic within the confines of your own program, it's not a pattern that's well-manifested in terms of how existing libraries describe their inputs. Most functions with implicit data formats aren't also publishing a canonical regex for their data format. There's also a problem of the implicit subtyping behavior you'd want here -- what if you tested for /^\d\d\d$/ instead of /^\d+$/? Programmers are very particular about what they think the "right" way to write a regex is, so the feature implies either implementing regex subtyping so that the subset behavior can be validated, or enduring endless flamewars in places like DT as people argue about which regex is the correct one for a given problem.

Moving on to thinking about the use cases, looking through the issue, I'm seeing these:

A common characteristic here is that these are barely finite, in the sense that it's technically possible to enumerate all IPv4 or IPv6 addresses, but not in a practical amount of time. This makes sense to see as an observed property, since infinite domains can usually be represented with template string literal types today. Alternatively, some of these are infinite (e.g. paths, which are basically unbounded) but difficult to represent in template string literals since those don't handle repetition very well. An aside observation is that it seems interesting to consider the possibility of a method of writing template string literals which technically could be expanded into a finite union, but stays in unexpanded form by some mechanism or another.

The other common thread is that none of us could make much sense of why most of these kinds of strings would be hardcoded throughout a program in a way that would require a static type checker to be involved. It's extremely understandable that you might want to talk about the abstract notion of a string representing an IP address, but it's less understandable that you'd have more than a tiny handful of actual string literals, e.g. "192.168.0.1", in a program. The distinction here is important since the sort of defining problem of why regex types seem hard to integrate into the language is that it's difficult to soundly apply the sorts of narrowings you would need on dynamic data in ways that are more usefully ergonomic than things you can (and probably should) do already.

Sorting the use cases, it seems like we have:

Re: a correct* regex, here we mean a regular expression which has both zero false negatives and nearly zero false positives. For example, while "JSON object" can be validated to e.g. start and end with {/}, a regular expression to correctly predict whether JSON.parse will throw is not plausible.

Regarding the most promising category, Things that seem likely to appear in code with reasonable frequency with a good and correct regex, it doesn't really seem like template literal strings are that far off from solving this problem either. There are places where e.g. you by convention must or must not start a color with #, or a hex value with 0x. Template string literals do fine at this at the expense of underspecifying the rest of the string, but it's just not obvious that the specific case of writing a string literal "0xABCD" in a language where 0xABCD is valid requires an entire feature built around it. Similarly, there are very few (if any) places in HTML more broadly where only #FF0000 is legal; the presumption that rgba(1, 1, 1, 1) isn't also legal is likely wrong.

So, overall, this currently isn't seeming like a very good fit. We'll continue to re-evaluate the use cases as they appear.

zm-cttae commented 1 year ago

Lots of detail here! Can we push back on this a little? A schtick we probably have left is UUID or SHA hash of specific length. A hash or illegal ID character would cause the program to blow up or put invalid data in a production DB.

arcanis commented 1 year ago

Template string literals do fine at this at the expense of underspecifying the rest of the string,

This approach doesn't scale, as it requires every function accepting those "literal string types" to become generics, which brings a large amount of boilerplate, complexity, edge cases. For instance, I tried a couple of times to do this to type the functions from an fs interface, and I couldn't find a reasonable way to make it work.
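To make the boilerplate concrete, here is a rough sketch of the generic-validation pattern applied to the keystroke example earlier in the thread. Key is deliberately shortened, and IsKeyStroke, KeyStroke and bindKey are hypothetical names, not part of any library.

```typescript
// Hypothetical names: Key, IsKeyStroke, KeyStroke, bindKey. Key is shortened for illustration.
type Key = 'q' | 'w' | 'e' | 'ctrl' | 'alt' | 'shift';

// Walk the literal one '+'-separated segment at a time instead of
// expanding every combination into a union.
type IsKeyStroke<S extends string> =
  S extends `${infer Head}+${infer Tail}`
    ? (Head extends Key ? IsKeyStroke<Tail> : false)
    : (S extends Key ? true : false);

// Resolves to S itself when S is a valid keystroke, and to never otherwise.
type KeyStroke<S extends string> = IsKeyStroke<S> extends true ? S : never;

// Every function that accepts a keystroke has to be generic like this one.
function bindKey<S extends string>(stroke: KeyStroke<S>): string[] {
  return String(stroke).split('+');
}

const parts = bindKey('ctrl+shift+q'); // accepted
// bindKey('ctrl+foo');                // intended to be rejected: 'foo' is not a Key
```

The check only works on literals the compiler can see, and there is no standalone "KeyStroke" type to put on a field or return value, which is where the approach stops scaling.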

hanneswidrig commented 1 year ago

I don't necessarily have a solution but I want to summarize the place where template string literal types become limited. One of the best features TypeScript added several years ago was Variadic Tuple Types.

Before TypeScript 4.0, you had to write something like the code below. As noted in the announcement, this simply does not scale when you need to account for one to many arguments.

// before
function concat(arr1: [], arr2: []): [];
function concat<A>(arr1: [A], arr2: []): [A];
function concat<A, B>(arr1: [A, B], arr2: []): [A, B];
function concat<A, B, C>(arr1: [A, B, C], arr2: []): [A, B, C];
function concat<A, B, C, D>(arr1: [A, B, C, D], arr2: []): [A, B, C, D];
function concat<A, B, C, D, E>(arr1: [A, B, C, D, E], arr2: []): [A, B, C, D, E];

// after
type Arr = readonly any[];

function concat<T extends Arr, U extends Arr>(arr1: T, arr2: U): [...T, ...U] {
    return [...arr1, ...arr2];
}

I face the same problem when using template string literal types,

image

Cross multiplication of this union type is a performance problem if it gets just a little too expansive.

How might a different strategy like regex-validated types offer an improvement here?
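To put rough numbers on the cross-multiplication point, the sketch below counts how many string literals two of the template types posted earlier in this thread expand to. The member counts are taken from those posts; the compiler's cutoff for a union (to my knowledge 100,000 members in recent versions) is an assumption, not something stated in this thread.

```typescript
// Size of `${K}` | `${K}+${K}` | ... for a union K with n members,
// counting every combination of 1 up to maxSlots placeholders.
function expandedUnionSize(n: number, maxSlots: number): number {
  let total = 0;
  for (let slots = 1; slots <= maxSlots; slots++) {
    total += n ** slots;
  }
  return total;
}

// AllowedKeys in the keyboard example has 56 members.
const oneKeyStroke = expandedUnionSize(56, 3);
console.log(oneKeyStroke); // 56 + 3136 + 175616 = 178808

// AlgebraicMove from the chess example: piece (5) x origin (9 * 9)
// x destination (2 * 64) x promotion (5) x check (3).
const algebraicMove = 5 * (9 * 9) * (2 * 64) * 5 * 3;
console.log(algebraicMove); // 777600
```

Both totals are well past that assumed cutoff, while a pattern-based check would only ever need to look at the one string being assigned.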

RyanCavanaugh commented 1 year ago

Can we push back on this a little? A schtick we probably have left is UUID or SHA hash of specific length. A hash or illegal ID character would cause the program to blow up or put invalid data in a production DB.

Again, this is confusing as a use case: How are there a nontrivial number of loose UUIDs or SHAs in your program? I would not want to work in a codebase that was full of nonsymbolic UUIDs or SHAs; it'd be a nightmare to see something like

declare function doLoginFlow(name: string, id: UUID): void;
doLoginFlow("ryan", "123e4567-e89b-12d3-a456-426614174000");

If I had to be dealing with calls like this, I would absolutely want a symbolic name for my UUIDs, and at that point it's trivial to validate them during program initialization

function validateUUIDs<const T>(obj: T) {
  if (Object.values(obj).some(v => !isValidUUID(v))) throw new Error("typo");
  return obj;
}

const myUUIDs = validateUUIDs({
  ryan: "123e4567-e89b-12d3-a456-426614174000"
});

ljharb commented 1 year ago

Without a nominal type there, though, it's still just an object of strings - not an object of UUIDs, which is a subset of that. The use case for me is that "string" or "number" is far too broad - I want to be able to typecheck that a value is of a type that's more granular than TS currently offers.

RyanCavanaugh commented 1 year ago

Right, but presumably you want nominal types for all kinds of nominality, not just the subset of nominality that can be expressed with a regular expression

ljharb commented 1 year ago

Very true :-) but I'd take what I can get.

fabiospampinato commented 1 year ago

Again, this is confusing as a use case: How are there a nontrivial number of loose UUIDs or SHAs in your program? I would not want to work in a codebase that was full of nonsymbolic UUIDs or SHAs;

Maybe I'm missing something, but I see some benefits from regex types that go beyond working with hard-coded strings here and there.

For example, today I might have code like this:

const getById = (id: string) => {...};
const getByTitle = (title: string) => {...};

Which I might then call by mistake like this:

getById(document.title);
getByTitle(document.title);

No type-level errors, the id is a string and the title is a string, it makes sense, but not really -- document.id should really be of type UUID or something.

As I'm understanding it template literal types are nowhere near being able to support this use case today. Like there's just no way to have types like that working today.

Will this use case be eventually covered by something different from regex types, or is this use case not worth covering?

RyanCavanaugh commented 1 year ago

I think the relevant observation there is that different domains of values with the same underlying (usually primitive) storage format often still need to be differentiated. This is nearly orthogonal to regexes: You could have numeric IDs and still not want to mix up CustomerID with OrderID (even though their domains overlap perfectly); you could have angle measurements in degrees or radians; you could have x/y coordinate pairs with the positive Y axis pointing up or down.

The fact that some of those domains can be roughly described with a regular expression is mostly a coincidental observation.

The problem with the code sample isn't a failure to pass a regex (indeed there is no regex); it's a failure to keep the domain values separated. You can tell because if you made a UUID regex type here, it wouldn't matter what regex you wrote! The typecheck behavior would be the same even if you said type UUID = /hello world/;. That's a good indicator that something has gone astray in our mental model here.

If we're going to have nominal types, we should just have nominal types. It'd be a big mistake to implement regular expression types for the sake of solving 15% of the use cases in a way where the regular expression itself is wholly immaterial.

zm-cttae commented 1 year ago

Admittedly any kind of hash or regularised character sequence would be covered by a fixed-length string union.

type Char = string & { length: 1 } & Array<string, 1>;
type Hash = string & { length: 4 } & Array<string, 4>;

fabiospampinato commented 1 year ago

@RyanCavanaugh I see, thanks for the clarification, that made it click for me I think 😅

I guess it's good to not solve 15% of the problem today at the cost of locking the language into the wrong abstraction.

yanndinendal commented 1 year ago

I have a use case in https://github.com/microsoft/TypeScript-DOM-lib-generator/pull/1467#discussion_r1111694766 where a template string is not ideal: match a word up to (or excluding) any space. We could do something with generic functions but not with type literals. I think the use case could be extended to any space- or comma-separated string.

Griffork commented 1 year ago

Ooh! I can help here! I have a program with a "nontrivial number of loose UUIDs or SHAs" written in typescript (assuming loose means variables containing UUIDs and not hardcoded UUIDs).

I have a game engine!

All entities, events and processes have UUIDs, and stuff that is derived from those have string IDs which are a combination of one or more UUIDs.

I have a UUID type specified as so:

class UUID {
   #private: true;
}

A function isUUID that takes any object and returns obj is UUID.

And a function makeUUID that returns a UUID type (though in practice it just returns a string).

What I would really like is a type where no strings are assignable to it (except using the isUUID function) and it's not assignable to strings without being cast, it's really annoying using .toString every time I create debug strings or compound IDs (which is often), I'd really like to use id as string.

I'd like to be able to compare a UUID to a string or use cid.contains(uid).

Additionally sometimes it would be nice to say str is UUID when I know something will always be the correct type.

I will eventually be using checksums as well for verifying valid game state on the client. I imagine that the usage of that will be similar but not nearly as widespread.

I probably wouldn't mind too much if UUIDs are assignable to strings without a cast, though I'd be sad to lose the extra type enforcement, it'd make sense.

Let me know if you want example code. I won't share the whole repo but I'm happy to show some problematic bits and pieces.

zm-cttae commented 1 year ago

FWIW, SHA hashes are cryptographic primitives. They only have one fixed length and probably don't need validation outside of a record read/write. There otherwise isn't much need to be validating the output of the SHA libraries in Node.js and Web APIs.

EDIT: typing a defined fixed length appears to be doable in current TypeScript so long as you don't enforce the character set of the stream:

https://stackoverflow.com/questions/41139763/how-to-declare-a-fixed-length-array-in-typescript

gregor-mueller commented 1 year ago

My use case is hex color values with the native color picker.

I maintain a UI library and we are providing a component with a color picker. I wanted to use the native one but when I tried it out, with values like red, #FFF, #FFFFFFCC, it will always have a black color.

I tried to create a simple type in order to inform any developer using our component, that it will only work properly with 6-digit hex values. But I only got so far, that it will require the # at the beginning (using a template literal type).

type HEX = `#${string}`

I could go and use something like zod but this will still only run during runtime. Another workaround of course is to throw a warning into the console and hope that the developer will read it. But.. that's why TypeScript is so great, I can tell others how to use the component right within the IDE.

I tried this, but it is too complex for the compiler:

type Character = 'a' | 'b' | 'c' | 'd' | 'e' | 'f'

type HexCharacter = number | Character | Uppercase<Character>

type HEX = `#${HexCharacter}${HexCharacter}${HexCharacter}${HexCharacter}${HexCharacter}${HexCharacter}`

Result: image

And it will also not properly check the inputs (false-positives, false-negatives, kinda random).

Also tried similar but different approaches (adding uppercase versions directly to the union, etc) but to no avail.
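For comparison, one workaround that stays within today's type system is to validate the literal character by character through a generic function instead of expanding a union. This is only a sketch under that assumption; HexDigit, IsSixHexDigits, HexColor and pickColor are hypothetical names, and it only helps for literals the compiler can see.

```typescript
// Hypothetical names: HexDigit, IsSixHexDigits, HexColor, pickColor.
type HexDigit =
  | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
  | 'a' | 'b' | 'c' | 'd' | 'e' | 'f'
  | 'A' | 'B' | 'C' | 'D' | 'E' | 'F';

// Consume one character at a time; Seen grows by one element per hex digit,
// so its length doubles as a counter.
type IsSixHexDigits<S extends string, Seen extends unknown[] = []> =
  S extends `${infer C}${infer Rest}`
    ? (C extends HexDigit ? IsSixHexDigits<Rest, [...Seen, C]> : false)
    : (Seen['length'] extends 6 ? true : false);

// Resolves to S for '#' followed by exactly six hex digits, and to never otherwise.
type HexColor<S extends string> =
  S extends `#${infer Body}`
    ? (IsSixHexDigits<Body> extends true ? S : never)
    : never;

function pickColor<S extends string>(color: HexColor<S>): string {
  return String(color).toLowerCase();
}

const accepted = pickColor('#FFcc00');
// pickColor('#FFF');    // intended to be rejected: only three digits
// pickColor('#GGGGGG'); // intended to be rejected: 'G' is not a hex digit
// pickColor('red');     // intended to be rejected: no leading '#'
```

The cost is that the component's prop or parameter has to be generic, and a value typed as plain string cannot be checked this way at all.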

So my proposal for this would be adding lengths to the string and number primitives (don't forget about bigInt).

We could easily supercharge template literal types with that.

type Letter = string(1)

type Digit = number(1)

type HEX = `#${string(6)}`

type UUID = `${string(8)}-${string(4)}-${string(4)}-${string(4)}-${string(12)}`

This is still lacking the option of restricting the characters (a-f).

If we take this one step further and apply it to Array, we can also get this fixed:

type Character = 'a' | 'b' | 'c' | 'd' | 'e' | 'f'

type HexCharacter = number | Character | Uppercase<Character>

type HEX = `#${Array<HexCharacter>(6)}`

// or

type AlternativeHEX = `#${HexCharacter[](6)}`

type UUID = `${Character[](8)}-${Character[](4)}-${Character[](4)}-${Character[](4)}-${Character[](12)}`

Note: The UUID sample is a bit simplified.

ArtemAvramenko commented 1 year ago

In my code I use an approach that gives static type checking, though not a full-fledged one. In development mode, real checks are also run at runtime. In production this adds minimal overhead.

The type declaration looks like this (see gist for full details):

interface UtcDateString extends String { __format: 'UTC Date'; }
declare interface String { asUtcDateString(): UtcDateString; }
addStringCheckMethod<UtcDateString>(
    'asUtcDateString',
    /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{1,3})?Z$/);

Example usage:

// type: UtcDateString extends String
let utcDate = '2000-12-31T12:00:00.000Z'.asUtcDateString();

// type: LocalDateString extends String
let localDate = '2000-12-31T08:00:00.000'.asLocalDateString();

// tsc error: Type '"Local Date"' is not assignable to type '"UTC Date"'
utcDate = localDate;

I understand that decorators are not suitable for this, but it would be nice to reduce this code to something like:

@addStringCheckMethod('asUtcDateString', /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{1,3})?Z$/)
interface UtcDateString extends String { }

zm-cttae commented 1 year ago

@gregor-mueller I'm busy atm, but did you try using the StackOverflow solution for arrays of fixed length? https://stackoverflow.com/questions/41139763/

dpalay commented 1 year ago

I've had several use cases come up, but the one that came up most recently is wanting to index an object by date string. That is, I want an object that I can index with strings of a specific format:


const dateInYYYYMMDD = /^\d{4}\-(0[1-9]|1[012])\-(0[1-9]|[12][0-9]|3[01])$/

interface IUser {
  logForADate: {
    [date: RegEx<dateInYYYYMMDD>]: ILog
  }
}

such that later, I could have

...
user.logForADate["2023-04-03"] // valid
...
const date = "2023-04-03"
user.logForADate[date] // valid
...
const someOtherDate = Date()
user.logForADate[someOtherDate] // error
...
const nonDateString = "hello there"
user.logForADate[nonDateString] // error

matthew-dean commented 1 year ago

@RyanCavanaugh I feel like it's a bit hand-wavy to say that the use cases are probably not valid and that there's probably something wrong with the mental model or the code. Also, you're making a fundamental logic flaw that the purpose of TypeScript is to only provide helpful suggestions for your own code. TypeScript is often used to provide guidance and feedback via a kind of "type API" for consumers of a library. In many cases, types are defined on a library to indicate whether the input someone is using is valid / invalid; in other words, to define error and feedback messages in an IDE. It's not just about will it compile or won't it, or can it be guarded in runtime or not. You can guard against edge cases in runtime with dynamic data, but that doesn't mean that more accurate typing when coding is not useful or is flawed.

I have similar applications as @gregor-mueller in which I want to provide feedback for very-well-defined types such as CSS colors, in which I would want the consumer of a library to be provided an error in the IDE and/or by the tsc CLI. Strings are far too broad, and template-tagged types are also way way too broad.

Now, I don't know if Regex-based types are the answer. (For one, of course, it blends well with string subset types but is a little trickier with numerical types.) Another possibility is what you see in parser grammars like Antlr or Nearley in which a very specific subset of Regex is used to define some typing rules.

I think a better way to frame this thread is the common feedback that the current way to define types is, while powerful in many cases, extremely limited in others and causes TypeScript to bork immediately. Being able to limit a string type to just a CSS hex color should absolutely be possible and absolutely is valid.

matthew-dean commented 1 year ago

I wonder if there's an opportunity here to just pave the cowpaths, and lean on Deepkit's well-thought-out system for constraining types.

It's a very elegant syntax. For example, to constrain to a regex pattern, you can use:

const myRegExp = /[a-zA-Z]+/;
type T = string & Pattern<typeof myRegExp>

// presumably you can use Pattern<typeof /[a-zA-Z]+/> to not have the regex at runtime?

As mentioned in this thread, regex obviously doesn't cover number types. Deepkit allows that too:

type T = number & Minimum<10> & Maximum<1000>;
type PositiveNumber = number & Positive

Now, DeepKit's system is partially to provide deep-typing that integrates with runtime validation, but the runtime is not a necessity in order to build this into TypeScript's native type-checking.

IMO, a pattern of <native-type> (& <constraint-type>)* would be far more flexible than just "regex-validated types" and would blend with the current mental model of types.

Re-writing some examples in this thread, you'd get:

type UTCDate = typeof /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{1,3})?Z$/
type UTCDateString = string & Pattern<UTCDate>
type CssColor = typeof /^#([0-9a-fA-F]{3}|[0-9a-fA-F]{4}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$/i;
const color: string & Pattern<CssColor> = '#000000'; // OK

matthew-dean commented 1 year ago

@RyanCavanaugh I really would love to chat sometime with the team how TypeScript is used, because it feels like how the TypeScript team thinks TypeScript is used, and how it is actually used in modern web development are vastly different, but that's quite off-topic.

matthew-dean commented 1 year ago

Oh, just a thought about this:

what if you tested for /^\d\d\d$/ instead of /^\d+$/?

These are different types, but maybe you meant something like this:

type FirstRegex = string & Pattern<typeof /^\d\d\d$/>
type SecondRegex = string & Pattern<typeof /^\d{3}$/>

const myValue: FirstRegex = '123'

function doSomething(s: SecondRegex) {
  console.log(s)
}
doSomething(myValue) // should not be an error

The key is that it should not matter how the regex is defined. They're the same type because both have the same sub-type constraint. Parsing systems often manage this by compiling regex and static strings to a common set of constraints. In other words, the following 3 are all the same type:

type One = string & ('a' | 'b')
type Two = string & Pattern<typeof /^[ab]$/>
type Three = string & Pattern<typeof /^[a-b]$/>

The fact that it's a regex shouldn't matter. The fact that the regex is written differently shouldn't matter. Regexes can be statically analyzed.

(Note: It’s not clear to me if a ^ or $ character should be needed to indicate a string is a “full match”. Probably? It depends on how one reasons about the problem.)

pstovik commented 1 year ago

Our use case is about consistent definition for product/service version property

TehShrike commented 1 year ago

Right, but presumably you want nominal types for all kinds of nominality, not just the subset of nominality that can be expressed with a regular expression

I don't believe most (any?) of my use cases overlap with nominal types.

I have a bunch of functions that take/return strings in the format YYYY-MM-DD. I don't care that they are all working off of the same definition of an ISO 8601 date string, I just care that they all match that pattern.

Same goes for decimal numbers stored in strings, or strings that represent country codes. I very rarely run into a case where I want nominal type with TypeScript (though I think it has happened once or twice, I don't think it had to do with strings).

Nominal types might incidentally solve some use cases (I'm probably motivated to only have one definition of country codes around my codebase), but it would be wholly inappropriate for others:

Incidentally, being able to use generics in regex return types would be sweet, so that financial_number.toString(4) could return type /^\d+\.\d{4}$/.

ljharb commented 1 year ago

I'd hope you care about more than the pattern, otherwise 2023-99-00 would be considered a valid date.

TehShrike commented 1 year ago

I'd hope you care about more than the pattern, otherwise 2023-99-00 would be considered a valid date.

it's true, my current type is actually

type Months = '01' | '02' | '03' | '04' | '05' | '06' | '07' | '08' | '09' | '10' | '11' | '12'
type Days = '01' | '02' | '03' | '04' | '05' | '06' | '07' | '08' | '09' | '10' | '11' | '12' | '13' | '14' | '15' | '16' | '17' | '18' | '19' | '20' | '21' | '22' | '23' | '24' | '25' | '26' | '27' | '28' | '29' | '30' | '31'

type IsoDate = `${ number }${ number }${ number }${ number }-${ Months }-${ Days }`

😅

which could obviously be improved on with regular expressions. Even with regular expressions it would take some effort to make the type fully reflect valid dates on the Gregorian calendar, but I'll take what I can get.
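
For the part a pattern alone can't cover, a runtime guard can fill the gap today. This is only a minimal sketch: `IsoDateString` and `isIsoDateString` are hypothetical names, and the `${number}` placeholders are a loose approximation of the shape rather than a digit-count check.

```typescript
// Loose compile-time shape; `${number}` does not enforce digit counts.
type IsoDateString = `${number}-${number}-${number}`;

// Hypothetical helper: narrows a string only if it is a real calendar date.
function isIsoDateString(value: string): value is IsoDateString {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (match === null) return false;

  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);

  // Build the date in UTC and check that nothing rolled over
  // (February 31st would otherwise silently become March 3rd).
  const date = new Date(0);
  date.setUTCFullYear(year, month - 1, day);
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}

console.log(isIsoDateString('2024-02-29')); // true
console.log(isIsoDateString('2023-02-31')); // false
console.log(isIsoDateString('2023-99-00')); // false
```

The guard still has to be called by hand, which is the weakness a regex type would remove.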

ljharb commented 11 months ago

Whose favorite day isn’t February 31st, after all

shaedrich commented 9 months ago

Not sure, but maybe someone has a use case that can already be solved by this workaround: https://mastodon.online/@dylhunn@towns.gay/109479824045137188

oliveryasuna commented 8 months ago

To address the checklist, assuming this is a compile-time check only:

As long as there is an explicit regex notation. For example, we could use /.

type IntegerString = `${/\d+/}`

As long as this is a compile-time check only.


This would be an absurdly useful feature. Imagine how smart and type-safe fluent SQL libraries would become.

saltman424 commented 8 months ago

Another thing to add, this isn't just helpful for validation, but also for extracting information. E.g.

type Id<
  TVersion extends Id.Version = Id.Version,
  TPartialId extends Id.PartialId = Id.PartialId,
  TContext extends Id.Context | undefined = Id.Context | undefined
> = TContext extends undefined ? `${TVersion}:${TPartialId}` : `${TVersion}:${TContext}:${TPartialId}`
namespace Id {
  export type Version = /v\d+/
  export namespace Version {
    export type Of<TId extends Id> = TId extends Id<infer TVersion> ? TVersion : never
  }

  export type PartialId = /\w+/
  export namespace PartialId {
    export type Of<TId extends Id> = TId extends Id<any, infer TPartialId> ? TPartialId : never
  }

  export type Context = /\w+/
  export namespace Context {
    export type Of<TId extends Id> = TId extends Id<any, any, infer TContext> ? TContext : never
  }
}

type MyId = Id<'v1', 'myPartialId', 'myContext'> // 'v1:myContext:myPartialId'
type MyPartialId = Id.PartialId.Of<MyId> // 'myPartialId'

This can be done with just string instead of a regular expression, but that leads to ambiguity. In the above example, 'myContext:myPartialId' could be interpreted as a single Id.PartialId.

tsujp commented 8 months ago

This constructs a literal string type containing only the allowed characters. If you attempt to pass invalid characters you get back never. This is fine for my use case (albeit a lot more TypeScript than I'd like for something simple), maybe it will help others until this becomes a smoother experience in TypeScript.

type HexDigit =
   | 0
   | 1
   | 2
   | 3
   | 4
   | 5
   | 6
   | 7
   | 8
   | 9
   | 'a'
   | 'b'
   | 'c'
   | 'd'
   | 'e'
   | 'f'

// Construct a string type with all characters not in union `HexDigit` removed.
export type OnlyHexDigits<Str, Acc extends string = ''> =
   Str extends `${infer D extends HexDigit}${infer Rest}`
      ? OnlyHexDigits<Rest, `${Acc}${D}`>
      : Acc

// Return given type `Hex` IFF it was unchanged (and thus valid) by `OnlyHexDigits`.
export type HexIntLiteral<
   Hex,
   FilteredHex = OnlyHexDigits<Hex>
> =
   Hex extends FilteredHex
      ? Hex
      : never

// Effectively an alias of `HexIntLiteral<'123'>`.
function hexInt<Hex extends string> (n: Hex & HexIntLiteral<Hex>) {
   return n as HexIntLiteral<Hex>
}

// Without the 'alias' form.
declare const t1: HexIntLiteral<'123'> // '123'
declare const t2: HexIntLiteral<'cafebabe'> // 'cafebabe'

// Using the 'alias' form.
const t3 = hexInt('zzzz') // never
const t4 = hexInt('a_b_c_d') // never
const t5 = hexInt('9287319283712ababababdefffababa12312') // <-- that

// Remember, the type is a string literal so `let` is still (as far as TypeScript
//   is concerned) immutable (not _really_).
let t6 = hexInt('cafe123')

t6 = '123' // We (humans) know '123' is valid, but `t6` is a string literal `cafe123`
           //   so this is an error (2322): type '123' not assignable to type 'cafe123'
           //   because we construct a _string literal_ type.

This can likely be simplified but I waste a lot of time code golfing TypeScript types so I abstain this time.

mauriziocescon commented 2 months ago

My case:

const obj = {
  _test1: '1', 
  test2: '2',
  _test3: '3',
  test4: '4',
};

function removeKeysStartingWith_(obj: Record<string, unknown>): Record<string, unknown> {
  const x: Record<string, unknown> = {};

  Object.keys(obj)
    .filter(key => !/^_/i.test(key))
    .forEach(key => x[key] = obj[key]);

  return x;
}

// {"test2":"2", "test4":"4"} 

I cannot express the fact that the return object of a function cannot have keys starting with "_". I cannot define the precise keyof set without a RegExp (to be used in combination with conditional types).

RyanCavanaugh commented 2 months ago

@mauriziocescon template literal strings work fine for this; you don't need regexes

const obj1 = {
  _test1: '1', 
  test2: '2',
  _test3: '3'
};
type RemoveUnderscore<K> = K extends `_${string}` ? never : K;
type NoUnderscores<T> = {
    [K in keyof T as RemoveUnderscore<K>]: T[K];
}
declare function removeKeysStartingWith_<T extends object>(obj: T): NoUnderscores<T>; 
const p1 = removeKeysStartingWith_(obj1);
p1.test2; // ok
p1._test1; // not ok

mauriziocescon commented 2 months ago

Thanks a lot for the instantaneous feedback! I missed that part... 😅

Peeja commented 2 months ago

@mauriziocescon Be careful, though: that type means that you definitely do not know whether there are any keys beginning with _, not that you know that there aren't. Without exact types, TypeScript can't express the latter. But the former is usually good enough.
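
To make the difference concrete, here is a small sketch reusing the mapped type from above (`sourceObj` and `stillHasSecret` are made-up names). The value is assignable to the filtered type even though the underscore key is still present at runtime:

```typescript
type RemoveUnderscore<K> = K extends `_${string}` ? never : K;
type NoUnderscores<T> = {
  [K in keyof T as RemoveUnderscore<K>]: T[K];
};

const sourceObj = { _secret: 'hidden', visible: 'shown' };

// NoUnderscores<typeof sourceObj> is { visible: string }, yet this assignment
// compiles: object types are not exact, so extra properties are allowed.
const stillHasSecret: NoUnderscores<typeof sourceObj> = sourceObj;

console.log(Object.keys(stillHasSecret)); // [ '_secret', 'visible' ]
```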

saltman424 commented 2 months ago

@RyanCavanaugh

Use case

I would like to use this type:

type Word = /^\w+$/

I use this as a building block for many template strings. E.g.:

// I mainly don't want `TPartialId` to contain ':',
// as that would interfere with my ability to parse this string
type Id<
  TType extends Type,
  TPartialId extends Word
> = `${TType}:${TPartialId}`

Answers to some of your questions

I use this in a mix of static and dynamic use cases. E.g.

const validId: Id = 'sometype:valid'
// this should not be allowed
const invalidId: Id = 'sometype:invalid:'

declare function createId<TType extends Type, TPartialId extends Word>(
  type: TType,
  partialId: TPartialId
): Id<TType, TPartialId>
declare function getPartialId<TId extends Id>(
  id: TId
): TId extends Id<any, infer TPartialId> ? TPartialId : Word

declare function generateWord(): Word

I absolutely want to use regular expression types in template literals (as seen in above examples). However, while it would be nice to have, I don't need to be able to use anything within my regular expression types. (e.g. I don't really need type X = /${Y}+/; type Y = 'abc')

I would appreciate the ability to do something like this:

const WORD_REGEXP = /^\w+$/
export type Word = Regex<typeof WORD_REGEXP>
export function isWord(val: unknown): val is Word {
  return typeof val === 'string' && WORD_REGEXP.test(val)
}

However, if I had to write the same regular expression twice, it would still be better than the current state.

I don't think the above part approaches nominal typing. At a high level, regular expression is basically a structural type for a string. You can determine if a string matches the regular expression solely based on the string's contents, ignoring any metadata about the string. With that being said, I do acknowledge that it is harder to determine if a type for a string matches a regular expression, which is where things get kind of nominal. Specifically, to your point:

There's also a problem of the implicit subtyping behavior you'd want here -- what if you tested for /^\d\d\d$/ instead of /^\d+$/? Programmers are very particular about what they think the "right" way to write a regex are, so the feature implies either implementing regex subtyping so that the subset behavior can be validated, or enduring endless flamewars in places like DT as people argue about which regex is the correct one for a given problem.

If you are within one project, you should create one type with whatever the "right" regex for that project is and reference that everywhere. If you are working with a library, you should use the type from that library. Either way, you shouldn't have to recreate a regular expression type in the way that you think is "right." And if you want to add additional restrictions, just use an intersection. That said, I do recognize that without subtyping, things do get pretty nominal when determining if types match a regular expression. However, we currently deal with that type of problem with deferred evaluation of type parameters in functions/classes. So semi-nominal types in certain contexts don't seem to be a deal-breaker, although I do acknowledge deferred type parameters are never fun to deal with.

Most functions with implicit data formats aren't also publishing a canonical regex for their data format.

To be fair, the canonical regex doesn't generally matter externally at the moment. If it did matter externally, e.g. it was used in a type, they would be more likely to publish it

Alternative: template string enhancements

I do agree that enhancements to template strings could work. In my use case, these would be sufficient:

  1. Some way to repeat 0+ or 1+ times (maybe circular references - see below)
  2. Preferably, built in utility types for \w, \d, \s, and other similar RegExp features. (e.g. type Digit = '0' | '1' | '2' | ...)

With these, I could do something like:

type WordCharacter = 'a' | 'b' | ... // preferably this is built into TypeScript
type Word = `${WordCharacter}${Word | ''}` // === /^\w+$/
type WordOrEmpty = Word | '' // === /^\w*$/

However, these would not work if I wanted to do this through negation, which I had thought about. E.g.:

type PartialId = /^[^:]+$/

If you like these enhancements, I can put them in proposals in one or more separate issues
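
For reference, the negation case can be approximated today for string literals with a conditional type, though only through a generic helper. A minimal sketch, where `NoColon` and `partialId` are hypothetical names; a plain `string` argument passes through unchecked, so this does nothing for dynamic values:

```typescript
// Resolves to S when S is non-empty and contains no ':', otherwise never.
type NoColon<S extends string> =
  S extends '' ? never :
  S extends `${string}:${string}` ? never :
  S;

// Hypothetical helper: the intersection makes invalid literals fail to type-check.
function partialId<S extends string>(value: S & NoColon<S>): S {
  return value;
}

const okId = partialId('myPartialId');
console.log(okId); // myPartialId

// partialId('bad:id'); // should not compile: the parameter type becomes never
```

This only checks call sites; it does not give me a standalone `PartialId` type to use inside other template literals, which is the part I actually want.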

samueldcorbin commented 1 week ago

To add a very straightforward use case to this: custom element names.

Custom element names must begin with a lowercase letter, must have a dash, and are frequently defined as string literals, not dynamically. This seems like something that TypeScript should absolutely be able to handle: it's easy for people to carelessly forget that the elements have to have a dash or must be lowercased, and it's annoying to only find out at runtime.

Sometimes people define custom element names dynamically, but they define them as literals often too. It would be nice if we could at least check the literals, even if we can't check the dynamic ones.
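
The closest thing I can get for literals today is a template literal approximation like the sketch below. `LowerLetter`, `CustomElementName`, and `elementName` are names I made up, and this is not the real grammar: the spec also allows some non-ASCII characters and reserves a handful of names such as 'font-face'.

```typescript
type LowerLetter =
  | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm'
  | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z';

// Approximation only: starts with a-z, contains a dash, has no uppercase letters.
type CustomElementName<S extends string> =
  S extends `${LowerLetter}${string}-${string}`
    ? S extends Lowercase<S> ? S : never
    : never;

// Hypothetical wrapper; a real one would go on to call customElements.define.
function elementName<S extends string>(tag: S & CustomElementName<S>): S {
  return tag;
}

const registeredTag = elementName('my-element');
console.log(registeredTag); // my-element

// elementName('myelement');  // should not compile: no dash
// elementName('my-Element'); // should not compile: uppercase letter
```

It works for the simple cases, but it is a lot of ceremony compared to writing the pattern once as a regex.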

On the whole, the discussion of this proposal is extremely frustrating to read. The evaluation begins with "Checking string literals is easy and straightforward". Great. So why is adding an easy and straightforward thing being held up for literal years by discussion about maybe adding much less easy and much less straightforward things?

I understand the general sentiment that you want to be careful about making a simple syntax for the easy case that accidentally blocks future extension of functionality when you get to the hard cases, but that doesn't look like an issue here. Maybe capture groups would be useful, maybe dynamic strings would be useful. But adding support for string literals and regex without capture groups is easy and doesn't block adding support for dynamic strings and capture groups later.