microsoft / TypeScript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
https://www.typescriptlang.org
Apache License 2.0

Regex-validated string types (feedback reset) #41160

Open RyanCavanaugh opened 3 years ago

RyanCavanaugh commented 3 years ago

This is a pickup of #6579. With the addition of #40336, a large number of those use cases have been addressed, but possibly some still remain.

Update 2023-04-11: Reviewed use cases and posted a write-up of our current evaluation

Search Terms

regex string types

Suggestion

Open question: For people who had upvoted #6579, what use cases still need addressing?

Note: Please keep discussion on-topic; moderation will be a bit heavier to avoid off-topic tangents

Examples

(please help)

Checklist

My suggestion meets these guidelines:

AnyhowStep commented 3 years ago

Use case 1, URL path building libraries,

/*snip*/
createTestCard : f.route()
    .append("/platform")
    .appendParam(s.platform.platformId, /\d+/)
    .append("/stripe")
    .append("/test-card")
/*snip*/

These are the constraints for .append(),


Use case 2,


Use case 3, safer RegExp constructor (and similar functions?),

new(pattern: string, flags?: PatternOf</^[gimsuy]*$/>): RegExp

yume-chan commented 3 years ago

Template string type can only be used in conditional type, so it's really a "type validator", not a "type" itself. It also focuses more on manipulating strings, I think it's a different design goal from Regex-validated types.

It's doable to use conditional types to constrain parameters, for example taken from https://github.com/microsoft/TypeScript/issues/6579#issuecomment-710776922

declare function takesOnlyHex<StrT extends string> (
    hexString : Accepts<HexStringLen6, StrT> extends true ? StrT : {__err : `${StrT} is not a hex-string of length 6`}
) : void;

However, I think this pattern has several issues:

  1. It's not a common pattern, and cumbersome to repeat every time.
  2. The type parameter should be inferred, but was used in a condition before it "can" be inferred, which is unintuitive.
  3. TypeScript still doesn't support partial generic inference (#26349), so it may be hard to use this pattern with more generic parameters.
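
For reference, a self-contained sketch of the pattern above might look like the following. `HexDigit` and `IsHex6` are names made up for illustration; they stand in for the `Accepts<HexStringLen6, StrT>` helper from the linked comment, whose definition isn't reproduced here.

```typescript
// Hypothetical stand-ins for the helper types in the linked comment.
type HexDigit =
  | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
  | 'a' | 'b' | 'c' | 'd' | 'e' | 'f'
  | 'A' | 'B' | 'C' | 'D' | 'E' | 'F';

// Walk the string one character at a time, counting hex digits in a tuple.
type IsHex6<S extends string, Seen extends unknown[] = []> =
  S extends `${infer Head}${infer Rest}`
    ? Head extends HexDigit
      ? IsHex6<Rest, [...Seen, unknown]>
      : false
    : Seen['length'] extends 6 ? true : false;

// The parameter type collapses to an "error object" type when validation fails,
// so an invalid literal is rejected at the call site.
function takesOnlyHex<StrT extends string>(
  hexString: IsHex6<StrT> extends true
    ? StrT
    : { __err: `${StrT} is not a hex-string of length 6` }
): string {
  return String(hexString);
}

const accepted = takesOnlyHex('ff00aa'); // compiles

// @ts-expect-error: 'zz00aa' contains non-hex characters
takesOnlyHex('zz00aa');
```

Note how the generic parameter, the conditional type and the error-carrying object all have to be repeated for every validated parameter, which is issue 1 in the list above.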

bmix commented 3 years ago

Would this allow me to define type constraints for String to match the XML specification's Name constructs (short summary) and QNames by expressing them as regular expressions? If so, I am all for it :-)

ksabry commented 3 years ago

@AnyhowStep It isn't the cleanest, but with conditional types now allowing recursion, it seems we can accomplish these cases with template literal types: playground link

AnyhowStep commented 3 years ago

We can have compile-time regular expressions now. But anything requiring conditional types and a generic type param to check is a non-feature to me.

(Well, non-feature when I'm trying to use TypeScript for work. All personal projects have --noEmit enabled because real TS programmers execute in compile-time)

arcanis commented 3 years ago

Open question: For people who had upvoted #6579, what use cases still need addressing?

We have a strongly-typed filesystem library, where the user is expected to manipulate "clean types" like Filename or PortablePath versus literal strings (they currently obtain those types by using the as operator on literals, or calling a validator for user-provided strings):

export interface PathUtils {
  cwd(): PortablePath;

  normalize(p: PortablePath): PortablePath;
  join(...paths: Array<PortablePath | Filename>): PortablePath;
  resolve(...pathSegments: Array<PortablePath | Filename>): PortablePath;
  isAbsolute(path: PortablePath): boolean;
  relative(from: PortablePath, to: PortablePath): PortablePath;
  dirname(p: PortablePath): PortablePath;
  basename(p: PortablePath, ext?: string): Filename;
  extname(p: PortablePath): string;

  readonly sep: PortablePath;
  readonly delimiter: string;

  parse(pathString: PortablePath): ParsedPath<PortablePath>;
  format(pathObject: FormatInputPathObject<PortablePath>): PortablePath;

  contains(from: PortablePath, to: PortablePath): PortablePath | null;
}

I'm investigating template literals to remove the as syntax, but I'm not sure we'll be able to use them after all:

The overhead sounds overwhelming, and makes it likely that there are side effects that would cause problems down the road - causing further pain if we need to revert. Ideally, the solution we're looking for would leave the code above intact, we'd just declare PortablePath differently.

RyanCavanaugh commented 3 years ago

@arcanis it really sounds like you want nominal types (#202), since even if regex types existed, you'd still want the library consumer to go through the validator functions?
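
For readers unfamiliar with the approach, the "tagged type plus validator function" setup that exists today looks roughly like this sketch. The `__brand` property, the `isPortablePath`/`toPortablePath` helpers and the "no backslashes" rule are all assumptions for illustration, not the actual library code.

```typescript
// Hypothetical brand: `__brand` exists only in the type system, never at runtime.
type PortablePath = string & { readonly __brand: 'PortablePath' };

// Hypothetical rule, for illustration only: portable paths contain no backslashes.
const PORTABLE_PATH = /^[^\\]*$/;

// Type guard: narrows a plain string to the branded type when the check passes.
function isPortablePath(value: string): value is PortablePath {
  return PORTABLE_PATH.test(value);
}

// Validator for user-provided strings: throws instead of returning a bad value.
function toPortablePath(value: string): PortablePath {
  if (!isPortablePath(value)) {
    throw new TypeError(`Not a portable path: ${value}`);
  }
  return value;
}

// A consumer that only accepts validated paths.
function describePath(p: PortablePath): string {
  return `path:${p}`;
}

const described = describePath(toPortablePath('/usr/bin'));

// @ts-expect-error: a plain string literal is not a PortablePath
describePath('/usr/bin');
```

The last line is the friction point: even an obviously valid literal has to go through `as PortablePath` or the validator.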

hanneswidrig commented 3 years ago

I have a strong use case for Regex-validated string types. AWS Lambda function names have a maximum length of 64 characters. This can be manually checked in a character counter but it's unnecessarily cumbersome given that the function name is usually composed with identifying substrings.

As an example, this function name can be partially composed with the new work done in 4.1/4.2. However, there is no easy way to make TypeScript report a compiler error, even though the function name below is longer than 64 characters.

type LambdaServicePrefix = 'my-application-service';
type LambdaFunctionIdentifier = 'dark-matter-upgrader-super-duper-test-function';
type LambdaFunctionName = `${LambdaServicePrefix}-${LambdaFunctionIdentifier}`;
const lambdaFunctionName: LambdaFunctionName  = 'my-application-service-dark-matter-upgrader-super-duper-test-function';

This StackOverflow Post I created was asking this very same question.

With the continued rise of TypeScript in back-end related code, statically defined data would be a likely strong use case for validating the string length or the format of the string.
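
A length limit can be approximated today with a recursive conditional type, at the cost of wrapping every value in a generic helper. The sketch below is only an illustration: `FitsWithin`, `LambdaName` and `lambdaName` are hypothetical names, and it assumes a TypeScript version (4.5 or later) where conditional types can recurse this deep.

```typescript
// Hypothetical helper: resolves to true when S has at most Max characters.
type FitsWithin<S extends string, Max extends number, Seen extends unknown[] = []> =
  S extends `${infer _Head}${infer Rest}`
    ? Seen['length'] extends Max
      ? false // Max characters already consumed and more remain
      : FitsWithin<Rest, Max, [...Seen, unknown]>
    : true;

// Either the literal itself, or an object type that no string can satisfy.
type LambdaName<S extends string> = FitsWithin<S, 64> extends true
  ? S
  : { __err: 'Lambda function names are limited to 64 characters' };

// Identity-style helper whose only job is to trigger the check on literals.
function lambdaName<S extends string>(name: LambdaName<S>): string {
  return String(name);
}

const okName = lambdaName('my-application-service-resize-images');

// @ts-expect-error: this literal is longer than 64 characters
const tooLong = lambdaName('my-application-service-dark-matter-upgrader-super-duper-test-function');
```

This only works for string literals passed through the helper; a plain `const name: SomeType = '...'` annotation, as in the example above, cannot express the same constraint.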

johnbillion commented 3 years ago

TypeScript supports literal types, template literal types, and enums. I think a string pattern type is a natural extension that allows for non-finite value restrictions to be expressed.

I'm writing type definitions for an existing codebase. Many arguments and properties accept strings of a specific format:

fabiospampinato commented 3 years ago

I'd like to argue against @RyanCavanaugh's claim in the first post saying that:

a large number of those use cases have been addressed, but possibly some still remain.

As it stands presently TypeScript can't even work with the following type literal:

type Digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9;

type Just5Digits = `${Digit}${Digit}${Digit}${Digit}${Digit}`;

Throwing an "Expression produces a union type that is too complex to represent.(2590)" error.

That's the equivalent of the following regex:

/^\d{5}$/

Just 5 digits in a row.

Almost all useful regexes are more complicated than that, and TypeScript already gives up with that, hence I'd argue the opposite of that claim is true: a small number of use cases have been addressed and the progress with template literals has been mostly orthogonal really.

ghost commented 3 years ago

What about validation of JSON schema's patternProperties regex in TypeScript interfaces for the parsed object? This is a PERFECT application of the regex-validated string feature.

Possible syntax using a matchof keyword:

import { IJSONSchema, IJSONSchemaMap } from 'vs/base/common/jsonSchema';

export const UnscopedKeyPtn: string = '^[^\\[\\]]*$';

export type UnscopedKey = string & matchof RegExp(UnscopedKeyPtn);

export const tokenColorSchema: IJSONSchema = {
    properties: {},
    patternProperties: { [UnscopedKeyPtn]: { type: 'object' } }
};

export interface ITokenColors {
    [colorId: UnscopedKey]: string;
}

sushruth commented 3 years ago

I just want to add to the need for this because template literals do not behave the way we think explicitly -

type UnionType = {
    kind: `type1_${string}`,
    one: boolean;
} | {
    kind: `type1_${string}_again`,
    two: string;
}

const union: UnionType = {
//     ~~~~~ > Error here -
/**
Type '{ kind: "type1_123"; }' is not assignable to type 'UnionType'.
  Property 'two' is missing in type '{ kind: "type1_123"; }' but required in type '{ kind: `type1_${string}_again`; two: string; }'.ts(2322)
*/
    kind: 'type1_123',
}

This shows that template literal types are not mutually exclusive: one can be a subset of another even when that is not the intent. A regex would let us put a $ at the end to denote the end of the string, which would help discriminate clearly between the constituent types of this union.
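
The overlap can be reproduced in a few lines. In this sketch, `Plain` and `Again` are hypothetical names for the two `kind` patterns, and the two regular expressions are just one assumed way of expressing the non-overlapping intent at runtime.

```typescript
type Plain = `type1_${string}`;
type Again = `type1_${string}_again`;

// The same literal is assignable to both template literal types,
// so `kind` cannot act as a discriminant between them.
const asPlain: Plain = 'type1_123_again';
const asAgain: Again = 'type1_123_again';

// Anchored patterns can state the intent without overlap:
// `plainOnly` refuses anything that ends in "_again".
const plainOnly = /^type1_(?!.*_again$).*$/;
const againOnly = /^type1_.*_again$/;

const sample = 'type1_123_again';
const matchesPlain = plainOnly.test(sample); // false
const matchesAgain = againOnly.test(sample); // true
```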

ghost commented 3 years ago

(CC @Igmat) It occurs to me that there's a leaning towards using regex tests as type literals in #6579, i.e.

type CssColor = /^#([0-9a-fA-F]{3}|[0-9a-fA-F]{4}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$/i;
const color: CssColor = '#000000'; // OK

It seems that regexes are usually interpreted as values by the TS compiler. When used as a type, this usually throws an error that keeps types and values as distinct as possible. What do you think of:

type CssColor = matchof /^#([0-9a-fA-F]{3}|[0-9a-fA-F]{4}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$/i;
const color: CssColor = '#000000'; // OK

Editing this to note something - the RegExp.prototype.test method can accept numbers and other non-string primitives. I think that's a neat feature. If people want to strictly validate strings, they can use an intersection type with string. 😄

TL;DR: regex literal types aren't intuitively and visibly types without explicit regex->type casting, can we propose that?

Etheryte commented 3 years ago

I'm not sure what the benefit of a separate keyword is here. There doesn't seem to be a case where it could be ambiguous whether the regex is used as a type or as a value, unless I'm missing something? I think https://github.com/microsoft/TypeScript/issues/6579#issuecomment-261519733 and the replies below it already sketch out a syntax that hits the sweet spot of being both succinct and addressing all the use cases.

Regarding the intersection, the input to RegExp.prototype.test is always turned into a string first, so that seems superfluous.

ghost commented 3 years ago

Good to know about RegExp.prototype.test.

The ambiguity seems straightforward to me. As we know, TypeScript is a JS superset & regex values can be used as variables.

To me, a regex literal is just not an intuitive type - it doesn't imply "string that matches this regexp restriction". It's common convention to camelcase regex literals and add a "Regex" suffix, but that variable name convention as a type looks really ugly:

export const cssColorRegex: RegExp = /^#([0-9a-fA-F]{3}|[0-9a-fA-F]{4}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})$/i;
const color: cssColorRegex = '#000000'; // OK
//           ^ lc 👎 ^ two options:
//                   - A. use Regex for value clarity but type confusion or 
//                   - B. ditch Regex for unclear value name but clear type name

The original proposal does suggest JSON schemas, which would use the regex as both a type and a value (if implemented).

Etheryte commented 3 years ago

Perhaps I wasn't very clear, there doesn't seem to be a case where it would be ambiguous for the compiler whether a regex is a type or a value. Just as you can use string literals both as values and as types:

const foo = "literal"; // Used as a value
const bar: "literal" = foo; // Used as a type

The exact same approach can be applied for regex types without ambiguity.

ghost commented 3 years ago

My concern is that the regex means two different things in the two contexts - literal vs "returns true from RegExp.test method". The latter seems like a type system feature exclusively - it wouldn't be intuitive unless there's syntax to cast the regex into a type

ghost commented 3 years ago

There is also the issue of regex literals and regex types possibly being used as superclasses:

If all regex literals and type variables are cast into validators implicitly without a keyword, how do we use RegExp interfaces and regex literals with optional methods as a object type?

To me, context loss in https://github.com/microsoft/TypeScript/issues/41160#issuecomment-853419095 is enough reason to add a keyword, but this is another reason. I'm unsure of the name I suggested but I do prefer the use of an explicit type cast.

edazpotato commented 2 years ago

I would love this! I've had tons of issues that could be easily solved with RegEx types.

For example, a very basic IETF language tag type that accepts strings like "en-GB" or "en-US" but rejects strings that don't match the casing correctly. Using template literals (doesn't work): [image]

How it could be done easily with RegEx types:

export type CountryCode = /^[a-z]{2}-[A-Z]{2}$/;

(I know that technically you can represent this sort of type, but it's just a simple example)

nonara commented 2 years ago

I was thinking about this a bit, while working on another PR which implements an intrinsic utility function.

I have not read through this or the previous thread very thoroughly, so forgive me if this doesn't line up with the direction of the conversation, but I'd love to hear what people think of this proposal.

I believe the heart of the issue here is having the ability to validate string and number literals. This is a slightly different take, but here is a proposal for an intrinsic utility type which could provide that functionality.

(Note: One other advantage that this has over template literals by themselves is the ability to provide custom error messages.)

Features

Example

// Utility Definition

/**
  * Actual name TBD
  * @param Regex - String or template literal for regex (in the same format as new RegExp(`<Regex>`))
  * @param Flags - Optionally, provide regex flags (as in new RegExp('', '<flags>'))
  * @param Constraint - Optionally, define initial constraint for value (cannot be a literal)
  * @param ErrorMessage - Optionally, define a message to display to make the error more understandable
  */
type Validated<Regex extends string, Flags extends string = '', Constraint extends string | number = string | number, ErrorMessage extends string = never> = {
  intrinsic;
}

// ----

// Example Usage

// Intentionally simple contrivance, for demo purposes
type Email<Domain extends string = '\\S+\\.\\S+', Message extends string = 'Invalid email address format!'> = 
  Validated<`\\S+@${Domain}`, '', string, Message>

// Validate against default email pattern + specific domain (intersection applies both validators)
type InternalEmail = Email & Email<'mycompany\\.com', 'Must be company email address!'>

// Example Implementation

let email: InternalEmail;
email = 'bad user@mycompany.com' as const; // Fails with "Validation Error: Invalid email address format!"
email = 'user@badcompany.com' as const; // Fails with "Validation Error: Must be company email address!"
email = 3; // Fails with "Validation Error: Must be string literal!" (Because Constraint doesn't match)

Notes

Please discuss, and let me know of any problems or suggestions. If people see value to this, I'll write the PR.

For the compiler folks

Initial thoughts:

// Utility produces this
interface ValidatedLiteralType {
  constraint: Type /* string | number, or one of the two */
  regex: RegExp[] /* Array of compiled regex */
  errorMessage?: StringLiteralType
}

Etheryte commented 2 years ago

The above proposal has good ideas in mind, but similar to some other discussions in this thread and the one prior, it seems to fall on the very verbose side.

type InternalEmail = Email & Email<"literal", ...>;

Comparing this to the existing literal value syntax, the additional intersection seems redundant.

type Foo = string & "literal"; // same as type Foo = "literal";

Likewise for the syntax, this comment by Ihor in the previous thread shows different use cases with the regular regex syntax which already covers both disambiguation and flags.

type CssColor = /^#([0-9a-f]{3}|[0-9a-f]{6})$/i;

Perhaps I'm missing something, currently I don't see what the generic adds over this. No other types currently support custom error messages natively out of the box (though there are workarounds you can use), so that would probably need to be a separate proposal by itself.

nonara commented 2 years ago

Thanks for the reply! There are several significant differences:

  1. Supports generating regex using template literals (see example Email generic type)
  2. Allows constraint definition for number and/or string
  3. Custom error message
  4. Provides for multiple regex patterns within a single type (from an internal compiler perspective)

These are strong distinctions, and ones which I believe have a bit of advantage over what you've mentioned. I think 1 is the most pronounced in terms of advantage.

4 is good for overall compiler performance. I'm not entirely sure about it, but it could open the door to making DRY composite validated types, if you have string literals in regex format stored in separate types and you want to re-use them across different validators with a single message.

Regarding 2 (constraint), for example, in the proposal:

type CssColor = /^#([0-9a-f]{3}|[0-9a-f]{6})$/i;

What is the actual constraint of CssColor? I assume that this pattern proposal must match a string constraint.

Consider:

type ThreeDigitCode = /^\d{3}$/
// Would this work? If it did, what resolved constraint would code be? In other words, would it be treated as number? 
// Technically, if it supports string | number, it shouldn't be narrowed from assignment, so you'd be left with string | number
const code:ThreeDigitCode = 345 as const; 

As for error messages in a different proposal, I actually believe that this helps greatly improve the strength and value of the proposal. Regex validation without a discernable message is going to prove frustrating for users. Especially when they are implemented in another library or piece of code that you've not personally written.

In terms of verbosity, it doesn't seem so bad to me, and the ability to provide the extra parameters seems worth it.

type ThreeDigitCode = Validated<'^\\d{3}$', '', number>

I suppose it's also worth mentioning:

  1. If it's green-lit, I'll actually write it 😅

nonara commented 2 years ago

Comparing this to the existing literal value syntax, the additional intersection seems redundant.

I missed this comment. I used the intersection to demonstrate using multiple validators and the power of generics with template literal support. See the Example Implementation in my demo code and note:

  • Intersection means validate against all

This allows a specific error message based on which condition is violated

ghost commented 2 years ago

So one could implement an entire custom JSON schema validator in TypeScript? Interesting..

arcanis commented 2 years ago

@arcanis it really sounds like you want nominal types (#202), since even if regex types existed, you'd still want the library consumer to go through the validator functions?

Frankly, even if it only worked with literal types I'd be fine with that. We already have nominal types (of sort) by using tagged strings. Our problem is more: "how can we accept literals as input", with an optional "and validate them".

Even something as simple as:

type PortablePath = TaggedPortablePath | literal_string;

That would still be better since at least we wouldn't have to write as PortablePath everywhere we use literals (which is a lot, especially inside our tests). Of course the best would be to also validate them:

type PortablePath = TaggedPortablePath | literal_string(/^[^/]*$/);

But that is secondary compared to express types specifically targeting literals (because being a literal somewhat encodes that the user intends to pass this value, so checking is less important than arbitrary values - even if it would certainly be better to have both).

As for @nonara's proposal, it sounds like exactly what we'd need, both for literals and validation. I don't mind much about verbosity, since most of it would be abstracted in intermediary types anyway. The as const would be a bit annoying though - is it necessary? With the template string improvements in 4.3, shouldn't TS preserve the string type as static anyway?

nonara commented 2 years ago

The as const would be a bit annoying though - is it necessary?

Probably not necessary, unless anyone can provide reason for why it should be.

ghost commented 2 years ago

To me, i18n is a big reason to avoid custom error messages, at least until TypeScript adds some native consistent way to internationalise those for users of other languages.

Etheryte commented 2 years ago

@nonara Regarding constraints, personally I would expect regex validated literal types to always be strings. That's where the proposal originally started out (and why I'm following it), but that is highly subjective and I can see some arguments for the other side too.

The reason why I personally feel this way is the following. Natively, JavaScript only supports regex on strings. That can be worked around in one way or another if you'd like, but since JavaScript is the underlying language for TypeScript, matching its intuition can lower the number of foot-guns.

In addition to that, adding regex support for numbers creates a considerable amount of ambiguity that simply didn't exist before. A good example is non-decimal bases. Is const foo: NumberLiteral</\d{3}/> = 0b1111; valid? Would it be possible to only allow hex literals in a context where that makes sense? Or do you want to match whatever the number evaluates to instead? Likewise for floating-point errors, would you expect const foo: NumberLiteral</0.3/> = 0.1 + 0.2; to be an error or not?

Without taking a side on any of those questions, I hope you can see that numbers require far more consideration than boring old strings in this regard. Regex on strings is already a hard problem, but at least it's a fairly well-known problem, and that's why I'd prefer to have that in type checking.

nonara commented 2 years ago

i18n is a big reason to avoid custom error messages

I hear you. However, something to consider. In the event it fails:

Without custom message: Validation failed for YourType (in your language)

With custom message: Validation failed for YourType: (in your language) + <Custom message> (single language)

In these scenarios, you lose nothing with the latter, as i18n translation is provided for base message. You do, however, gain some information. With respect, that argument is like prescribing not adding JSDoc documentation or comments due to lack of i18n. It's better to have information which may be marginally less than ideal in some scenarios than none at all.

If the proposal entirely replaced the base message, I'd agree, but given that it simply adds information, I don't see this being a negative.

Beyond that, i18n would actually still be possible if setup properly.

ghost commented 2 years ago

Don't get me wrong, I actually like the syntax and I feel that it's more TypeScript-ish than the current proposal, which still confuses me to some degree.

nonara commented 2 years ago

I would expect regex validated literal types to always be strings

It's certainly true that regex processes strings. It's also true, though, that JavaScript does handle matching on numbers automatically by coercion. The method we'd be using internally is RegExp#test, which accepts number values.

In this case, I'm not married to it, but I actually think there is greater value in supporting numbers just as the test method does. I'll explain after addressing your questions.

Is const foo: NumberLiteral</\d{3}/> = 0b1111; valid?

No, because 0b1111 evaluates to 15, and thus it fails for length. TypeScript, however treats 0b1111 as const as the literal 15, so it would be handled as such.

/\d{2}/.test(0b1111) passes.

Would you expect const foo: NumberLiteral</0.3/> = 0.1 + 0.2; to be an error or not?

Concerning simply javascript's test method:

However, in this case, this is where these fears are assuaged.

Remember that we're dealing with non-calculated, hard coded numbers, that exist in the type system. The type system currently affords no way to perform math on numeric literals.

(3.1 + 4.1) as const produces an error.

Consequently, the example you provided (const foo: Validated<'0.3'> = 0.1 + 0.2; // format corrected) would produce an error saying that it requires a number literal, because a literal type cannot be derived from a BinaryExpression (the 0.1 + 0.2 clause), so it would simply be the same as saying const foo: Validated<'0.3'> = <number>0.3;

For that reason, while concerns over floating point issues are well founded and generally very valid, especially with respect to regex, they're not a factor in this case. Regarding any other concerns, I would suggest that we simply add to the documentation that it functions the same way that RegExp#test does, coercing number literals to strings so they can be validated.
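
As a quick sanity check of the coercion behaviour discussed above, here is what plain JavaScript does with those values. The explicit `String(...)` calls are an assumption made to keep the snippet type-correct, since TypeScript's own declaration of `test` accepts only a `string`; at runtime the same conversion happens implicitly.

```typescript
// A binary literal is just the number 15 by the time a regex sees it.
const fifteen = 0b1111;
const asText = String(fifteen);              // "15"
const threeDigits = /\d{3}/.test(asText);    // false: only two digits
const twoDigits = /\d{2}/.test(asText);      // true

// A computed float stringifies with its rounding error included.
const sumText = String(0.1 + 0.2);           // "0.30000000000000004"
const unanchored = /0.3/.test(sumText);      // true: matches the "0.3" prefix
const anchored = /^0\.3$/.test(sumText);     // false: the whole string differs
```

The unanchored case is a reminder that most surprises here come from how the pattern is written rather than from the number-to-string step.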

Any possible issue which may arise would really simply be dealing with understanding how regex works.

In contrast, consider how often a user is going to file an issue saying Why can't I validate a number??! It works with RegExp#test!.

As I mentioned, I'm not married to it, but there are a number of valid use cases, where it would be practical. This is especially true due to the fact that we've added error message to the proposal.

Consider this example:

/** 
  * In this example, Validation happens both on the language server and during runtime, which supports pure JS 
  * environments in addition to offering faster diagnostics with TS environment.
  *
  * Note: we also have a DRY, single definition for regex pattern and error message that is consumed by both the 
  * type system and runtime validators!
  */

class ThreadError extends Error {
  static validation = '^[1-9]$' as const;
  static message = 'You must have between 1-9 threads!' as const;
  constructor() {
    super(ThreadError.message);
  }
}

type ThreadCount = Validated<typeof ThreadError['validation'], '', number, typeof ThreadError['message']>

class Downloader {
  constructor(public threads: ThreadCount) {
    if (!Downloader.threadsValidator.test(threads)) throw new ThreadError();
  }

  static threadsValidator = new RegExp(ThreadError.validation);
}

In cases like these, you really don't want to require the threads parameter to be a string, and there is no way to cast a number literal to a string, so threads could not be validated in this case, which I think is an unnecessary drawback.

nonara commented 2 years ago

Don't get me wrong, I actually like the syntax and I feel that it's more TypeScript-ish than the current proposal

Not at all! I appreciate the dialog.

I've checked back on this thread over the past years now and then to see where things are at. What little I've seen seems to have simply stagnated at design questions. I'd love to get all of the questions out and answered so we can actually get it done. This problem is solvable, and my perspective at this point is that this proposal has the angles covered, but I still need any questions and gotchas to make sure any angle I've not seen is covered.

Otherwise, an indication of interest or 'meh' is good too. This seems valuable to me, especially with the messages, but n=1 isn't justifiable.

If anyone is interested in actually pushing this through, look closely, and hit me with anything you think will be a problem. Once all are answered, if a TS team member approves of the decisions (at least far enough to warrant reviewing a PR), I'll build it. The actual implementation, at this point, won't be too difficult, however there may be some added complexity if some parsing and validating of the regex string is required, but likely shouldn't be too bad.

I recall Ryan mentioning back-referencing being an issue, which is something I'd like to hear more about and discuss. If we need to cover for it, we can certainly cause it to pre-parse and fail under unacceptable circumstances.

ecyrbe commented 2 years ago

Hello,

Currently, template literal strings are limited to matching simple use cases. It seems that @RyanCavanaugh has concerns about backtracking with regex.

So what about adding some new template literal intrinsics like we currently have with Capitalize :

Here are some that could solve the issues raised so far:

type HEX = 
  "0" 
  | "1"
  | "2"
  | "3"
  | "4"
  | "5"
  | "6"
  | "7"
  | "8"
  | "9"
  | "a"
  | "b"
  | "c"
  | "d"
  | "e"
  | "f"
  | "A"
  | "B"
  | "C"
  | "D"
  | "E"
  | "F";

type CssColor = `#${Repeat<HEX,4> | Repeat<HEX,6>}`; // Repeat is an intrinsic that does not expand into combinatorial complexity because it internally creates a new type for the repeated pattern. same as /^#([0-9a-fA-F]{4}|[0-9a-fA-F]{6})$/

const color: CssColor = "#Affa35";

type Byte64 = `x${RepeatRange<HEX,1,16>}`; // same as /^x[0-9a-fA-F]{1,16}$/

const machineWord : Byte64 = 'xFFFF0000FFFF0000';
const canbesmaller: Byte64 = 'xF0';

// in fact Repeat and RepeatRange don't need to be intrinsics, they just need to be implemented with a string literal type erasure
// here is an example

type StringNumbers6 = `${number}-${number}-${number}-${number}-${number}-${number}`;
// no exponential complexity !!! why ? 
// because number is just `one` type, not a union like 0 | 1 | 2 ... | 9007199254740991
const test: StringNumbers6 = '10-11-12-13-14-15'; 

// so what if we could create a type erasure from our custom types ?
// meaning, transform '0' | ... | '9' | 'a' | ... | 'z' | 'A' | ... | 'Z' into an opaque type like `number` ?  

// Here is a proposal for a type erasure to limit exponential complexity :

type Repeat4<Pattern> = `${TypeMerge<Pattern>}${TypeMerge<Pattern>}${TypeMerge<Pattern>}${TypeMerge<Pattern>}`;
type Repeat6<Pattern> = `${TypeMerge<Pattern>}${TypeMerge<Pattern>}${TypeMerge<Pattern>}${TypeMerge<Pattern>}${TypeMerge<Pattern>}${TypeMerge<Pattern>}`;

type CssColor = `#${Repeat4<HEX>|Repeat6<HEX>}`;

// even better we could erase type at declaration : 
type HEX = TypeMerge<
  "0" 
  | "1"
  | "2"
  | "3"
  | "4"
  | "5"
  | "6"
  | "7"
  | "8"
  | "9"
  | "a"
  | "b"
  | "c"
  | "d"
  | "e"
  | "f"
  | "A"
  | "B"
  | "C"
  | "D"
  | "E"
  | "F">;

type Repeat4<Pattern> = `${HEX}${HEX}${HEX}${HEX}`;
type Repeat6<Pattern> = `${HEX}${HEX}${HEX}${HEX}${HEX}${HEX}`;
type CssColor = `#${Repeat4<HEX>|Repeat6<HEX>}`;

Griffork commented 2 years ago

I would very much like the ability to use regex to validate a range of different types: css colour names, UUIDs, rgb(a) colour hash values, max 255 character limit strings, etc. I need to be able to explicitly type a generic string to it and have implicit assignments be an error, but I do not need (though I would use it if available) the ability to type narrow in an if statement. At the moment I'm casting strings to any and then to a declaration-only class just to make sure that I can't assign to the variable by accident. I don't like this hack and I'd really like to replace it with a system that actually describes what the variable is.

My 2c:

castarco commented 2 years ago

I recall Ryan mention back-referencing being an issue, which is something I'd like to hear more about and discuss. If we need to cover for it, we can certainly cause it to pre-parse and fail under unacceptable circumstances.

I wonder if we could start with something simpler (without having to provide support to the complete regexp features set), actually most use cases that arise in real-life problems don't go much further than requiring +, *, ?, "start" ^, "end" $, groups ((|)) and character classes (maybe with character class negations).

I believe that adding back-references support is overshooting, and complicates the problem too much (same for lookaheads & lookbehinds). I fail to see clear benefits attached to that hypothetical effort (at best, they would be marginal if we compare them to what basic regexps would provide).

I'm not against the idea, but I think it shouldn't be a reason to slow down the development of a more basic feature (that everybody would love anyway, it would be awesome even if we had incomplete support, it's not like it is easy to find this feature in other languages, it would be probably a "first").

nonara commented 2 years ago

@castarco fwiw, it's really not a complexity issue. We wouldn't need to write a regex engine. We'd be able to rely on the existing node regex engine in the compiler. If there's an issue with backreferencing, we'd just need a simple parser which would add a diagnostic error under whatever circumstances were deemed unacceptable.

Overall, there wouldn't be much to implementing it. That said, it didn't seem like the interest was really there. Maybe someday!

Shinigami92 commented 2 years ago

it didn't seem like the interest was really there

What do you mean? This is literally already a newly created issue, because the old one was so long that you could scroll to hell :eyes:

nonara commented 2 years ago

@Shinigami92 My understanding is that this thread was created primarily because the team thought the core issues addressed in the original thread had been resolved (I believe due to adding the template literal type). However, they opened this thread as a side channel for anyone remaining who thought there was still value in continuing the discussion.

That said, I came here, made a proposal, and 'took the temperature', while making it clear if there was sufficient interest, I'd write it. I asked for feedback as to problems with the approach or indication of interest in pushing forward. There were no responses to my request for indication of interest. Over the months, it seems a few thumbs up accumulated, but even if those were to be considered a strong interest, the number is pretty low.

That in mind, I think it's reasonable to infer that there isn't a tremendous interest in it — which I agree is a shame. Maybe that's not true, and maybe it'd do better if the proposal was in a different thread. If there ever seems to be interest, I might look into it again! Hope that helps clear up what I meant! I don't want to spam the thread, so I'm going to bow out.

Etheryte commented 2 years ago

@nonara Personally I find the solution you proposed to be sufficiently different from the wider consensus in the original threads, so I wouldn't really conflate the popularity of the two ideas. Not to say that it's a bad approach, simply that it's very different to what most people have voiced their support for with their reactions.

nonara commented 2 years ago

@Etheryte I hear you. I think, in part, I expressed it poorly by leading with the complexity.

At the core it's:

// Original
type A = /^#([0-9a-f]{3}|[0-9a-f]{6})$/i

// My proposal
type B = Validated<'^#([0-9a-f]{3}|[0-9a-f]{6})$', 'i'>

The two statements, syntactically, have very little difference and are functionally equal. My proposal just adds ability to it in a way that I felt strengthened the proposal by broadening the use cases and solving multiple issues. I should have just led with the simple comparison.

What's really important though is why I chose the utility function.

Adding syntax to the language is more complex and much less likely to get approval. By going the route of a utility function, I believe it also affords a much higher likelihood of being merged. I began there and realized that because it was a utility function, it could have the added bonuses, solving the original problem more broadly and also addressing another major request for custom error messages.
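
For comparison, the hex-colour constraint from both snippets can be roughly approximated today with template literal and conditional types, without new syntax or an intrinsic. The sketch below is only an illustration: HexDigit, HexLength, IsCssHex and cssHex are hypothetical names, not part of TypeScript or of either proposal, and the check only applies to string literals the compiler can see.

```typescript
// All names here (HexDigit, HexLength, IsCssHex, cssHex) are hypothetical helpers.
type HexDigit =
  | "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
  | "a" | "b" | "c" | "d" | "e" | "f";

// Counts hex digits one character at a time; yields -1 if any character is not a hex digit.
type HexLength<S extends string, Count extends unknown[] = []> =
  S extends ""
    ? Count["length"]
    : S extends `${infer Head}${infer Rest}`
      ? Lowercase<Head> extends HexDigit
        ? HexLength<Rest, [...Count, unknown]>
        : -1
      : -1;

// True only for "#" followed by exactly 3 or 6 hex digits (case-insensitive).
type IsCssHex<S extends string> =
  S extends `#${infer Digits}`
    ? (HexLength<Digits> extends 3 | 6 ? true : false)
    : false;

// The parameter type collapses to `never` when the literal fails the check.
function cssHex<S extends string>(
  value: S & (IsCssHex<S> extends true ? unknown : never)
): S {
  return value;
}

// Type-level checks: each assignment only compiles if the computed type matches.
const threeDigits: IsCssHex<"#fff"> = true;
const sixDigits: IsCssHex<"#A1b2C3"> = true;
const wrongLength: IsCssHex<"#ffff"> = false;
const notHex: IsCssHex<"#ggg"> = false;

const white = cssHex("#fff");
// cssHex("#ggg"); // rejected: the argument is not assignable to 'never'
```

This works for this one pattern, but it has to be hand-written per pattern and produces an unhelpful error message, which is the gap both the regex literal syntax and the Validated utility are trying to close.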

shaedrich commented 2 years ago

Over the months, it seems a few thumbs up accumulated, but even if those were to be considered a strong interest, the number is pretty low. [...] That in mind, I think it's reasonable to infer that there isn't a tremendous interest in it — which I agree is a shame. Maybe that's not true [...] If there ever seems to be interest, I might look into it again! --@nonara

@nonara I've been following this thread for quite some time. Thanks for sharing your suggestions with us. I agree that such a type would be tremendously useful, yet it's rather underwhelming that the community is so divided about this topic and can hardly agree on anything addressing this.

castarco commented 2 years ago

@nonara Regarding "intrinsic utility functions" ... maybe we could go "wilder", and define "const-evaluable" functions that allowed us to impose constraints on any kind of literal value to define narrower types (not just regexp for strings, but arbitrary constraints applied to other literal values, like numbers).

The same way regexps could be easily used inside type definitions because it's difficult to misinterpret their meaning... we could do the same for functions with a specific "shape" ( ~literal~ ~primitive~ (number | string) -> bool ).

I imagine something like...

// Only "pure" functions without refs to external non-constant values would be accepted
type Even = (n: number) => (n % 2 == 0)
type Email = (s: string) => bool(/gigantic_email_regexp/.test(s))

// Probably better using your idea of having a specific utility type:
type Even = Validated<(n: number) => (n % 2 == 0)>
type Email = Validated<(s: string) => /gigantic_email_regexp/.test(s)>

// EDIT 2: A third option probably would be better
// ------------------------------------------------------------------

// First, to have independently defined functions that match a very strict signature
// (string|number) => bool
const isEven = (n: number) => 0 == n % 2 // So we can reuse it later at runtime
type Even = FromTypeGuard<isEven>

// The previous type definition would make `isEven` be treated as if it had been defined as
// a user-defined type guard (as of today, we can't do that because `Even` is not defined
// prior to the function). So isEven would be interpreted as if this was its definition:
function isEven(n: number): n is Even {
  return 0 == n % 2
}

I guess it's evident, but just in case, my idea would be that returning true means that the value belongs to the type, and returning false the opposite.

Actually, although it's not that "powerful" when it comes to combining string templates & regexps... it would be (probably) simpler to understand and learn.

Regarding the "const-evaluable" functions, I'm not sure if Typescript already relies on this concept. I know that we can define type guard functions, but I'm not sure if Typescript imposes any constraint on the functions to be evaluable at "compile-time". Sadly, we can't easily mark such functions as we can do in other languages like Rust, although I don't think this would pose a big problem.

EDIT: I think this more generic approach would play well with how TS works. In a way, it feels like a nice generalization of structural typing, or a deepening of it, by going from high-level structures' definitions into non-trivial primitive values' internal structures.

Shinigami92 commented 2 years ago

Most to all provided ideas / examples would introduce breaking changes or at least breaking design changes into TS.

type ColorHex = /^#([0-9a-f]{3}|[0-9a-f]{6})$/i

It's not possible to use complex types directly. We would expect const c: ColorHex = '#000' to be valid. But why would const c: ColorHex = /^#([0-9a-f]{3}|[0-9a-f]{6})$/i not be valid?! That's a no-go.


type Even = (n: number) => (n % 2 == 0)

This would be an issue because it cannot be told apart from a function callback type. It is too close to something like type Even = (n: number) => () and could theoretically be read as something like type Even = (n: number) => void


A special type like Validated would also not work that way, because why? Why would one type suddenly be so special as to take a regex as a generic argument? And won't the generic argument be the type itself? So at the least we would need a keyword instead of a type.


So I would like to see something like

type ColorHex = pattern(/^#([0-9a-f]{3}|[0-9a-f]{6})$/i)
// or
type ColorHex = const(/^#([0-9a-f]{3}|[0-9a-f]{6})$/i)

type Even = pattern((n: number) => (n % 2 == 0))
type Even = const((n: number) => (n % 2 == 0))

Please also keep in mind that this would ONLY be checked at compile time and not at runtime! For runtime we already have type guards!

Please correct me if I still see something wrong :thinking:

ghost commented 2 years ago

Hmm agree with the logic there, also I suggested matchof in https://github.com/microsoft/TypeScript/issues/41160#issuecomment-853317957 without any ()/<> notation

nonara commented 2 years ago

Thanks for the conversation, all! It's great to see so many analytical minds working through the problem.

The field of compilers, language semantics, and type systems is certainly fascinating! It is very specialized, but it is worth learning. I highly recommend anyone interested in this sort of thing to study further! The TS compiler is a great place to learn.

That said, I think it's a good point for me to unsubscribe from the thread. With absolutely no disrespect intended, it's worth noting that most of the concerns that are being raised (eg. floating point arithmetic) lack the foundational understanding of the field and/or the compiler's architecture.

Again, that's not meant to be an insult. It took me several years of working in the compiler to feel comfortable adding to it, much less to feel I had even a somewhat reasonable grasp on the subject. Even now, I'm sure I'm very far from where the core team is at, but it's a process and we learn as we work!

I'll address a few last things before "signing off".

@castarco

Regarding "intrinsic utility functions" ... maybe we could go "wilder", and define "const-evaluable" functions

Interesting idea! I think overloading the function signature would be a difficult sell, but it's a cool thought.

@Shinigami92

Most to all provided ideas / examples would introduce breaking changes or at least breaking design changes into TS.

Neither proposal introduces a change which would break or alter previous behaviour. As it is a new feature, however, any implementation can be expected not to work on previous versions of TS.

It's not possible to use complex types directly. We would expect const c: ColorHex = '#000' to be valid. But why would const c: ColorHex = /^#([0-9a-f]{3}|[0-9a-f]{6})$/i not be valid?! That's a no-go.

That would not be an issue for the compiler.

To answer your question, it would not be valid because '#000' is a StringLiteral and /^#([0-9a-f]{3}|[0-9a-f]{6})$/i is a RegularExpressionLiteral. We would be expecting a StringLiteral, so the compiler would produce a diagnostic error in that case.

A special type like Validated would also not work that way, because why? Why would one type suddenly be so special as to take a regex as a generic argument?

We wouldn't be taking regex as an argument in my proposal, though it wouldn't be a major hurdle if we were.

The syntax is Validated<StringLiteral, StringLiteral>, where the regex would be in a string.

And won't the generic argument be the type itself?

The purpose of a utility function is to produce a final Type. In this case, it would produce a new type, which I detailed a bit in my proposal. The specifics of that and the mechanism would require knowledge on the compiler's codebase, so I won't get into it, but it wouldn't be anything novel.

Suffice it to say, I've already written a much more involved and complicated intrinsic utility function, so I can assure that the proposed method would work.


With that, thanks again to all for the discussion!

I think in the future, for any additions to the compiler, I'll just write the PR and open up the floor for discussion after. That way people can see it working first, and I can directly engage with the team on any changes needed. I don't know if I'll ever do this one, but I do have a pretty major one planned that I'm very excited about. Hoping I can get it done sometime next year, but that will depend on if I can free up a couple months to devote to it.

Take care!

mcdanieladamg commented 2 years ago

Can't we just scale back the whole concept of Validated<T> to just be some string (or any class) that doesn't automatically upcast to the parent class (i.e. string) when Typescript compiles it? Something like:

type StringifiedJsonObject = Validated<string>;
// Special new behavior, so this will not automatically allow cast between other types which were
// also declared as Validated<string>, only types declared as StringifiedJsonObject or
// Validated<StringifiedJsonObject> because Validated<T> automatically has a parent type of T.
..
function myCoolFunction(a: StringifiedJsonObject) { return JSON.parse(a); }
..
myCoolFunction("Hello"); //Compile Error: cannot cast type string to Validated<string>
..
function someFunctionAnywhereInCodeToCheckValidated(a: string): a is StringifiedJsonObject { return true; } //(regex here)
..
let myStr: string = "{}";
if (someFunctionAnywhereInCodeToCheckValidated(myStr)) {
  myCoolFunction(myStr); //No error because myStr is now StringifiedJsonObject
  console.log("My string was: " + myStr); //Also casts to a string (or any base type indicated in Validated<TBaseClass>)
}

That way you just rely on the developer to do the validation (i.e. offer validation functions somewhere for the new type) using regex or whatever validation they want, and it allows you to dynamically create a new subclass type (extending T) that won't just be upcast automatically by the compiler (because of no compiler-visible field differences).

Btw as a side note: T should in practice be only immutable classes, like a string is, or the "validated" state means nothing. Let me know if I should propose this as a separate feature suggestion.

yume-chan commented 2 years ago

@mcdanieladamg Already possible. See #4895
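
For readers who don't want to dig through the linked issue, one common way to get this behaviour today is a "branded" string type combined with a user-defined type guard. This is a minimal sketch rather than anything the compiler provides: the __brand property and the function names are made up for illustration, and the brand exists only in the type system, never at runtime.

```typescript
// Hypothetical brand: the extra property makes plain strings unassignable,
// but no such property is ever attached to the value at runtime.
type StringifiedJsonObject = string & { readonly __brand: "StringifiedJsonObject" };

// The only sanctioned way to obtain the branded type is to pass this runtime check.
function isStringifiedJsonObject(value: string): value is StringifiedJsonObject {
  try {
    const parsed: unknown = JSON.parse(value);
    return typeof parsed === "object" && parsed !== null && !Array.isArray(parsed);
  } catch {
    return false;
  }
}

// Accepts only strings that have been validated (or explicitly cast).
function parseObject(value: StringifiedJsonObject): object {
  return JSON.parse(value);
}

const raw: string = '{"a": 1}';
if (isStringifiedJsonObject(raw)) {
  parseObject(raw);          // accepted: raw is narrowed to StringifiedJsonObject
  console.log("My string was: " + raw); // still usable wherever a string is expected
}
// parseObject("Hello");     // compile error: a plain string lacks the brand
```

The validation itself stays entirely at runtime, so the compiler trusts the guard rather than checking literals against it.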

landon-at-faro commented 2 years ago

Would be slick if I could have this in my API:

type PathVariablesOf<Path extends string> = {
  [PathVariable in MatchesOf<Path, /:(\w+)/g>]: string;
}

function getPathVariables<T extends string>(path: T, locationPath: string): PathVariablesOf<T> {
  // ...
}

const pathVariables = getPathVariables('/path/:a/variables/:b', '...');
// return type is : {a: string, b: string}
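
For this particular shape of path, something close is already expressible with template literal types, at the cost of hand-writing the parsing instead of reusing a regex. The following is a sketch under that assumption: PathVariableNames is a hypothetical helper, variables are assumed to run up to the next "/", and the runtime body is just one possible implementation.

```typescript
// Hypothetical helper: collects every ":name" segment of a "/"-separated path.
type PathVariableNames<Path extends string> =
  Path extends `${string}:${infer Name}/${infer Rest}`
    ? Name | PathVariableNames<`/${Rest}`>
    : Path extends `${string}:${infer Name}`
      ? Name
      : never;

type PathVariablesOf<Path extends string> = {
  [PathVariable in PathVariableNames<Path>]: string;
};

function getPathVariables<T extends string>(path: T, locationPath: string): PathVariablesOf<T> {
  const patternParts = path.split("/");
  const locationParts = locationPath.split("/");
  const result: Record<string, string> = {};
  patternParts.forEach((part, index) => {
    if (part.startsWith(":")) {
      // Segments missing from the location fall back to an empty string.
      result[part.slice(1)] = locationParts[index] ?? "";
    }
  });
  // The compiler cannot relate the runtime loop to the type-level parse,
  // so the result is asserted rather than checked.
  return result as unknown as PathVariablesOf<T>;
}

const pathVariables = getPathVariables("/path/:a/variables/:b", "/path/1/variables/2");
console.log(pathVariables.a, pathVariables.b);
```

What a regex-based MatchesOf would add is the ability to restrict what counts as a variable name (for example \w+ only) without writing a character-by-character type.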

Mutefish0 commented 2 years ago

Here is my workaround: https://github.com/microsoft/TypeScript/issues/6579#issuecomment-1133326628, but there is a limitation, as fabiospampinato points out in https://github.com/microsoft/TypeScript/issues/41160#issuecomment-831846373

tarwin commented 2 years ago

I was trying to make a type that reflected MySQL datetime string values ie "2022-07-31 23:11:54".

Interestingly, you can almost do it currently, but if you add any more specificity it will end up either being any or complaining that it can't add more typing. I think there is a limit to the number of typings it can create?

If there was a RegExp way of doing it, that'd be nice, but really this is a kind of silly ask of the language maybe? I also don't know how it would work with the internals.

type OneToNine = 1|2|3|4|5|6|7|8|9
type ZeroToNine = 0|1|2|3|4|5|6|7|8|9

type DateTimeType = `${
  `${number}`
}-${
  `0${OneToNine}` | `1${0|1|2}`
}-${
  `0${OneToNine}` | `1${ZeroToNine}` | `2${ZeroToNine}` | `3${0|1}`
} ${
  `0${ZeroToNine}` | `1${ZeroToNine}` | `2${0|1|2|3}`
}:${number}:${number}`