microsoft / TypeScript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
https://www.typescriptlang.org
Apache License 2.0

Type named capture groups better #32098

Open Jamesernator opened 5 years ago

Jamesernator commented 5 years ago

Search Terms

named regexp, named capture groups

Motivation

Currently named capture groups are a bit of a pain in TypeScript:

  1. All names are of type string even if the regexp doesn't have that named group.
  2. You need to use non-null assertion for .groups even when that is the only possibility.
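A minimal sketch of both pain points as they appear with the standard library typings. The function name `readGroups` and the misspelled key `yaer` are made up for illustration; the return type is written as `string | undefined` so the sketch compiles regardless of stricter index-access compiler options.

```typescript
const date = /(?<year>[0-9]{4})-(?<month>[0-9]{2})/;

function readGroups(
  input: string
): { year: string | undefined; typo: string | undefined } | null {
  const match = input.match(date);
  if (match === null) return null;
  // `.groups` is declared optional, so it has to be asserted or checked,
  // even though this pattern always produces it on a successful match.
  const groups = match.groups!;
  return {
    year: groups["year"],
    // The misspelled name is accepted by the index signature, but no such
    // group exists in the pattern, so the value is undefined at runtime.
    typo: groups["yaer"],
  };
}
```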

Suggestion

I propose making RegExp higher order on its named capture groups so that .groups is well typed.

// Would have type: RegExp<{ year: string, month: string }>
const date = /(?<year>[0-9]{4})-(?<month>[0-9]{2})/

const match = someString.match(date)
if (match) {
  // match.groups type would be { year: string, month: string }
  // currently is undefined | { [key: string]: string }
}

// Would have type RegExp<{ year: string, month?: string }>
const optionalMonth = /(?<year>[0-9]{4})(-(?<month>[0-9]{2}))?/
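To illustrate why the optional group would need `month?: string`, here is a small sketch of the runtime behaviour with today's typings. The `parseDate` helper is hypothetical and only there to make the two outcomes observable.

```typescript
const optionalMonthPattern = /(?<year>[0-9]{4})(-(?<month>[0-9]{2}))?/;

// Hypothetical helper: returns the shape the proposal would infer.
function parseDate(input: string): { year: string; month?: string } | null {
  const groups = optionalMonthPattern.exec(input)?.groups;
  if (groups === undefined) return null;
  const { year, month } = groups;
  if (year === undefined) return null;
  // When the optional part did not participate in the match,
  // `month` is undefined at runtime even though it is typed as string.
  return month === undefined ? { year } : { year, month };
}
```

Calling `parseDate("2019")` yields an object without a `month` property, while `parseDate("2019-06")` carries both groups.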

fatcerberus commented 5 years ago

I could have sworn there was already another ticket discussing this exact issue at length (there were a lot of tradeoffs etc.), but I can’t find it now.

RyanCavanaugh commented 5 years ago

There may have been one for group arity?

weswigham commented 5 years ago

It would be kinda neat with this to make types like RegExp</(?<year>[0-9]{4})-(?<month>[0-9]{2})/, { year: /[0-9]{4}/, month: /[0-9]{2}/ }>.

fatcerberus commented 5 years ago

> I propose making RegExp higher order on its named capture groups so that .groups is well typed.

This sounds like dependent typing to me, which is a rather large can of worms to open.

Jamesernator commented 5 years ago

> This sounds like dependent typing to me, which is a rather large can of worms to open.

I don't think so. I think it would be as simple as something like:

interface Match<G extends { [key: string]: string } | undefined> {
  // Note that G is only undefined when the RegExp has no named capture groups
  groups: G,
}

interface RegExp<G extends { [key: string]: string } | undefined = undefined> {
  exec(s: string): null | Match<G>,
}

// and so on for String .match/.matchAll/etc

Note that all of the type information is already encoded in the regular expression literal: /(?<foo>[0-9]+)/ implies that .groups is { foo: string } if the match actually exists.

EDIT: Fixed code.

nmain commented 5 years ago

Might be nice to add the same stronger typing for numbered groups at the same time, so that

"foo".match(/(f)(oo)/)

Would only have valid indexers [0], [1], and [2]
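Until something like that exists, the arity can be stated by hand. The sketch below uses a hypothetical `execTuple` helper that casts the result of `RegExp.prototype.exec` to a caller-supplied tuple type; nothing checks that the tuple really matches the pattern, so this is an assumption the caller makes, not inference.

```typescript
// Hypothetical helper: the caller asserts the shape of the match.
// Groups that can fail to participate should be typed `string | undefined`.
function execTuple<T extends (string | undefined)[]>(
  re: RegExp,
  input: string
): T | null {
  return re.exec(input) as unknown as T | null;
}

const parts = execTuple<[whole: string, first: string, second: string]>(
  /(f)(oo)/,
  "foo"
);
if (parts !== null) {
  console.log(parts[0], parts[1], parts[2]);
  // parts[3] would be a compile-time error: the tuple has only three elements
}
```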

felixfbecker commented 5 years ago

There is an ESLint rule that enforces the use of named capture groups to avoid bugs & improve readability: https://eslint.org/docs/rules/prefer-named-capture-group

Paired with this feature it would be amazing.

Jessidhia commented 5 years ago

These helper methods kinda smell "any-ish" because of how their generic is used but they should be safer and easier to use than trying to directly read from .groups or having to write your own argument handling for String#replace. I'm not sure TypeScript itself will ever be able to have a proper type for String#replace because of how... extremely variadic it is; it's probably not impossible with slice types, but it'd require two generics on every RegExp to be able to know how many capturing groups there are in total (both named and unnamed).

/**
 * Wrapper for functions to be given to `String.prototype.replace`, to make working
 * with named captures easier and more type-safe.
 *
 * @template T the capturing groups expected from the regexp. `string` keys are named,
 *   `number` keys are ordered captures. Note that named captures occupy their place
 *   in the capture order.
 * @param replacer The function to be wrapped. The first argument will have the
 *   shape of `T`, and its result will be forwarded to `String.prototype.replace`.
 */
export function named<T extends Partial<Record<string | number, string>> = {}>(
  replacer: (
    captures: { 0: string } & T,
    index: number,
    original: string
  ) => string
) {
  const namedCapturesWrapper: (match: string, ...rest: any[]) => string = (
    ...args
  ) => {
    const { length } = args
    const named: string | Partial<Record<string, string>> = args[length - 1]
    const captures: { 0: string } & T = Object.create(null)
    if (typeof named === "string") {
      // the regexp used does not use named captures at all
      args.slice(0, -2).forEach((value, index) => {
        Object.defineProperty(captures, index, {
          configurable: true,
          writable: true,
          value
        })
      })
      return replacer(captures, args[length - 2], named)
    }
    // the regexp has named captures; copy named own properties to captures,
    // then copy the numeric matches.
    Object.assign(captures, named)
    args.slice(0, -3).forEach((value, index) => {
      if (index in captures) {
        throw new RangeError(
          `Numeric name ${index} used as a regexp capture name`
        )
      }
      Object.defineProperty(captures, index, {
        configurable: true,
        writable: true,
        value
      })
    })
    return replacer(captures, args[length - 3], args[length - 2])
  }
  return namedCapturesWrapper
}

// the first overload is here to preserve refinements if `null` was already
// checked for and excluded from the type of exec/match result.
/**
 * Helper to extract the named capturing groups from the result of
 * `RegExp.prototype.exec` or `String.prototype.match`.
 *
 * @template T type definition for the available capturing groups
 * @param result the result of `RegExp.prototype.exec` or `String.prototype.match`
 * @returns the contents of the `.groups` property but typed as `T`
 * @throws if `.groups` is `undefined`; this only happens on regexps without named captures
 */
export function groups<T extends Partial<Record<string, string>> = {}>(
  result: RegExpMatchArray | RegExpExecArray
): T
/**
 * Helper to extract the named capturing groups from the result of
 * `RegExp.prototype.exec` or `String.prototype.match`.
 *
 * @template T type definition for the available capturing groups
 * @param result the result of `RegExp.prototype.exec` or `String.prototype.match`
 * @returns the contents of the `.groups` property but typed as `T`, or `null` if
 *   there was no match
 * @throws if `.groups` is `undefined`; this only happens on regexps without named captures
 */
export function groups<T extends Partial<Record<string, string>> = {}>(
  result: RegExpMatchArray | RegExpExecArray | null
): T | null
/**
 * Helper to extract the named capturing groups from the result of
 * `RegExp.prototype.exec` or `String.prototype.match`.
 *
 * @template T type definition for the available capturing groups
 * @param result the result of `RegExp.prototype.exec` or `String.prototype.match`
 * @returns the contents of the `.groups` property but typed as `T`, or `null` if
 *   there was no match
 * @throws if `.groups` is `undefined`; this only happens on regexps without named captures
 */
export function groups<T extends Partial<Record<string, string>> = {}>(
  result: RegExpMatchArray | RegExpExecArray | null
): T | null {
  if (result === null) {
    return null
  }
  if (result.groups === undefined) {
    throw new RangeError(
      "Attempted to read the named captures of a Regexp without named captures"
    )
  }
  return result.groups as T
}

There might be no need to copy the numeric captures, though; I just made them be copied because it seemed to make sense to put the matched substring in 0 instead of moving to a separate argument.
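For readers unfamiliar with why `String#replace` is hard to type, this sketch shows the raw argument layout that the `named` wrapper above has to untangle. It assumes an ES2018+ engine, where the groups object is passed as the last argument when the pattern has named groups; `swapDate` is a made-up name.

```typescript
function swapDate(input: string): string {
  return input.replace(
    /(?<year>[0-9]{4})-(?<month>[0-9]{2})/,
    (...args: any[]) => {
      // Layout for this pattern:
      //   [match, p1, p2, offset, wholeString, groups]
      // The number of pN entries depends on the pattern, which is
      // what makes a precise static signature difficult.
      const groups = args[args.length - 1] as { year: string; month: string };
      return `${groups.month}/${groups.year}`;
    }
  );
}
```

The cast on `groups` is exactly the kind of unchecked assumption that the helpers in this comment try to centralise in one place.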

Bessonov commented 4 years ago

I have an overall problem with the RegExp definition and with the definitions of objects and arrays. From my point of view, allowing something like:

const x: {[x: string]: string} = {}
const y = x['foo'] // <= y is a string here
console.log(y.length)
> Uncaught TypeError: Cannot read property 'length' of undefined

Same for arrays, but, well, this one is very surprising:

const a: string[] = []
const b = a[0] // <= string - why why why?
const c = a.pop() // <= string | undefined

// and other way:

const a: [string] = ['foo']
const b = a[0] // <= string
const c = a.pop() // <= string | undefined - why why why? TS can infer from `if`, but not here?

is a big misconception for the sake of convenience. It reduces one of the greatest type systems to absurdity. But I'm sure the core team has a different opinion on that, unfortunately.
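For context, TypeScript 4.1 added the `noUncheckedIndexedAccess` compiler option, which makes index-signature and array element reads include `undefined`. Without that option, a similar effect can be had by declaring it explicitly, as in this sketch (the names `safeLength` and `firstOrDefault` are illustrative only):

```typescript
// Index signature whose values admit `undefined` explicitly.
const lookup: { [key: string]: string | undefined } = {};
const missing = lookup["foo"]; // string | undefined
// Optional chaining is now required by the compiler before using the value.
const safeLength = missing?.length ?? 0;

// For arrays, annotate the element read as possibly undefined.
function firstOrDefault(items: string[], fallback: string): string {
  const first: string | undefined = items[0];
  return first ?? fallback;
}
```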

Based on the statements above, the definition for RegExpMatch* isn't helpful:

interface RegExpMatchArray {
    groups?: {
        [key: string]: string
    }
}

interface RegExpExecArray {
    groups?: {
        [key: string]: string
    }
}

Inferring types from a regular expression is possible (from my point of view), but very complex. Instead, I would like to see more developer support to make it type safe (pseudo code):

type RegExpMatch = {
  [key: number]: string | undefined,
  groups?: {
    [key: string]: string | undefined
  }
}

interface RegExp<T extends RegExpMatch> {
   exec(string: string): T | null;
}

To make it more type safe:

const regexp = new RegExp<{0: string, groups: {foo: string}}>('^/(?<foo>[^/]+)$')
const result = regexp.exec('/bar')
if (result !== null) {
  // now you get the typings here
  result[0] // <= string
  result[1] // <= string | undefined (or may be never?)
  result.groups.foo // <= string
  result.groups.test // <= string | undefined (or may be never?)
}

If the developer makes a mistake in the typings, well, that's OK. It's still better than allowing everything.

A little bit related: https://github.com/Microsoft/TypeScript/issues/6579

dolsem commented 4 years ago

I think it'd be great to implement this alongside #38671, so that generic regexes keep their current typing, but regex literals have strongly typed capturing groups.

const re1 = /(?<year>[0-9]{4})-(?<month>[0-9]{2})/;
type Groups1 = ReturnType<typeof re1.exec>['groups']; // Remains Record<string, string>

const re2 = /(?<year>[0-9]{4})-(?<month>[0-9]{2})/ as const;
type Groups2 = ReturnType<typeof re2.exec>['groups']; // Would be { year: string, month: string }

And generalize them so that:

type hasYearAndMonth<T extends Regex> = T extends Regex<'year'|'month'> ? true : false;
const re1 = /(?<year>[0-9]{4})/ as const;
const re2 = /(?<year>[0-9]{4})-(?<month>[0-9]{2})/ as const;
type T1 = hasYearAndMonth<typeof re1>; // false
type T2 = hasYearAndMonth<typeof re2>; // true

acutmore commented 3 years ago

I really like the idea of extracting the named group static type information from RegExp literals. I'm curious how people imagine this 'metadata' would be associated with the RegExp literal before it's passed to match? As the RegExp instance itself has no structure that reveals its groups. Is there already an existing pattern in TypeScript's core type definitions for this?

interface RegExpWithGroups<G extends { [name: string]: string }> extends RegExp {
    __secret_groups_metadata__: G // don't actually try and access me this is 'hidden' type-only Metadata
}

const reg = /(?<FirstFour>.{4})(?<NextFour>.{4})/ as const;
//    ^^^ RegExpWithGroups<{ FirstFour: string, NextFour: string }>

EDIT: ah ignore me. I re-read the thread properly and realized we don't need to expose it.

webstrand commented 3 years ago

The RegExp could be represented as a literal in the type system, i.e.

function route(re: /(?<FirstFour>.{4})(?<NextFour>.{4})?/s) {
    re.dotAll // true
    const match = str.match(re);
    if(match === null) return;

    match[0] // string

    match.groups.FirstFour // string
    match[1] // string

    match.groups.NextFour // string | undefined
    match[2] // string | undefined
}

That'd be a lot less verbose than Regex<{ FirstFour: string, NextFour: string | undefined }>. And it'd be possible (though I don't know if it'd be efficient) to track dependencies between capture groups, for example: /(?<A>foo(?<B>bar))|(?<C>baz)/ would have the effective type Regex<{ A: string, B: string, C: undefined } | { A: undefined, B: undefined, C: string }>.
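A sketch of how such a union could be consumed, with the union written out by hand and applied via a cast, since no inference exists today. The type name `AltGroups` and the function `classify` are hypothetical.

```typescript
// Hand-written stand-in for the type the compiler would ideally infer.
type AltGroups =
  | { A: string; B: string; C: undefined }
  | { A: undefined; B: undefined; C: string };

function classify(input: string): string | null {
  const match = /(?<A>foo(?<B>bar))|(?<C>baz)/.exec(input);
  if (match === null) return null;
  const groups = match.groups as unknown as AltGroups;
  if (groups.A !== undefined) {
    // First alternative matched: A and B are both set, C is not.
    return `A=${groups.A}, B=${groups.B}`;
  }
  // Second alternative matched: only C is set.
  return `C=${groups.C}`;
}
```

Checking one group is enough to know the state of the others, which is the dependency tracking described above.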

Thankfully JavaScript doesn't have branch reset groups.

hlovdal commented 3 years ago

For everyone who wants type safety and autocompletion on the groups part right now: you can declare the variable that stores the match result with the RegExpMatchArrayWithGroups type below, like this:

const output = 'hello_world.ts:13412:Missing ;.';
const m: RegExpMatchArrayWithGroups<{file: string, line: string; error: string}>
    = output.match(/^(?<file>[^:]+):(?<line>[^:]+):(?<error>.*)/);
if (m && m.groups) {
    // f: "hello_world.ts", l: "13412", e: "Missing ;."
    console.log('f: "' + m.groups.file + '", l: "' + m.groups.line + '", e: "' + m.groups.error + '"');

    // console.log(m.groups.filename); 
    // Property 'filename' does not exist on type '{ file: string; line: string; error: string; }'
}

The RegExpMatchArrayWithGroups type needs a type argument where all the group names in the regex are duplicated, so there is no automatic parsing from it, but when things are in the same statement this should be quite maintainable.

Definitions:

type RegExpMatchArrayWithGroupsOnly<T> = {
    groups?: {
        // eslint-disable-next-line no-unused-vars
        [key in keyof T]: string;
    }
}
type RegExpMatchArrayWithGroups<T> = (RegExpMatchArray & RegExpMatchArrayWithGroupsOnly<T>) | null;

Pyrolistical commented 2 years ago

Thanks @hlovdal. I took your idea and made it easier to use.

type RegExpGroups<T extends string[]> =
  | (RegExpMatchArray & {
      groups?:
        | {
            [name in T[number]]: string;
          }
        | {
            [key: string]: string;
          };
    })
  | null;

const output = "hello_world.ts:13412:Missing ;.";
const match: RegExpGroups<["file", "line", "error"]> = output.match(
  /^(?<file>[^:]+):(?<line>[^:]+):(?<error>.*)/
);
if (match) {
  const { file, line, error } = match.groups!;
  console.log({ file, line, error });
}

acnebs commented 1 year ago

I personally prefer just using a Union type versus using an array of strings (saves having to type square brackets if you only have one group), so mine looks like this:

export type RegExpGroups<T extends string> =
  | (RegExpMatchArray & {
      groups?: { [name in T]: string } | { [key: string]: string };
    })
  | null;

Usage: const match: RegExpGroups<'file' | 'line' | 'error'> = ...

wmertens commented 1 year ago

Would it not be possible for TS to fully grok regexes instead of having to type them manually?

If there's a named group, groups will exist and have the name as a member, and if the match is optional it will be | undefined.

Since regexes are an integral part of JS, kinda makes sense?
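As a rough illustration that the names can be recovered at the type level, here is a sketch using template literal types (available since TypeScript 4.1). It works on pattern strings rather than regex literals, because literals currently have the plain type `RegExp`. The names `GroupNames`, `GroupsOf`, and `execTyped` are invented for this example, and the matcher is deliberately naive.

```typescript
// Collects the names of all `(?<name>...)` groups in a pattern string.
// Assumption: the pattern contains no lookbehind (`(?<=`, `(?<!`) and no
// escaped `\(?<` sequences, which this naive matcher would misread.
type GroupNames<P extends string> =
  P extends `${string}(?<${infer Name}>${infer Rest}`
    ? Name | GroupNames<Rest>
    : never;

// Every group may be unmatched, so each value is `string | undefined`.
type GroupsOf<P extends string> = { [K in GroupNames<P>]: string | undefined };

// Hypothetical helper: runs the pattern and returns typed groups.
function execTyped<P extends string>(pattern: P, input: string): GroupsOf<P> | null {
  const match = new RegExp(pattern).exec(input);
  if (match === null) return null;
  return (match.groups ?? {}) as unknown as GroupsOf<P>;
}

const parsed = execTyped("(?<year>[0-9]{4})(-(?<month>[0-9]{2}))?", "2019");
if (parsed !== null) {
  console.log(parsed.year);  // typed as string | undefined
  console.log(parsed.month); // typed as string | undefined
  // parsed.day would be a compile-time error: no such group in the pattern
}
```

This does not determine which groups are optional, so every group is conservatively typed as possibly undefined.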

thesoftwarephilosopher commented 1 month ago

VS Code already highlights the group names inside a regex, so surely it's not hard to tokenize them inside TS and attach the names as part of the regex's type?

const r = /foo(?<qux>bar)/;

// current inference
r as RegExp

// can't it be inferred as this?
r as RegExp<{ qux: string }>

thesoftwarephilosopher commented 1 month ago

I'd be glad to put effort towards that feature if there's not already work being done towards it, as long as someone can help me figure out where to get started in the codebase. (I've dug into it before, but that was like 8 years ago.)

bgmort commented 3 weeks ago

An iteration on the ideas from @hlovdal and others above, to be able to apply the type directly to the regex:

type RegExpMatchWithGroups<T extends string> = null | (Omit<RegExpExecArray, 'groups'> & { groups: { [name in T]: string | undefined } })

type RegExpWithGroups<T extends string> = Omit<RegExp, 'exec'> & {
  exec(str: string): RegExpMatchWithGroups<T> | null
}

const lineMatcher = /^(?<file>[^:]+):(?<line>[^:]+):(?<error>.*)/ as RegExpWithGroups<'file' | 'line' | 'error'>

const match = lineMatcher.exec('hello_world.ts:13412:Missing ;.')
if (match) {
  const {file, line, error} = match.groups
  console.log(file, line, error)
}

thesoftwarephilosopher commented 3 weeks ago

I've cloned TypeScript and am starting to look into how I might implement this.

Any clues to where I should start looking would be appreciated to help speed it up.

In the meantime, here's the simple function I'm using in imlib:

// helper
function matcher<T extends string>(regex: RegExp) {
  return (str: string) => {
    return str.match(regex)?.groups as { [key in T]: string };
  }
}

// examples
const isArrayFile = matcher<'ext' | 'slug'>(/\/.*(?<slug>\[.+\]).*(?<ext>\..+)\.js$/);
const isSingleFile = matcher<'ext'>(/(?<ext>\..+)\.js$/);

// usage
if (match = isArrayFile(file.path)) {
  match.ext // string
  match.slug // string
}
else if (match = isSingleFile(file.path)) {
  match.ext // string
}
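A possible variant of the helper above, sketched under the assumption that callers want the no-match case visible in the type. It adds `| undefined` to the return type and declares the result before testing it; `matcherOrUndefined` and `describePath` are illustrative names, not part of imlib.

```typescript
// Same idea as `matcher`, but the return type admits the no-match case.
function matcherOrUndefined<T extends string>(regex: RegExp) {
  return (str: string): { [key in T]: string } | undefined =>
    str.match(regex)?.groups as { [key in T]: string } | undefined;
}

const matchSingleFile = matcherOrUndefined<"ext">(/(?<ext>\..+)\.js$/);

function describePath(path: string): string {
  const single = matchSingleFile(path);
  if (single) {
    // Narrowed to { ext: string } inside this branch.
    return `single file with extension ${single.ext}`;
  }
  return "no match";
}
```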