Support a type representing any literal string, a la Python's LiteralString type

ethanresnick commented 2 years ago

🔍 Search Terms

literal string, xss, sql injection, security, user input handling

✅ Viability Checklist

My suggestion meets these guidelines:

[x] This wouldn't be a breaking change in existing TypeScript/JavaScript code
- Mostly satisfied: some tiny number of existing programs might break depending on the name chosen for the type.
[x] This wouldn't change the runtime behavior of existing JavaScript code
[x] This could be implemented without emitting different JS based on the types of the expressions
[x] This isn't a runtime feature (e.g. library functionality, non-ECMAScript syntax with JavaScript output, new syntax sugar for JS, etc.)
[x] This feature would agree with the rest of TypeScript's Design Goals.

⭐ Suggestion + Motivating Example

The idea is to add a built-in type called LiteralString, which would be the supertype of all literal string types. I.e., LiteralString is inhabited by all the subtypes of string, excluding string itself and template string types that contain string. In addition to introducing this type, TS would be more careful about tracking whether a string has a literal type (e.g., when two strings with literal types are concatenated with +, the result would remain a literal type, rather than becoming string).

The motivation here is to allow the type system to check that certain security-sensitive strings haven't been unsafely manipulated by user-controlled input. For example, one could write a function like queryDb(query: LiteralString, params?: unknown[]): Promise<Results> to enforce that the query string does not have any values interpolated into it that could've been user-controlled and created SQL injection vulnerabilities. The idea is that the value from user input would’ve had to be typed as string, which can’t be mixed into a LiteralString without producing a string, which would then not be an acceptable input to queryDb:

// `id` is type `string`, so the type of 
// this argument is `string`, so the call is not allowed 
queryDb(`SELECT * from a where id = ${id}`)

// however, this type checks, as the first argument is
// inferred as either a literal type (matching its value) or, 
// through contextual typing, as LiteralString
queryDb('SELECT * from a where id = ?', [id])
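To make the difference between the two calls concrete, here is a minimal runtime sketch. The names userSuppliedId, interpolatedQuery, fixedQuery and queryParams are hypothetical and only serve the illustration; no real database client is involved.

```typescript
// Hypothetical user input; in a real application this would arrive with a request.
const userSuppliedId = "1 OR 1=1";

// Interpolation: the input becomes part of the SQL text itself.
const interpolatedQuery = `SELECT * from a where id = ${userSuppliedId}`;

// Parameterization: the SQL text stays fixed and the input travels separately as data.
const fixedQuery = "SELECT * from a where id = ?";
const queryParams: unknown[] = [userSuppliedId];
```

In the first form the user's text ends up inside the query string, which is exactly the case the proposed type is meant to reject at compile time.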

There is a bunch of prior art for such a type, with identical motivation, including the LiteralString type in Python. There was also a proposal to have JS engines track whether a string was created entirely from literals, which would've been used to allow DOM APIs like innerHTML to treat literal strings as safe, as part of a broader strategy to protect against XSS. (Of course, this TS proposal is compile-time only, but the motivation is the same.) Additionally, there was/is an analogous type in Google's Closure Compiler, with the same motivation. Finally, Scala has an analogous type, Singleton, which is inhabited by all literal types.

Potentially, the built-in type could be called Literal, rather than LiteralString, and could also include other kinds of literals (numbers, bigints, etc); APIs which need a string would then do Literal & string, or TS could provide LiteralString as a built-in alias.

I guess there's an argument that tracking all literal values in the same way, and having a unified Literal type, is more elegant, and perhaps there are some use cases outside of security for which such a type would be valuable. For the security use case, though, if an API takes a non-string, and you pass user input to that API (or some value derived from user input), it seems almost certain that you intended to let the user control the API with their input. In these non-string cases, there's nothing analogous to the "you intended to allow the user to provide some data, but they tricked the system into interpreting that data as code" problem that's at the heart of SQL injection, XSS, and related vulnerabilities.

Given all that, I guess I'd propose starting with only LiteralString, as that's presumably less effort to implement and adds less overhead to compile times. If legitimate use cases for a more general Literal type arise, then it's easy to implement that later and redefine LiteralString as Literal & string.

MartinJohns commented 2 years ago

Related: #41114

RyanCavanaugh commented 2 years ago

which would be the supertype of all strings that are declared literally in the program's source text (or derived from such strings, e.g. by concatenating two of them)

From a performance perspective, this seems like a nightmare, and possibly even undecidable given how TS's type system works. Is there a workable formal definition of this?

ethanresnick commented 2 years ago

@RyanCavanaugh For the performance concerns, I don't know the TS implementation well enough to understand the issue. Can you elaborate a bit? The linked PEP did seem to imply that this ended up being simple to implement for Python, which gives me some hope; but, of course, that might not translate to TS for a million reasons.

As far as a definition goes, what type of "formality" did you have in mind? I'm not sure what would be helpful, but I'll try to give some examples...

This is the simplest case: x is inferred as a literal type today, so it's assignable to LiteralString:

const x = "hello";
let y: LiteralString = x;

Beyond that, the most common way of combining literal strings is probably with + (including for multi-line strings), so I think an assignment like the below would need to work:

const x = "hello" + " world";
let y: LiteralString = x;

If the type of x above could be inferred as "hello world" rather than string, then this collapses into the first example. Changing this intrinsic behavior of + seems like it could be a useful independent change, but it'd be critical here.

Concatenating w/ template strings would ideally be supported too:

const x = "hello";
const y = `${x} world`;
let z: LiteralString = y;

I think this builds pretty straightforwardly on the above. It also seems like the constant expression machinery for enums might be applicable.
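For reference, this is a small sketch of how current TypeScript (4.3 or later, as far as I know) infers these expressions. IsWideString is a hypothetical helper introduced only for this illustration; each annotated constant compiles only if the inferred type is what the comment claims.

```typescript
// Resolves to `true` only when T is the wide `string` type.
type IsWideString<T> = string extends T ? true : false;

const helloLiteral = "hello";                        // inferred as the literal type "hello"
const plusResult = "hello" + " world";               // inferred as `string` today
const templateResult = `${helloLiteral} world`;      // inferred as `string` today
const constTemplate = `${helloLiteral} world` as const; // inferred as "hello world"

// Compile-time checks of the inferences above.
const literalIsWide: IsWideString<typeof helloLiteral> = false;
const plusIsWide: IsWideString<typeof plusResult> = true;
const templateIsWide: IsWideString<typeof templateResult> = true;
const constTemplateIsWide: IsWideString<typeof constTemplate> = false;
```

So a template literal with `as const` already keeps its literal type, while + always widens to string; the proposal would need the latter to change.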

In addition to literal types being assignable to LiteralString, existing LiteralStrings can be concatenated with each other, with the result being a LiteralString.

let a: LiteralString = "SELECT * from foo";
if(applyLimit) {
  a +=  " LIMIT 1"; // assignment should succeed.
}

More generally, if typeof s is LiteralString, the expressions s + 'xyz' and `${s}xyz` could be typed as just LiteralString, or TS could try to preserve more info by producing the type `${LiteralString}xyz`.

I think this gets tricky with widening. I.e., in let x = "hello", is x typed as a string or LiteralString? My understanding is that TS cannot easily support something like the below:

declare const a: string;
let query = "SELECT * from foo";
await executeQuery(query); // ok. typeof query = LiteralString

query += a; // typeof query silently changes to string upon concatenating the non-literal string `a`
await executeQuery(query) // this now fails

Assuming the above can't be supported, I think we'd have to keep the current behavior where let x = "hello" types x as simply string. People who want to imperatively build up a LiteralString from pieces would have to explicitly use an annotation:

let query: LiteralString = "....";
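The widening behavior referred to here can be observed in today's TypeScript. A minimal sketch, where IsWideString, fixedGreeting and mutableGreeting are hypothetical names used only for illustration:

```typescript
// Resolves to `true` only when T is the wide `string` type.
type IsWideString<T> = string extends T ? true : false;

const fixedGreeting = "hello"; // `const` keeps the literal type "hello"
let mutableGreeting = "hello"; // `let` widens the declared type to `string`

// Compile-time checks of the two inferences.
const fixedIsWide: IsWideString<typeof fixedGreeting> = false;
const mutableIsWide: IsWideString<typeof mutableGreeting> = true;

// Because the declared type is `string`, any string can be assigned later.
mutableGreeting = "anything else";
```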

Finally, for LiteralString to be useful more generally, there'd need to be overloads for some of the built-in string functions to preserve LiteralString-ness. (The Python proposal has a list of these.)

// again, an explicit annotation's probably needed here, or `as const`;
// otherwise, `conditions` inferred as just `string[]`.
const conditions: LiteralString[] = [
  "status = 'published'", 
  "created_at > '2022-01-01'", 
  "author_id = ?"
];

await query(`SELECT * from posts WHERE ${conditions.join(' AND ')}`);

So, there's an implicit overload on join (assuming query only accepts LiteralString):

interface Array<T> {
    join(separator?: LiteralString): T extends LiteralString ? LiteralString : string;
}

The basic idea would be that any deterministic operation involving only LiteralStrings should be thought of as producing a LiteralString.
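As a rough userland approximation of that rule (not the proposal itself, and not anything TS provides), one could use a branded string plus helper functions. Everything below is an assumption made for illustration: the brand property, and the helpers literal, concatLiterals and joinLiterals.

```typescript
// Userland stand-in for the proposed type: a string with a compile-time-only brand.
type LiteralString = string & { readonly __literalBrand: true };

// Entry point: accepts only arguments whose type is narrower than `string`.
function literal<S extends string>(text: string extends S ? never : S): LiteralString {
  return text as unknown as LiteralString;
}

// Deterministic operations over LiteralStrings produce LiteralStrings.
function concatLiterals(...parts: LiteralString[]): LiteralString {
  return parts.join("") as LiteralString;
}

function joinLiterals(parts: readonly LiteralString[], separator: LiteralString): LiteralString {
  return parts.join(separator) as LiteralString;
}

const conditionList = [literal("status = 'published'"), literal("author_id = ?")];
const whereClause = joinLiterals(conditionList, literal(" AND "));
const fullQuery = concatLiterals(literal("SELECT * from posts WHERE "), whereClause);
```

The obvious cost compared to a built-in type is that every literal has to be wrapped by hand and the helpers rely on casts internally.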

For some of these overloads — especially of methods that live on strings — I'm not sure if TS supports a good place to put them. E.g., how would we specify that calling toUpperCase() on a LiteralString produces a LiteralString?

That said, I think defining LiteralString overloads for the built-in methods is the least important part of this proposal. Many times — maybe the majority? — the final literal string will just be written inline, without the user building it up from other literal strings. E.g., you'll just be doing: query("SELECT .... WHERE x = ?", [paramValue1]). For the remaining times when a LiteralString is built up from sub-components, my guess is that the + and join together cover many of the cases. If there are occasional remaining cases where an overload can't easily be provided, then a cast isn't the worst thing — e.g., myLiteralString.split("\n") as LiteralString[].

RyanCavanaugh commented 2 years ago

Ah, I was taking this much more literally (ha!) that LiteralString would actually be a union of all the literally-written strings in the program.

In terms of TypeScript relative to Python, I think there'd be a very difficult cognitive leap at the point where the runtime behavior crosses into the type system behavior. I believe with the definitions given, this program is supposed to have an error, but it seems like a hard sell:

function foo(x: "bar") {
  fn(x);
}
function fn(x: LiteralString) {
}

ethanresnick commented 2 years ago

@RyanCavanaugh Now I’m confused haha. Why would the example code you showed have an error? The type of x is "bar", which is a literal type, so it would be assignable to LiteralString just fine (when calling fn). I’m also not following your comment about the runtime and type system interaction; in TS, this would be a purely compile-time check, which is how it works in Python too (and Scala iiuc).

ethanresnick commented 2 years ago

Ah, I was taking this much more literally (ha!) that LiteralString would actually be a union of all the literally-written strings in the program.

Totally my fault! I see how the original text implied that. I’ve updated the OP to hopefully make it much clearer what I’m actually proposing

fatcerberus commented 1 year ago

Why would the example code [Ryan] showed have an error?

I think the implication was that despite "bar" being a literal type, the specific value of x at runtime is not guaranteed to originate in source code. Of course with a single literal type that doesn’t make any sense, but the problem becomes much clearer if you imagine the type involved is "foo" | "bar" | "baz". It seemed like your intent was that only hard-coded strings are assignable to the proposed type, so that the string doesn’t ever get to be chosen out-of-band (e.g. by the caller of a function).
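A small sketch of that scenario. Since LiteralString doesn't exist, a generic parameter that rejects only the wide string type stands in for it here; that stand-in, and the names Choice, acceptsNarrowString, forward and parsedChoice, are assumptions for illustration.

```typescript
type Choice = "foo" | "bar" | "baz";

// Stand-in for a function taking the proposed type: rejects only the wide `string`.
function acceptsNarrowString<S extends string>(value: string extends S ? never : S): string {
  return String(value);
}

function forward(choice: Choice): string {
  // Type checks: `Choice` is a union of literal types, not `string`.
  return acceptsNarrowString(choice);
}

// The concrete value is picked at runtime from parsed data, yet it has a literal type.
const parsedChoice = JSON.parse('"baz"') as Choice;
const forwarded = forward(parsedChoice);
```

The set of possible values is still fixed in source, but which one is used is decided out-of-band.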

qraynaud commented 4 months ago

For now I found a shitty-workaround for this:

const fn = <const S extends string>(str: string extends S ? never : S) => {}

fn("test") // passes
fn("test" as string) // fails
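Applied to the queryDb example from the original post, the same trick might look roughly like this. It's only a sketch: RecordedCall and recordedCalls are a hypothetical stand-in for a real database client, and the const modifier is left out because a string constraint should already keep a literal argument narrow.

```typescript
// Hypothetical stand-in for a real database client: it just records each call.
interface RecordedCall {
  sql: string;
  params: unknown[];
}
const recordedCalls: RecordedCall[] = [];

// `sql` collapses to `never` when the argument's type is the wide `string`.
function queryDb<S extends string>(sql: string extends S ? never : S, params: unknown[] = []): void {
  recordedCalls.push({ sql: String(sql), params });
}

const requestedId: string = "42"; // pretend this came from user input

// Accepted: the query text is a string literal and the input is passed as a parameter.
queryDb("SELECT * from a where id = ?", [requestedId]);

// Rejected at compile time, because the argument is typed as plain `string`:
// queryDb("SELECT * from a" as string);
```

Note that this only rejects arguments whose type is exactly string; it doesn't track how the string was built.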
