Introduce a `Language` type to provide consistent language information of strings.

ilotoki0804 commented 5 months ago

Currently, Python has no consistent way to indicate that a string contains code written in a particular programming language.

As a result, editors cannot syntax highlight code embedded in strings, which hurts productivity and readability and invites bugs and errors when dealing with other languages as strings.

This article gives an example of the current problem.

...

A component in my library is a combination of python code, html, css and javascript. Currently I glue things together with a python file, where you put the paths to the html, css and javascript. When run, it brings all of the files together into a component. But for small components, having to juggle four different files around is cumbersome, so I’ve started to look for a way to put everything related to the component in the same file. This makes it much easier to work on, understand, and with fewer places to make path errors.

Example:
class Calendar(component.Component):
    template_string = '<span class="calendar"></span>'
    css_string = '.calendar { background: pink }'
    js_string = 'document.getElementsByClassName("calendar)[0].onclick = function() { alert("click!") }'
Seems simple enough, right? The problem is: There’s no syntax highlighting in my code editor for the three other languages. This makes for a horrible developer experience, where you constantly have to hunt for characters inside of strings. You saw the missing quote in js_string right? 🙂

...

Traditional approaches and issues

Typical case

Typically, syntax highlighting is not provided at all because there is no way for the editor to know the language of the string, which leads to several drawbacks.

Batch syntax highlighting of raw strings for regexes in VSCode

VSCode provides simple syntax highlighting for regexes when using raw strings, as shown below.

However, this approach has several drawbacks. First of all, it doesn't generalize to languages other than regexes. Also, since raw strings aren't just for regexes, it creates a visual distraction for people who want to use raw strings for non-regex reasons, such as Windows paths.

Below is an example of syntax highlighting for regex applied to Windows path, which actually reduces readability.

`Language` and `LiteralLanguage`

Language is a subtype of str that indicates that the string represents a specific language. LiteralLanguage is a subtype of LiteralString, and is used in the same way as Language.

Language takes a single type argument, and in its place you put the name of the language, for example, Language["html"].

Editors should provide basic syntax highlighting for string literals whose type is Language or LiteralLanguage, much as they do for fenced code blocks in Markdown.

The Language type may also be implied by the type of the parameter.

from typing import Language

Language["html"] # The brackets hold the name of the language.

my_css: Language["css"] = "p { font-size: 20px; }" # This string is considered CSS and should be syntax highlighted.

def get_html(html: Language["html"]):
    ...

get_html("<p>hello, world!</p>") # This string is considered HTML and should be syntax highlighted.

def dreamberd_compiler(code: Language["java"]):
    ...

# `Language` can also be used in "reasonably similar code". This code should have syntax highlighting for Java.
dreamberd_compiler("var var hello = 123!")

def get_path(path: str):
    ...

def get_pattern(pattern: Language["re"]):
    ...

# A raw string is no longer highlighted as a regex just because it is raw.
get_path(r"C:\Users\user\python.py") # This string shouldn't have any syntax highlighting.
get_pattern(r"a\rb+b?[abc]") # This string should be syntax highlighted as a regex.
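Since `typing.Language` does not exist today, the snippets above cannot be run as written. The following is only a rough runtime stand-in, under the assumption that `Language["html"]` can be modelled as `Annotated[str, ...]`; the names `LanguageTag` and the shim class itself are hypothetical, and a static type checker would not accept this spelling.

```python
from typing import Annotated, get_type_hints


class LanguageTag:
    """Hypothetical metadata object carrying the language name."""

    def __init__(self, name: str) -> None:
        self.name = name


class Language:
    """Hypothetical runtime stand-in for the proposed typing.Language."""

    def __class_getitem__(cls, name: str):
        # Language["html"] becomes Annotated[str, LanguageTag("html")] at runtime.
        return Annotated[str, LanguageTag(name)]


def get_html(html: Language["html"]) -> str:
    # The annotation carries no runtime checks; any string is accepted.
    return html


# A tool could recover the language name from the annotation metadata.
hints = get_type_hints(get_html, include_extras=True)
tag_name = hints["html"].__metadata__[0].name
```

This only shows that the language name can travel with the annotation; it says nothing about how an editor would consume it.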

Errors

It is difficult to set the Language type to remain a Language type after an operation, as this would complicate the implementation and make it difficult to provide a clear criterion for the type.

For example, does Language["A"] + Language["A"] always result in Language["A"]? Of course it often does, but it's very hard to generalize.

The case of Language["A"] + Language["B"] is also tricky. Should we catch the type as Language["A"], or should it be Language["B"]? And what about Language["A"].strip()? It's hard to maintain consistency or a single standard for these operations. Therefore, Language should be considered more as a feature for annotation than for complex static type checking.

Therefore, a type checker should accept any string as a valid value for a Language type, regardless of its contents, and an editor should not raise an error if the string fails to parse.

Developers should also not expect a value annotated with `Language` to be fully valid code that will pass the language's compiler.

Conversely, Language can be used for code that is "reasonably close" to the appearance of the language. Developers should consider whether syntax highlighting helps or hinders users when deciding whether to use Language or just use str for languages that are not exactly the same as the target language.

Post-operation type

The type Language should be treated as str when computed, and LiteralLanguage should be treated as LiteralString when computed.

# In the case of `LiteralLanguage`

literal_html: LiteralLanguage["html"] = "<h1>Hello, world!</h1>"
literal_sql: LiteralLanguage["sql"] = "SELECT CustomerName, City FROM Customers;"
my_literal: LiteralString = "Contents: {}"

# When two different languages are concatenated, the result is considered a `LiteralString`.
reveal_type(literal_html + literal_sql)  # type: LiteralString

# If two strings of the same language are concatenated, or if a `LiteralLanguage` is concatenated with a literal, the result should still be considered a `LiteralString`.
reveal_type(literal_html + literal_html)  # type: LiteralString
reveal_type(literal_html + " ")  # type: LiteralString
reveal_type(f"""
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
    {literal_html}
</body>
""")  # type: LiteralString

# A `LiteralString` can be cast to a `LiteralLanguage`.
template: LiteralLanguage["html"] = reveal_type(f"""
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
    {literal_html}
</body>
""")  # type: LiteralLanguage["html"]

# For all other operations, a `LiteralLanguage` is considered a `LiteralString`.
reveal_type(my_literal.format(literal_html))  # type: LiteralString
reveal_type(input() + literal_html)  # type: str

# In the case of `Language`

use_input = input().lower().startswith("y")
html: Language["html"] = "<h1>Hello, world!</h1>" if use_input else input()
literal_sql: LiteralLanguage["sql"] = "SELECT CustomerName, City FROM Customers;"
query: Language["sql"] = "SELECT CustomerName, City FROM Customers;"
my_literal: LiteralString = "Contents: {}"

reveal_type(html + literal_sql)  # type: str
reveal_type(html + query)  # type: str

reveal_type(html + html)  # type: str
reveal_type(html + " ")  # type: str
reveal_type(f"""
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
    {html}
</body>
""")  # type: str

template: Language["html"] = reveal_type(f"""
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>
    {literal_html}
</body>
""")  # type: Language["html"]

reveal_type(my_literal.format(html))  # type: str
reveal_type(input() + html)  # type: str

`BytesLanguage`?

BytesLanguage is the bytes version of Language. We should think about whether we need this type.

However, there is no type called LiteralBytes, so at least LiteralBytesLanguage can't exist.

Language names

The language identifier in Language must be lowercase, e.g. Language["python"] instead of Language["Python"].

For language names, it seems like a good idea to use what is used for code blocks in Markdown that developers are familiar with, but the exact definition of this is up to the editor.

Supported languages

A list of supported languages is beyond the scope of this documentation and should be up to each editor's implementation. However, editors should be able to provide basic syntax highlighting for common languages like Python, HTML, SQL, etc.

dmwyatt commented 5 months ago

I just want to note that PyCharm allows you to tell it that any string literal is of any language it supports and basically supports all the IDE features for that language.

You do have to manually tell it this though.

edit: You can also add comments like # language=<language_ID> before a string literal in PyCharm and it will know what to do. There are also rules for automatic injections.
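For illustration, the comment form looks roughly like the sketch below. The language IDs `HTML` and `SQL` are the ones I believe PyCharm uses, so treat them as an assumption and check the IDE's language injection settings; the variable names are made up.

```python
# language=HTML
template = '<span class="calendar"></span>'

# language=SQL
query = "SELECT CustomerName, City FROM Customers;"
```

To Python itself these are ordinary comments, so the code runs unchanged in any editor.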

This isn't a vote for or against a Language type, just providing additional context.

TeamSpen210 commented 5 months ago

Instead of an actual type, any reason not to do Annotated[str, Language("html")]? That would inherently allow LiteralString and bytes, and has all the correct type semantics. It is a little more verbose, but a type alias solves that.
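A minimal sketch of what that could look like, assuming a hypothetical `Language` marker class (nothing like it exists in `typing` today) and made-up alias names:

```python
from dataclasses import dataclass
from typing import Annotated, get_type_hints


@dataclass(frozen=True)
class Language:
    """Hypothetical marker; not part of the typing module."""

    name: str


# Type aliases hide the verbosity of Annotated.
HTML = Annotated[str, Language("html")]
CSS = Annotated[str, Language("css")]
HTMLBytes = Annotated[bytes, Language("html")]  # bytes works the same way


def render(template: HTML, style: CSS) -> HTML:
    return f"<style>{style}</style>{template}"


# Type checkers treat these parameters as plain `str`;
# tools that care can read the metadata back out.
hints = get_type_hints(render, include_extras=True)
template_language = hints["template"].__metadata__[0].name
style_language = hints["style"].__metadata__[0].name
```

Because `Annotated[str, ...]` is just `str` to a type checker, concatenation, `.strip()` and so on keep their usual result types without any new rules.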

erictraut commented 5 months ago

This topic has been explored in some depth within this pylance discussion.

To me, this doesn't seem like something that necessitates an extension to the type system. There are many ways this can be specified using existing type system or language constructs.

ilotoki0804 commented 5 months ago

@dmwyatt Using comments to indicate language is fine. However, using Language has several advantages over using comments.

Automatically applied

While comments are limited to a single string and must be manually marked, types have the advantage of being automatically applied to any code that uses functions or other typed elements.

For example, any code that uses the following execute function will automatically receive syntax highlighting for JavaScript.

def execute(code: Language["js"]):
    ...

No collision when multiple strings overlap

Consider a function like this

def build_html_from_markdown(article: str, style: str, script: str) -> str:
    """`article` takes Markdown, `style` takes CSS, `script` takes JavaScript."""
    ...

This function uses a bunch of different languages at once, thus there's a bit of ambiguity when using it.

# language=???
build_html_from_markdown("# hello, world!", "h1 { background: pink; }", "alert('welcome!')")

This could be fixed by splitting the call across multiple lines, but it effectively demonstrates that a more fundamental solution is needed.

Using the Language type solves this problem.

def build(
    article: Language["markdown"],
    style: Language["css"],
    script: Language["js"],
) -> Language["html"]:
    ...

build("# hello, world!", "h1 { background: pink; }", "alert('welcome!')")

Matches the context in which the type hint is used

Specifying the language as a type reduces the need to express information about it in other ways, and allows users to infer what the parameters of a function require from the type hint before reading the documentation. And it also allows developers to write clearer code, since the type hint expresses the format, leaving the variable or parameter names to express other, more important information. I think this is in line with why type hints were introduced.

Provides a single, consistent, and clear way to represent the language of a string

@erictraut While there are many different ways to tell what language a string contains, there hasn't been a single, universally followed method. But it's important enough that we need a formal, documented way of doing it, and I think the Language type is appropriate for it.

Alternatives

`Annotated[str, Language("html")]`

@TeamSpen210 If we can achieve static syntax highlighting with this implementation, I think it might be one of the options to consider, but I wonder if we can achieve syntax highlighting via Annotated.
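As a rough check of whether the information is available statically, the sketch below walks a module's syntax tree with `ast` and collects the language of each parameter. It is not how any editor actually works: it assumes Python 3.9+, the hypothetical `Language("...")` marker, and the literal spelling `Annotated[..., Language("...")]`, so it would miss type aliases.

```python
import ast

SOURCE = '''
def build(
    article: Annotated[str, Language("markdown")],
    style: Annotated[str, Language("css")],
    title: str,
) -> str:
    ...
'''


def _language_from_annotation(annotation):
    """Return the language name if the annotation is Annotated[..., Language("x")]."""
    if not (isinstance(annotation, ast.Subscript)
            and isinstance(annotation.value, ast.Name)
            and annotation.value.id == "Annotated"
            and isinstance(annotation.slice, ast.Tuple)):
        return None
    # Skip the first element (the real type) and scan the metadata.
    for element in annotation.slice.elts[1:]:
        if (isinstance(element, ast.Call)
                and isinstance(element.func, ast.Name)
                and element.func.id == "Language"
                and len(element.args) == 1
                and isinstance(element.args[0], ast.Constant)
                and isinstance(element.args[0].value, str)):
            return element.args[0].value
    return None


def languages_of_parameters(source: str) -> dict:
    """Map each function name to {parameter name: language name}."""
    found = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            for arg in node.args.args:
                language = _language_from_annotation(arg.annotation)
                if language is not None:
                    found.setdefault(node.name, {})[arg.arg] = language
    return found


result = languages_of_parameters(SOURCE)
```

Here `result` maps `build` to the languages of `article` and `style`, while the plain `str` parameter `title` is left out. An editor would still need to connect this to the string literals at each call site.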

`Language[str, "html"]`

The idea of specifying a type for the first parameter of Language is worth considering.

In my opinion, this alternative might be better if it is decided to implement BytesLanguage, but if not, it would be better to just use Language and LiteralLanguage in favor of simplicity over extensibility.

dmwyatt commented 5 months ago

Yes, I agree that the comments implemented by PyCharm are not fool-proof. I was merely pointing towards prior art.

I think the underlying thing you're reaching for might be the lack of a standardized way of annotating languages in strings that is accepted across all IDEs and editors? Maybe types are the best way to get to that point...I don't know. I certainly like the idea.

The more generalized issue is the lack of a way to specify the structure or type of the data in a string. One can imagine there are many types of data that can be contained in a string, and programming languages are just one of them.

I'm very sympathetic to all of these ideas:

"Officially" recognized ways of doing things are a way of getting consensus...if the typing module had a Language type it would surely be fairly quickly adopted by many IDEs and I'd use the heck out of it.
Maintainer resources are limited.
"Officially" recognizing a way of doing something like this is kind of dangerous because maybe we find out later we officially recognized doing something in a wrong or limiting way.

python / typing