python / typing

Python static typing home. Hosts the documentation and a user help forum.
https://typing.readthedocs.io/

Language specifications for strings #1370

Open boxed opened 1 year ago

boxed commented 1 year ago

In PyCharm you can set the programming/markup language of strings like this:

# language=html
foo = '<div>hello there!</div>'

I find this extremely useful and use it all over the place. A pattern I noticed is that a large proportion of such uses are in fact more like:

# language=html
foo = format_html("<div>{}</div>", bar)

In this case the format_html function doesn’t take just any string as its first argument; it takes a string that is supposed to be HTML. Reading through my usage of # language= I see that I have CSS, HTML, and JavaScript.

There are many places in my codebases that also have very different kinds of “languages” that unfortunately aren’t supported as language injections in PyCharm. Some examples:

fully qualified names (module.module.symbol, for example view functions in Django)
module names (module.module, for example app names in Django)

There are probably more, but I think this gets the point across.
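To make the "fully qualified name" case concrete, here is a minimal sketch of the check a tool could run on such a string. The helper name resolve_dotted_name is hypothetical, and it resolves names at runtime by importing modules; a static tool would have to do the equivalent without executing code.

```python
import importlib


def resolve_dotted_name(name: str) -> object:
    """Resolve 'package.module' or 'package.module.symbol' to the object it names.

    Hypothetical helper for illustration only.
    """
    parts = name.split(".")
    # Try the longest importable module prefix first, then walk the
    # remaining parts as attributes of that module.
    for split in range(len(parts), 0, -1):
        module_name = ".".join(parts[:split])
        try:
            obj = importlib.import_module(module_name)
        except ImportError:
            continue
        for attr in parts[split:]:
            obj = getattr(obj, attr)  # AttributeError: the symbol does not exist
        return obj
    raise ImportError(f"cannot resolve {name!r}")


print(resolve_dotted_name("json.decoder"))              # a module name
print(resolve_dotted_name("json.decoder.JSONDecoder"))  # a fully qualified symbol
```

A string that names a missing symbol, such as "json.nope", fails with AttributeError, which is the kind of error an editor could surface inline.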

PyCharm has some (presumably hardcoded) rules about some of these strings. For example, in settings.py it knows that the strings in the INSTALLED_APPS list are module names, so you can jump to the definitions of those modules, and PyCharm will resolve and check them for you. But this is a closed system: variables I introduce myself can’t be validated in this way.

I think it would be good if Python typing had a facility for this kind of thing. Once it gets some traction we could see support for it in Python language servers, PyCharm, static analysis tools, etc.

What do you guys think?

(originally posted at https://discuss.python.org/t/language-specifications-for-strings/21826/1 where it was suggested that this is the right place for this discussion)

srittau commented 1 year ago

There are currently two ways that this could be implemented. Using NewType, which constrains the type:

from typing import NewType, cast

HTMLString = NewType("HTMLString", str)
def print_html(html: HTMLString) -> None: ...

html = cast(HTMLString, "<div>Hi!</div>")
print_html(html)  # works
print_html("<script ...>")  # error

Using Annotated:

from typing import Annotated

def print_html(html: str) -> None: ... 

html: Annotated[str, "html"] = "<div>Hi!</div>"
print_html(html)

In both cases, the hard part is finding common ground for tool manufacturers to support this notation.
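As one possible shape for that common ground, the Annotated variant could carry a dedicated marker object instead of a bare string. This is only a sketch: the Language class, the HTML alias, and the language_of helper are all hypothetical, and no type checker or editor recognizes them today.

```python
from dataclasses import dataclass
from typing import Annotated, get_type_hints


@dataclass(frozen=True)
class Language:
    """Hypothetical marker; not part of typing or of any existing tool."""
    name: str


# Still a plain str to type checkers; the marker is only metadata.
HTML = Annotated[str, Language("html")]


def print_html(html: HTML) -> None:
    print(html)


def language_of(func, parameter: str):
    """Return the language name attached to a parameter, or None."""
    hints = get_type_hints(func, include_extras=True)
    for meta in getattr(hints.get(parameter), "__metadata__", ()):
        if isinstance(meta, Language):
            return meta.name
    return None


print_html("<div>Hi!</div>")            # a plain str literal is accepted, no cast
print(language_of(print_html, "html"))  # what a tool could read off the signature
```

Because Annotated does not change the underlying type, callers keep passing ordinary strings, and only tools that know about the marker need to act on it.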

AlexWaygood commented 1 year ago

> (originally posted at https://discuss.python.org/t/language-specifications-for-strings/21826/1 where it was suggested that this is the right place for this discussion)

Sorry for the ping-pong -- this issue really belongs over at the python/typing repo, rather than python/typeshed. @srittau, could you transfer it over? :)

boxed commented 1 year ago

@srittau

> In both cases, the hard part is finding common ground for tool manufacturers to support this notation.

100% agreed. It's clearly a "build it and they will come" kind of thing. It has to be built first, and people need to start pushing implementations for it all over the place. I can certainly write lots of PRs, but if the basic architecture isn't there, then I can't :P

erictraut commented 1 year ago

There has been some discussion of this topic in the pylance discussion forum.

boxed commented 1 year ago

@srittau Looking again at your example code I think you must have misunderstood the idea. Changing your example to what I want:

def print_html(html: HTMLString) -> None: ...

html = cast(HTMLString, "<div>Hi!</div>")
print_html(html)  # works
print_html("<script ...>")  # ALSO WORKS 

The proposal here is that in the second case (print_html("<script ...>")) the tooling could know that the argument is an HTML fragment, and so could apply syntax highlighting, for example, or check for matching tags.

For re.sub this would be a huge step up in usability! We could specify these languages for strings centrally, in Python's standard library or in typeshed, and once PyCharm and the Python Language Server support this, everyone would get syntax highlighting for regexes in their regex calls automatically, without having to change their code or annotate every argument.
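A rough sketch of the call-site side of this idea: the language is declared once on the parameter, and a tool maps each string argument of a call back to it. The Language marker, the sub wrapper, and injected_languages are hypothetical names; the real re.sub in typeshed carries no such annotation, and an editor would do this lookup statically rather than at runtime.

```python
import inspect
import re
from dataclasses import dataclass
from typing import Annotated, get_type_hints


@dataclass(frozen=True)
class Language:
    """Hypothetical marker; not part of typing or typeshed."""
    name: str


def sub(pattern: Annotated[str, Language("regex")], repl: str, string: str) -> str:
    """Stand-in for how re.sub could be annotated centrally (assumption)."""
    return re.sub(pattern, repl, string)


def injected_languages(func, *args, **kwargs) -> dict[str, str]:
    """Map each str argument of a call to the language its parameter declares."""
    hints = get_type_hints(func, include_extras=True)
    bound = inspect.signature(func).bind(*args, **kwargs)
    found = {}
    for name, value in bound.arguments.items():
        for meta in getattr(hints.get(name), "__metadata__", ()):
            if isinstance(meta, Language) and isinstance(value, str):
                found[value] = meta.name
    return found


# The caller writes an ordinary call; only the signature mentions the language.
print(injected_languages(sub, r"\d+", "#", "room 101"))
```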

boxed commented 1 year ago

A discussion at https://github.com/microsoft/pylance-release/discussions/3952 led to the concrete suggestion of a Language type used as Language['filename_extension'].

erictraut commented 1 year ago

I don't think Language['filename_extension'] would work. It would require a bunch of special casing in every runtime type checking library and static analysis tool because a quoted string in the position of a type argument is generally treated as a forward-declared symbol and is parsed as such. Literal is only one exception to this today, and that requires significant special casing.

The other suggestions above, including NewType and Annotated, do not suffer from this problem.
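The forward-reference behavior can be observed at runtime with typing.get_type_hints. In this sketch List stands in for the hypothetical Language generic (an assumption made only so the example runs), and the function names are made up for illustration.

```python
from typing import Annotated, List, Literal, get_type_hints


# A quoted string in a type-argument position becomes a forward reference.
def as_type_argument(html: List["html"]) -> None: ...


# Literal is special-cased: the string is a value, not a name to resolve.
def as_literal(lang: Literal["html"]) -> None: ...


# Annotated metadata is never interpreted as a type at all.
def as_metadata(html: Annotated[str, "html"]) -> None: ...


try:
    get_type_hints(as_type_argument)
except NameError as exc:
    print("treated as a forward reference:", exc)

print(get_type_hints(as_literal)["lang"])
print(get_type_hints(as_metadata, include_extras=True)["html"].__metadata__)
```

Resolving the first signature fails because nothing named html exists in scope, while the Literal and Annotated forms resolve without any lookup.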

boxed commented 1 year ago

@erictraut I don't understand the distinction. Why can't Language["html"] be implemented as some variant of Annotated[str, "html"]?

Anyway, I think my point was more about using the word "language" somewhere. The problem with Annotated[str, "html"] is that the bare string "html" carries no indication that it names a language. To a human who knows a ton about programming that's not a big difference, but to tooling it's the difference between being able to automatically install language support for the specified language, and not.
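At runtime, the two spellings can be made equivalent with __class_getitem__, as in the sketch below. The Language class is hypothetical, and this only covers the runtime side: a static type checker reading the source would still parse the quoted "html" as a forward reference unless it special-cased Language.

```python
from dataclasses import dataclass
from typing import Annotated, get_args, get_type_hints


@dataclass(frozen=True)
class Language:
    """Hypothetical marker that says explicitly: this string names a language."""
    name: str

    def __class_getitem__(cls, name: str):
        # Language["html"] expands to Annotated[str, Language("html")] at runtime.
        return Annotated[str, cls(name)]


def render(template: Language["html"]) -> str:
    return template


hint = get_type_hints(render, include_extras=True)["template"]
print(get_args(hint))  # the underlying type plus the Language marker
```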

adriangb commented 1 year ago

I really like the Annotated solution. Runtime typing has been trying to standardize related metadata; the latest effort is over at https://github.com/annotated-types/annotated-types. It would be very cool if you could do something like:

class MyModel:
    template: Html  # aka Annotated[str, Html(flavor=?)] or similar
    regex: Regex  # aka Annotated[str, Regex(flags=0)] or similar

And have that be available for both static and runtime type checking, so that MyModel(regex="<some invalid regex>") gets flagged by static type checkers and MyModel.validate({'regex': 'some unverified data'}) gets checked at runtime, all from a single annotation, without having to duplicate it.
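The runtime half of that could look roughly like the following. The Regex marker and the validate classmethod are assumptions made for this sketch; they are not taken from annotated-types or any other library, and the static half would still need type checker support.

```python
import re
from dataclasses import dataclass
from typing import Annotated, get_type_hints


@dataclass(frozen=True)
class Regex:
    """Hypothetical marker: the annotated string must compile as a regex."""
    flags: int = 0


@dataclass
class MyModel:
    regex: Annotated[str, Regex()]

    @classmethod
    def validate(cls, data: dict) -> "MyModel":
        """Check every Regex-annotated field in data, then build the model."""
        hints = get_type_hints(cls, include_extras=True)
        for field, hint in hints.items():
            for meta in getattr(hint, "__metadata__", ()):
                if isinstance(meta, Regex):
                    # re.compile raises re.error for an invalid pattern.
                    re.compile(data[field], meta.flags)
        return cls(**data)


print(MyModel.validate({"regex": r"\d+"}))

try:
    MyModel.validate({"regex": "("})
except re.error as exc:
    print("rejected:", exc)
```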

boxed commented 6 months ago

A related discussion for PyCharm: https://youtrack.jetbrains.com/issue/PY-53278/Cant-inject-template-language-into-Python-string