Tag string use-case: language marker for syntax highlight

EmilStenstrom commented 1 year ago

In PyCharm it's possible to write # language=html before a string, and have that string be syntax highlighted as html (or any other supported language). It's not supported by any other editor. It's clunky because it's not obvious what happens if the comment is on the same line as other code, if there's a newline in between, and so on.

What if there was a way to tag a string, so the code editor knows how to syntax highlight that string. Oh wait, that's what you're working on! :)

My use-case:

A component in my library is a combination of Python code, HTML, CSS and JavaScript. Currently I glue things together with a Python file, where you put the paths to the HTML, CSS and JavaScript. When run, it brings all of the files together into a component. But for small components, having to juggle four different files around is cumbersome, so I've started to look for a way to put everything related to the component in the same file. This makes the component much easier to work on and understand, with fewer places to make path errors.

Example:

class Calendar(component.Component):
   template_string = '<span class="calendar"></span>'
   css_string = '.calendar { background: pink }'
   js_string = 'document.getElementsByClassName("calendar)[0].onclick = function() { alert("click!") }'

Seems simple enough, right? The problem is: There's no syntax highlighting in my code editor for the three other languages. This makes for a horrible developer experience, where you constantly have to hunt for characters inside of strings. You saw the missing quote in js_string right? :)

If I instead use separate files, I get syntax highlighting and auto-completion for each file, because editors set language based on file type. But should I really have to choose?

Proposal:

Would it be compatible with your work to recommend that editors syntax highlight strings in Python that have markers corresponding to this list of language identifiers? https://code.visualstudio.com/docs/languages/identifiers#_known-language-identifiers

This would make the developer experience even better for developers, and make your example tags in this repo even more powerful. html'<span class="calendar">{content}</span>' would be highlighted as proper HTML (because the name html was used).

Advantages:

Gives code editors a great API to standardize around. # language=html is not it...
Makes it easier to spot errors inside those tagged strings
Using standard names makes the feature discoverable for people that did not read the docs. "Oh, I get syntax highlighting when I use this tag, nice!"
Everything just works for developers, no config, no installing stuff.

Even without my suggestion above, I really like what you're doing here.

Let me know what you think!

rmorshea commented 1 year ago

Would it be compatible with your work, to recommend editors to syntax highlight strings...

Typically I think this sort of thing ends up being handled by plugins. For example, out of the box VSCode does not highlight Javascript template literals. Instead people tend to use plugins like lit-html which, in addition to syntax highlighting, provides IntelliSense.

With that said, it would be beneficial to supply plugins for popular editors (probably VSCode and PyCharm) by the time the Python version containing these changes is released. Ultimately, if those plugins became popular enough it could make sense for the behavior to get built into PyCharm or the Python extension for VSCode.

Before then though, it seems like the main action we would take with respect to this PEP would be to mention that editors could supply features of this nature.

pauleveritt commented 1 year ago

I'm with PyCharm and I'm hopeful that (a) this all lands and (b) it makes it easier to support something basic out-of-the-box for IDEs and (c) more custom usages through plugins.

For my component-thingy, not sure if the proposal would fit exactly, as it isn't anything on the list, as @rmorshea hints at. But likely that could be fixed.

rmorshea commented 1 year ago

hopeful that this all lands...

I'm in a bit of a lull at work right now so I think I may have some time to work on this over the holidays.

EmilStenstrom commented 1 year ago

@rmorshea Do you mean time to land tagstr in cpython main? That would be awesome :)

rmorshea commented 1 year ago

Unfortunately no, this is very much a work in progress. At the moment, we don't have a complete draft proposal to share and get feedback on. I'm not super familiar with CPython's release timelines, but my guess is that we'd be shooting to get this into CPython 3.13.

EmilStenstrom commented 1 year ago

Does that mean there's a group of people working on this atm? I see no activity in this repo.

rmorshea commented 1 year ago

At the moment it's @jimbaker and myself with support from @gvanrossum who, in addition to providing feedback and ideas, has contributed a branch of CPython with an initial implementation of tag-strings that we've been using to test out our work. With that said, both Jim and I have gotten busy as of late. Right now we each have draft PRs up that, respectively, contain an initial specification and a tutorial on how tag strings could be used to render HTML templates. There's a lot more work to be done though.

jimbaker commented 1 year ago

@rmorshea Thanks for following up, I was out of the loop during the holidays.

@EmilStenstrom It's a really good idea, I'm just trying to think how a tag function could declare itself as supporting a specific syntax. We all "know" that html or sql means something when we are looking at the code, because we are all writing self-documenting code ;) , but connecting that automatically to a specific tag function is the interesting trick. Maybe it's just enough to assume html or htm means HTML, etc. Another thought is that the tag function could be appropriately annotated so that the syntax highlighter could determine this, but there's the initial setup problem - what annotations, and how? My current thought is through some standard set of decorators that could decorate tag functions.
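
A minimal sketch of the decorator idea, assuming a hypothetical `dsl` decorator that records the language name in a hypothetical `__dsl__` attribute. Neither name comes from the tagstr work, and an editor would need to recognize the decorator statically rather than run it:

```python
def dsl(language):
    """Hypothetical decorator: mark a tag function as producing `language`."""
    def mark(func):
        # Assumed convention: store the language name on the function itself.
        func.__dsl__ = language
        return func
    return mark


@dsl("html")
def html(*args):
    # Identity tag function, standing in for a real implementation.
    return args


print(html.__dsl__)
```

The decorator leaves the tag function's behavior untouched; it only attaches metadata that tooling could look for.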

Related is that we will be working in the context of https://peps.python.org/pep-0701/ - work that also was discussed at PyCon back in April. The formalization of f-strings is very much related to the formalization we need to do for tag strings, and consequently any general syntax support.

EmilStenstrom commented 1 year ago

@jimbaker I opened an issue with the vscode folks here for their opinions: https://github.com/microsoft/pylance-release/issues/3874 - including some different ways this could work. I think the most straightforward way of doing this is to simply use the language identifiers that editors already use, and automatically apply syntax highlighting in that language, when a tag string with that name is used. This would make the feature discoverable in a way where things just work when you use a tag string which would be wonderful.

If you would like to avoid mistakes I think using annotations for this is nice. Annotated[str, "html"] could be specified in the tag function to make it highlight like you want.
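
A rough sketch of how that could look, assuming the convention that the first string in the `Annotated` metadata of the return type names the embedded language. That convention and the `embedded_language` helper are illustrative assumptions; an editor would read the annotation statically, while this example reads it at runtime for demonstration only:

```python
from typing import Annotated, get_type_hints


def html(*args: object) -> Annotated[str, "html"]:
    # Stand-in tag function: joins only the plain-string parts.
    return "".join(a for a in args if isinstance(a, str))


def embedded_language(func):
    """Return the language named in func's return annotation, or None."""
    hints = get_type_hints(func, include_extras=True)
    metadata = getattr(hints.get("return"), "__metadata__", ())
    for item in metadata:
        if isinstance(item, str):
            return item
    return None


print(embedded_language(html))
```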

pauleveritt commented 1 year ago

Speaking for PyCharm, I'm also interested. I sponsored a PyCharm plugin to explore the kinds of things I wanted from my htm.py-based component stuff. I could share some opinions but it's likely bike shedding.

jimbaker commented 1 year ago

Speaking for PyCharm, I'm also interested. I sponsored a PyCharm plugin to explore the kinds of things I wanted from my htm.py-based component stuff. I could share some opinions but it's likely bike shedding.

I think we don't even have the available colors picked out, so it cannot be bikeshedding just yet. 😁

One eventual possibility is a DSL registry, similar to https://www.schemastore.org/json/ Maybe that's easier because of the standard JSON Schema, but it does seem to solve the bootstrapping problem - how to connect a tag function with the syntax it supports. In the interim, something like a plugin setting would also work: this qualified name, which can be imported from some Python package (private or on PyPI), corresponds to this DSL, such as HTML or SQL. On the JSON schema side, I have used this setup for internally developed JSON schemas.

Note that while DSLs may vary, the interpolations themselves do not change except with respect to their syntactic placement and of course semantic meaning. See @rmorshea's recent additions in https://github.com/jimbaker/tagstr/blob/main/tutorial.rst for some more discussion, where we would want to have some strictness about placement, in this case raising SyntaxError("Cannot interpolate attribute names"). The net of this is that Alice and Bob could each implement their own HTML tags, and so long as they agree that interpolations can go in certain places (and maybe not others), the syntactic support would work equivalently.

benji-york commented 1 year ago

One eventual possibility is a DSL registry

For informational purposes, GitHub's syntax highlighting uses Linguist which maintains a project-internal registry.

EmilStenstrom commented 1 year ago

I think a great solution for this would be if all editors by default shipped with a list like Linguist's registry, or PyCharm's language identifiers, and automatically applied syntax highlighting to tag strings based on the tag name. Optionally there could be settings to say "in this project, I want this tag to highlight in this language".

Advantages:

Everything just works. If you make a custom tag for a new language, you automatically get syntax highlighting for it if you happen to pick that language's name (likely).
All of the responsibility of highlighting is moved to editors, so you don't really have to adapt the PEP to this. Except if you decide to ship pre-made tags, ideally pick the same language identifiers for them.

Disadvantages:

There is a risk that incorrect highlights occur. Specifically: "R" is a language that clashes with raw strings. Does that mean you handle that in editor settings? There are lots of languages, so new tag names that are not languages, might accidentally hit a language name and get incorrect highlights. Maybe a better strategy is to just include the top10 languages that are usually embedded into Python by default? Suggestion: HTML, CSS, JS, SH, SQL, XML, CSV/TSV, JSON, TOML, YAML?
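
The name-based lookup with a short default list and per-project settings could be sketched as follows. The identifiers in `DEFAULT_LANGUAGES`, the `language_for_tag` helper, and the shape of the overrides are all assumptions for illustration, not an agreed standard; existing string prefixes are skipped to sidestep the "R" clash:

```python
# Assumed default list, based on the suggestion above.
DEFAULT_LANGUAGES = {
    "html", "css", "js", "sh", "sql", "xml", "csv", "tsv", "json", "toml", "yaml",
}

# Existing Python string prefixes, which should never be read as languages.
STRING_PREFIXES = {"r", "u", "b", "f", "br", "rb", "fr", "rf"}


def language_for_tag(tag_name, project_overrides=None):
    """Pick a highlight language for a tag name, or None for no highlighting."""
    overrides = project_overrides or {}
    if tag_name in overrides:
        # "In this project, I want this tag to highlight in this language."
        return overrides[tag_name]
    lowered = tag_name.lower()
    if lowered in STRING_PREFIXES:
        return None
    return lowered if lowered in DEFAULT_LANGUAGES else None


print(language_for_tag("html"))
print(language_for_tag("q", {"q": "sql"}))
```

Keeping the default list short trades coverage for fewer accidental matches; anything else goes through the project setting.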

jimbaker commented 1 year ago

I think a great solution for this would be if all editors by default shipped with a list like Linguist's registry, or PyCharm's language identifiers, and automatically applied syntax highlighting to tag strings based on the tag name. Optionally there could be settings to say "in this project, I want this tag to highlight in this language".

Let's assume a registry like this was supported by editors.

Let's recall that tag names are just standard names in Python, bound (presumably) to some callable that supports the tag function protocol. Such callables can be imported and defined as usual. Python IDEs like PyCharm and VSCode can readily track their use and definition, much like other object usage.

So then it just becomes a question of registering the callable as supporting a specific DSL, like HTML or SQL. Especially in the case of SQL this also applies to dialect, such as SQLite vs Postgres. We just have to figure out how to do this registration.

Advantages:

Everything just works. If you make a custom tag for a new language, you automatically get syntax highlighting for it if you happen to pick that language's name (likely).

All of the responsibility of highlighting is moved to editors, so you don't really have to adapt the PEP to this. Except if you decide to ship pre-made tags, ideally pick the same language identifiers for them.

Yes, this would be ideal.

Disadvantages:

There is a risk that incorrect highlights occur. Specifically: "R" is a language that clashes with raw strings. Does that mean you handle that in editor settings? There are lots of languages, so new tag names that are not languages, might accidentally hit a language name and get incorrect highlights. Maybe a better strategy is to just include the top10 languages that are usually embedded into Python by default? Suggestion: HTML, CSS, JS, SH, SQL, XML, CSV/TSV, JSON, TOML, YAML?

r, R, and other prefixes that are currently predefined for Python string literals are not allowed to be tag names, so there's no conflict here.

There are two possible ways for the IDE to not properly recognize the desired language, assuming this registry:

Code has monkey patched the name so as to rebind it. This is generally bad practice in Python, outside of test patching (and even then, we expect it to be appropriately mocked).
Imports are indirect, via importlib or related functionality. In this case, the tag can't be automatically associated.
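
As a rough illustration of the static tracking described above, the sketch below resolves names bound by plain `from ... import ...` statements to fully qualified names and looks them up in a registry. The `REGISTRY` contents and the package names are hypothetical, and real IDEs do far more thorough name resolution; the point is only that the indirect `importlib` case is not picked up:

```python
import ast

# Hypothetical registry: fully qualified tag function name -> DSL.
REGISTRY = {"alice_pkg.tags.html": "html", "db_pkg.query.sql": "sql"}


def tag_languages(source):
    """Map local names bound by static imports to a registered DSL."""
    found = {}
    for node in ast.walk(ast.parse(source)):
        # Only absolute "from module import name" statements are handled here.
        if isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            for alias in node.names:
                qualified = f"{node.module}.{alias.name}"
                if qualified in REGISTRY:
                    found[alias.asname or alias.name] = REGISTRY[qualified]
    return found


SOURCE = '''
from alice_pkg.tags import html
from db_pkg.query import sql as q
import importlib
mystery = importlib.import_module("alice_pkg.tags").html
'''

# The source is only parsed, never executed, so the packages need not exist.
print(tag_languages(SOURCE))
```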

EmilStenstrom commented 1 year ago

Some updates: The Python extension people with VSCode are currently thinking about adding support for highlighting of python strings with other languages in them. The exact syntax is not decided, but there is technical triage going on about how to embed other languages inside python. Great news, and something I think would align nicely with this proposal.

Actually, I think this would greatly enhance working with Python overall, in the world's most popular code editor. I'm excited! :)

It seems that to get the ball rolling they need to see that people are excited about this. This is measured by the number of upvotes on this ticket: https://www.github.com/microsoft/pylance-release/issues/3874 - can I hope for some upvotes from you and some close friends of yours? :)

EmilStenstrom commented 1 year ago

So then it just becomes a question of registering the callable as supporting a specific DSL, like HTML or SQL. Especially in the case of SQL this also applies to dialect, such as SQLite vs Postgres. We just have to figure out how to do this registration.

My suggestion would be to map the tag name directly to the language with that name. But maybe that would be too aggressive? And yeah, that prohibits the use of r, since that's reserved as you say.

Something that's come up from some people I've talked to is using the typing system for this, so that if you type the callable as e.g. Annotated[str, "html"], editors would know what you mean. What are your thoughts on that? I think using the language identifiers as defined by PyCharm would make this really easy to use.

jimbaker commented 1 year ago

Something that's come up from some people I've talked to is using the typing system for this, so that if you type the callable as e.g. Annotated[str, "html"], editors would know what you mean. What are your thoughts on that? I think using the language identifiers as defined by PyCharm would make this really easy to use.

I think the use of PEP 593's Annotated makes a lot of sense. DSL registration for editors like PyCharm or VSCode needs to use static analysis, and PEP 593 supports this through a general scheme.

There are still conventions we need to use here for the metadata, but this approach seems to be the best. Thanks for the suggestion!

EmilStenstrom commented 1 year ago

Hmm... one issue with using Annotated is that it's not backwards compatible. There's still a lot of code out there that's on 3.6, and I'd guess editors would like to support it too. Annotated is from 3.9 if I understand the PEP above correctly.

jimbaker commented 1 year ago

@EmilStenstrom as new language syntax, this functionality is not backwards compatible.

Of course there are workarounds, similar to polyfills in JavaScript. @rmorshea mentions one, a transpiler - https://github.com/jimbaker/tagstr/issues/20 - and as @gvanrossum pointed out there, this can be implemented with codecs, which allow for arbitrary source code rewriting. There are a number of packages that use this approach. See for example https://github.com/pyxl4/pyxl4 and its predecessor packages.

rmorshea commented 1 year ago

it's not backwards compatible.

There's always the old type comments:

my_html = "<h1>Hello, World!</h1>"  # type: Annotated[str, "html"]

Not quite as slick, but I'm pretty sure editors will treat this the same as a normal type annotation.

EmilStenstrom commented 1 year ago

@rmorshea excellent workaround. I’m sold on using Annotated :)

arogozhnikov commented 1 year ago

I'm currently missing how IDEs would apply (presumably existing) highlighting for html/sql/js/css/whatever if those contain any thunks:

html"<h1 {attr}={value} style={style} **{other_attributes}>Hello, {name}!</h1>"

An additional complexity is that multiple extensions in JS, like htm`<p ...${attributes}></p>`, already extend HTML syntax.

pauleveritt commented 1 year ago

@arogozhnikov I'm with PyCharm so I can only comment on it.

First, I can "language inject" a string and tell it that it has HTML inside of it. I can prefix f to make it an f-string with HTML. This would work with IDEs that support this PEP aka Python 3.13.

The ** part isn't part of Python f-strings. You want a spread syntax as in your ...${attributes}. In the implementation it appears as a dict with ** finishing the previous string. So it is something you could implement in your html tag function.

Lots of alternatives that people could investigate, if you didn't want the magical ... spread syntax.

FWIW my ViewDOM package (actually, the underlying htm.py package) does a spread approach (https://viewdom.readthedocs.io/en/latest/examples/components.html?highlight=spread#spread-props) and it is very convenient.
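
A minimal sketch of a tag function that treats a dict interpolation preceded by `**` as spread attributes. It assumes each interpolation arrives as a tuple whose first item is a callable returning the value, as in the prototype branch; the `html_tag` name and the escaping choices are illustrative. Since tag string syntax is not available in released Python, the function is called directly with hand-built arguments:

```python
from html import escape


def html_tag(*args):
    """Render args, spreading a dict wherever the preceding text ends in '**'."""
    out = []
    for arg in args:
        if isinstance(arg, str):
            out.append(arg)
            continue
        getvalue = arg[0]  # thunk: call it to evaluate the interpolation
        value = getvalue()
        if out and out[-1].endswith("**") and isinstance(value, dict):
            out[-1] = out[-1][:-2]  # drop the '**' marker from the text
            out.append(" ".join(f'{k}="{escape(str(v))}"' for k, v in value.items()))
        else:
            out.append(escape(str(value)))
    return "".join(out)


attrs = {"id": "title", "class": "big"}
result = html_tag(
    "<h1 **", (lambda: attrs, "attrs", None, None),
    ">Hello, ", (lambda: "World", "name", None, None), "!</h1>",
)
print(result)
```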

arogozhnikov commented 1 year ago

I think you missed my point. It is not about how to implement this syntax

Lots of alternatives that people could investigate...

And they likely will. And there will be several dialects of tag strings for html/SQL/PRQL/yaml/whatever, as it is currently in JS. And each syntax needs syntax check and highlighting, because they are not compatible.

I see no way for IDEs to support all of that without a standard way for tags themselves to expose syntax/highlight functions. Having a separate extension for every IDE (or even just two of them) for every package is a ton of work and maintenance - the extension maintainer depends on both the IDE and the package, and on two languages, plus they still need to implement syntax analysis. With rare heroic exceptions it will not be maintained for long.

The tag's statically inferable type should provide an additional function to run check/highlight, and the IDE extension should invoke it.

pauleveritt commented 1 year ago

True that every-possible-idea-not-yet-conceived will be hard to keep up with, as in JS. But a big chunk gets transferred out of custom-template-language (the Python status quo) into normal Python f-string and expression semantics and thus supported. IMO, a REALLY big chunk.

But you're right, there's room for some tagsplus package with idioms as they reach agreement. Shouldn't be in Python itself, though.

jimbaker commented 1 year ago

@arogozhnikov you raise some very good points.

Lit HTML (https://lit.dev/docs/templates/expressions/) and React JSX (https://react.dev/learn/javascript-in-jsx-with-curly-braces) take different approaches here. (Technically JSX doesn't build on tagged template literals, because it predates that adoption; for our purposes, let's just assume it does.)

Lit has a richer syntax, not that different from your use of splat, with @ (handlers), . (properties), and ? (booleans) providing context for how the expression is to be interpolated. However, I do wonder if we could instead just have the convention that the type of the interpolated expression is what matters here, vs the use of a sigil; at the very least this feels more Pythonic. (I recently wrote a bash script where I'm writing stuff like this in a shell interpolation, ${tests[@]} to have this output of one command splatted appropriately into another command usage.)

If we simply use the interpolation type, then we can assume that for at least a target DSL like HTML, per the Lit doc,

Templates must be well-formed HTML when all expressions are replaced by empty values.

and further simplified by the fact that there's no context sigil for the expression.

If that's the case, we can syntax highlight HTML in a standard way across implementing tags; and even typecheck standard DOM elements and their attributes (style should be a dict, hidden or checked should be a boolean, etc).

Let me think about other DSLs, but even then I think this convention can hold - if the interpolating expressions were replaced by an empty value, and the DSL is still well-formed, then the syntax highlighter should still work as expected.

I found https://code.visualstudio.com/api/language-extensions/embedded-languages interesting, along with https://www.jetbrains.com/help/idea/using-language-injections.html (which uses Java annotations to mark embedded usage). I'm sure there's a lot more we can look at.

arogozhnikov commented 1 year ago

Interesting idea.

Templates must be well-formed HTML when all expressions are replaced by empty values.

Agree, need to check how that works for DSLs.

Does it work for them?

<{tagname} {attr}={value} >content</>

I believe this breaks for sql/graphql and other query languages.

# simplest
select a from table where c = {c_value}
# with dynamic columns
select a, {field}, b from table where c = {c_value} order by {order_col}

jimbaker commented 1 year ago

Interesting idea.

Templates must be well-formed HTML when all expressions are replaced by empty values.

Agree, need to check how that works for DSLs.

Does it work for them?
<{tagname} {attr}={value} >content</>

Let's try it out. I find returning the args with an identity function useful for thinking about what tag functions do:

$ ./python
Python 3.12.0a7+ (heads/tag-strings-v2:f93052d4fb, Apr 26 2023, 16:40:39) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> def html(*args): return args
...
>>> html'<{tagname} {attr}={value} >content</>'
('<', (<function <lambda> at 0x7fd52045f5f0>, 'tagname', None, None), ' ', (<function <lambda> at 0x7fd52045f6a0>, 'attr', None, None), '=', (<function <lambda> at 0x7fd52045f750>, 'value', None, None), ' >content</>')

So we get an alternation of the (raw) text and thunks for each expression. Let's substitute in for each thunk the text PLACEHOLDER. This is still well-formed HTML, and it's true recursively if 'content' was also replaced with some arbitrary HTML with interpolations:

html'<PLACEHOLDER PLACEHOLDER=PLACEHOLDER >content</>'

Interestingly, we take this approach in this example when actually parsing the HTML template, to simplify working with the underlying html.parser.HTMLParser (and not use internal private functionality) - https://github.com/jimbaker/tagstr/blob/main/examples/htmldom.py#L207

I believe this breaks for sql/graphql and other query languages.
# simplest
select a from table where c = {c_value}

So we can try this as well. This is well-formed SQL:

sql'select a from table where c = PLACEHOLDER'

and likewise true with this example:

# with dynamic columns

select a, {field}, b from table where c = {c_value} order by {order_col}

select a, PLACEHOLDER, b from table where c = PLACEHOLDER order by PLACEHOLDER

So from a straightforward syntactic analysis, we should be able to preserve the well-formed quality of the input text with suitably chosen placeholders, possibly as simple as the one I used.

There's a separate question: is it possible to do deeper type checking on this? For example, could we type check that the interpolations would produce valid values for expressions in something like c = {c_value}? Maybe it's possible for terminals; for nonterminals it does seem much more involved. Regardless, it would not be in scope for the original issue.
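
The placeholder substitution above can be sketched as a small helper that rebuilds the text and records where each placeholder sits, so a highlighter could treat those ranges as interpolations. The `with_placeholders` and `thunk` names are hypothetical, and the argument shape (plain strings alternating with 4-tuples) is taken from the prototype output shown earlier:

```python
def with_placeholders(args, placeholder="PLACEHOLDER"):
    """Return (text, spans): the template text and each placeholder's (start, end)."""
    parts, spans, pos = [], [], 0
    for arg in args:
        is_text = isinstance(arg, str)
        text = arg if is_text else placeholder
        if not is_text:
            spans.append((pos, pos + len(text)))
        parts.append(text)
        pos += len(text)
    return "".join(parts), spans


def thunk(raw):
    # Stand-in for an interpolation: (getvalue, raw, conv, formatspec).
    return (lambda: None, raw, None, None)


args = (
    "select a, ", thunk("field"),
    ", b from table where c = ", thunk("c_value"),
    " order by ", thunk("order_col"),
)
text, spans = with_placeholders(args)
print(text)
```

The interpolations are never evaluated here, which matches what an editor can do without running user code.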

jimbaker commented 1 year ago

Also something that one could do is to annotate the expressions with content in the formatspec in the thunk. This of course looks just like a type annotation for a function:

>>> def html(*args): return args
...
>>> html'<div id={id:int} class={cls:list[str]} style={style:dict[str,str]} contenteditable={editable:bool}>{content:html|str}</div>'
('<div id=', (<function <lambda> at 0x7f0ada1bf5f0>, 'id', None, 'int'), ' class=', (<function <lambda> at 0x7f0ada1bf6a0>, 'cls', None, 'list[str]'), ' style=', (<function <lambda> at 0x7f0ada1bf750>, 'style', None, 'dict[str,str]'), ' contenteditable=', (<function <lambda> at 0x7f0ada1bf800>, 'editable', None, 'bool'), '>', (<function <lambda> at 0x7f0ada1bf8b0>, 'content', None, 'html|str'), '</div>')

The idea is that a type checker could verify that the interpolated expressions type accordingly, including avoiding truthiness/other type coercions (so familiar in JavaScript).

I don't know if this is simply being too clever, or actually useful in real code, so I'm putting this possibility out there. As can be seen, parsing works as "expected", because the : that separates the expression from the formatspec has low operator precedence:

>>> html'<div {id+100:int}>Name: {f(g(data)):str}</div>'
('<div ', (<function <lambda> at 0x7f0ada1bf960>, 'id + 100', None, 'int'), '>Name: ', (<function <lambda> at 0x7f0ada1bf540>, 'f(g(data))', None, 'str'), '</div>')
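
A runtime sketch of the same idea, for illustration only, since the proposal is for a type checker to do this statically. It assumes the 4-tuple interpolation shape shown above and understands only a few simple type names; the `SIMPLE_TYPES` table and `check_interpolations` helper are hypothetical:

```python
# Only a handful of simple names are understood; anything else is skipped.
SIMPLE_TYPES = {"int": int, "str": str, "bool": bool, "float": float}


def check_interpolations(*args):
    """Return error messages for values that do not match their formatspec type."""
    errors = []
    for arg in args:
        if isinstance(arg, str):
            continue
        getvalue, raw, _conv, formatspec = arg
        expected = SIMPLE_TYPES.get(formatspec)
        if expected is None:
            continue  # no spec, or one this sketch does not understand
        value = getvalue()
        # bool is a subclass of int, so require an exact match to avoid coercion.
        if type(value) is not expected:
            errors.append(f"{raw}: expected {formatspec}, got {type(value).__name__}")
    return errors


errors = check_interpolations(
    "<div id=", (lambda: 7, "id", None, "int"),
    " contenteditable=", (lambda: 1, "editable", None, "bool"),
    ">",
)
print(errors)
```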

EmilStenstrom commented 1 year ago

Slightly tangential: There seems to be a library called python-inline-source that uses types to do inline code highlighting in VS Code. Seems to work in practice today!

pauleveritt commented 1 year ago

Also something that one could do is to annotate the expressions with content in the formatspec in the thunk. This of course looks just like a type annotation for a function:
>>> def html(*args): return args
...
>>> html'<div id={id:int} class={cls:list[str]} style={style:dict[str,str]} contenteditable={editable:bool}>{content:html|str}</div>'

I'm surprised this would be considered legal. A format spec string value of 'list[str]' might be parseable, but wouldn't a linter or IDE complain that it isn't one of the allowed format spec values?

If not, then wow, my dependency injector could work on a template string without a wrapping function.

jimbaker commented 1 year ago

I'm surprised this would be considered legal. A format spec string value of 'list[str]' might be parseable, but wouldn't a linter or IDE complain that it isn't one of the allowed format spec values?

If not, then wow, my dependency injector could work on a template string without a wrapping function.

@pauleveritt it's already possible to use arbitrary format specs with f-strings, so there's nothing new here:

>>> class ArbitraryFormatSpec(str):
...   def __format__(self, formatspec):
...     print(f'Got this {formatspec=}')
...     return self
...
>>> s = ArbitraryFormatSpec('allows any format spec')
>>> f'Here is a string: {s:list[int]}'
Got this formatspec='list[int]'
'Here is a string: allows any format spec'

Obviously this formatspec doesn't use standard format specifiers, https://peps.python.org/pep-3101/#standard-format-specifiers, but it is allowable per PEP 3101 (https://peps.python.org/pep-3101/#controlling-formatting-on-a-per-type-basis). There are some nice code examples out there, see for example https://nedbatchelder.com/blog/202204/python_custom_formatting.html, where Ned discusses a Lat Long type and a specific mini formatting DSL for that.

Lastly, VSCode is perfectly happy with it being an arbitrary string in the formatspec position; and it will even helpfully highlight int above as being a type. Bug? Feature?

jimbaker / tagstr

Tag string use-case: language marker for syntax highlight #18