jimbaker / tagstr

This repo contains an issue tracker, examples, and early work related to PEP 999: Tag Strings

Interim Transpiler #20

Closed rmorshea closed 1 year ago

rmorshea commented 1 year ago

I've been wishing I could use tag strings lately and so to satisfy that craving I thought it would be cool to create an import-time transpiler that would rewrite:

my_tag @ f"my {custom} string"
#      ^ or any other operator not typically used with strings

To be:

my_tag("my ", (lambda: custom, "custom", None, None), " string")
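As a rough illustration of the mapping above, the standard ast module can split an f-string into its literal and interpolated parts. This is only a sketch: the helper name tag_call_source is made up here, nested fields inside a format spec are ignored, and the "raw" text comes from ast.unparse, so it may be normalized rather than byte-for-byte what the user typed.

```python
import ast


def tag_call_source(tag_name: str, fstring_source: str) -> str:
    """Return source for the tag call equivalent to `tag_name @ <f-string>`."""
    node = ast.parse(fstring_source, mode="eval").body
    if not isinstance(node, ast.JoinedStr):
        raise ValueError("expected a single f-string literal")
    parts = []
    for value in node.values:
        if isinstance(value, ast.FormattedValue):
            expr = ast.unparse(value.value)
            # conversion is -1 when absent, otherwise ord("r"), ord("s") or ord("a")
            conv = None if value.conversion == -1 else chr(value.conversion)
            spec = None
            if value.format_spec is not None:
                # Only literal text in the format spec is kept in this sketch.
                spec = "".join(
                    piece.value
                    for piece in value.format_spec.values
                    if isinstance(piece, ast.Constant)
                )
            parts.append(f"(lambda: {expr}, {expr!r}, {conv!r}, {spec!r})")
        else:
            parts.append(repr(value.value))  # literal text (ast.Constant)
    return f"{tag_name}({', '.join(parts)})"


print(tag_call_source("my_tag", 'f"my {custom} string"'))
# my_tag('my ', (lambda: custom, 'custom', None, None), ' string')
```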

The syntax seems clever in a few different ways:

Something like this seems like a rather natural extension of @pauleveritt's work in viewdom.

Implementation Details

Some potential issues I've thought of and ways to solve them.

Static Typing

To deal with static type analyzers complaining about the unsupported @ operator, tag functions can be decorated with:

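from typing import Any, Callable, Generic, Protocol, TypeVar, cast

T = TypeVar("T")
T_co = TypeVar("T_co", covariant=True)
# assumed shape of a thunk, taken from the example above: (getvalue, raw, conv, formatspec)
Thunk = tuple[Callable[[], Any], str, str | None, str | None]

class TagFunc(Protocol[T_co]):
    def __call__(self, *args: str | Thunk) -> T_co: ...

class Tag(Generic[T_co]):
    def __call__(self, *args: str | Thunk) -> T_co: ...
    def __matmul__(self, other: str) -> T_co: ...

def tag(func: TagFunc[T]) -> Tag[T]:
    # convince MyPy that our new syntax is valid
    return cast(Tag[T], func)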

An alternate syntax of my_tag(f"...") would not require this typing hack since tag functions must already accept *args: str | Thunk. The main problem here is that there are probably many times where some_function(f"...") would show up in a normal code-base. Thus it would be hard to determine whether any given instance of that syntax ought to be transpiled. Solving this would require users to mark which identifiers should be treated as tags - perhaps with a line like set_tags("my_tag"). This seems more inconvenient than having the tag author add the aforementioned decorator though.

Performance

To avoid transpiling every module, users would need to indicate which ones should be rewritten at import-time. This could be done by having the user import this library somewhere at the top of their module. At import time, the transpiler, before parsing the file, would then scan it for a line like import <this_library> or from <this_library> import ....
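A cheap pre-parse check along these lines could look roughly like the following. The function name imports_library and the default library name are placeholders, and the regular expression only recognizes the two simple forms mentioned above (so something like "import os, this_library" would be missed).

```python
import re


def imports_library(source: str, library: str = "tagstr") -> bool:
    """Return True if `source` has a line importing `library` (simple forms only)."""
    name = re.escape(library)
    pattern = (
        rf"^[ \t]*(?:import[ \t]+{name}\b"
        rf"|from[ \t]+{name}(?:\.\w+)*[ \t]+import\b)"
    )
    # MULTILINE lets ^ match at the start of every line, so commented-out
    # imports such as "# import tagstr" are not picked up.
    return re.search(pattern, source, re.MULTILINE) is not None
```

Because this only scans text, it never has to parse files that do not opt in.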

gvanrossum commented 1 year ago

Cool, and good that you've already thought through some alternatives in the design space. @ seems brilliant because, indeed, it's not something one would do with an f-string on the RHS.

How were you planning to trigger the transpiler? As an import hook, or as a codec?

rmorshea commented 1 year ago

My plan was to use an import hook, but I wasn't aware codecs could be used for this purpose, so that could be a good alternative as well.

gvanrossum commented 1 year ago

See https://peps.python.org/pep-0263/; you can register a codec with an arbitrary name. This would take the place of your "marker import". IIRC there are some issues with getting the codec registered when your package is installed though.
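For reference, registering a codec under an arbitrary name looks roughly like this. It is a minimal sketch: the name tagstr_demo and the transform stand-in are made up, and only the stateless decode path is covered. Depending on how a file is loaded, Python may use an incremental decoder instead, which this sketch does not provide.

```python
import codecs

CODEC_NAME = "tagstr_demo"  # hypothetical codec name


def transform(source: str) -> str:
    """Stand-in for the real rewrite; returns the source unchanged."""
    return source


def decode(data, errors="strict"):
    # Decode as UTF-8 first, then hand the text to the rewrite step.
    text, consumed = codecs.utf_8_decode(bytes(data), errors, True)
    return transform(text), consumed


def search(name):
    # Called by the codec registry with the normalized encoding name.
    if name != CODEC_NAME:
        return None
    return codecs.CodecInfo(
        encode=codecs.utf_8_encode, decode=decode, name=CODEC_NAME
    )


codecs.register(search)
```

A file would then opt in with a first-line comment such as "# coding: tagstr_demo", provided the registration has already run.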

rmorshea commented 1 year ago

I think a codec could be better for a lot of reasons:

IIRC there are some issues with getting the codec registered when your package is installed though.

Would the approach be to use a .pth file to call codecs.register(my_codec) before the user's code is imported?

gvanrossum commented 1 year ago

Yeah, if the .py file and the .pyc file match, the source is never read, so the codec isn't run. Pure win!

I suspect that there are some problems with ast.unparse though, since the AST doesn't preserve comments or whitespace. Notably the line numbers after the unparsing will differ, which will make tracebacks hugely confusing.

We used the codec trick at Dropbox for pyxl3, and IIRC the rewrite was done very differently, to ensure that the line numbers matched. I think the "parsing" was probably done with a regular expression. That should work here too.
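To illustrate what a line-preserving, regex-based rewrite might look like here (this is not how pyxl3 does it, just a sketch for the proposed syntax), the version below handles only single-line, double-quoted f-strings with plain {expr} fields. The names TAG_EXPR, FIELD, rewrite_line and rewrite_source are invented for the example.

```python
import re

# tag @ f"..." on one line; the body may not contain quotes or newlines
TAG_EXPR = re.compile(r'(?P<tag>[A-Za-z_][\w.]*)\s*@\s*f"(?P<body>[^"\n]*)"')
FIELD = re.compile(r"\{([^{}]*)\}")


def _replace(match):
    body = match.group("body")
    parts = []
    pos = 0
    for field in FIELD.finditer(body):
        if field.start() > pos:
            parts.append(repr(body[pos:field.start()]))
        expr = field.group(1).strip()
        parts.append(f"(lambda: {expr}, {expr!r}, None, None)")
        pos = field.end()
    if pos < len(body):
        parts.append(repr(body[pos:]))
    return f"{match.group('tag')}({', '.join(parts)})"


def rewrite_line(line: str) -> str:
    return TAG_EXPR.sub(_replace, line)


def rewrite_source(source: str) -> str:
    # Rewriting line by line guarantees the line count never changes.
    return "\n".join(rewrite_line(line) for line in source.split("\n"))
```

Since every substitution stays within its own line, line numbers in tracebacks still point at the right line of the original file, although column offsets will not match.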

pauleveritt commented 1 year ago

What would be the story for tracebacks and getting back to lines in source?

gvanrossum commented 1 year ago

The traceback code looks up the line number in the untranslated source (linecache.py just opens the file in text mode, no encoding parameter). So pyxl ensures that the translated line numbers match the original line numbers (but the column offsets don't). I think the translation Ryan proposes should be able to preserve line numbers as well.

rmorshea commented 1 year ago

Crazy idea, but what if...

my_tag @ f"my {super} {custom} string"

Became

my_tag((       super,  custom), raw=("super", "custom"), conv=(None, None), formatspec=(None, None), strings=("my ", " ", " string"))

Where the @tag decorator would zip and merge things as necessary to resolve differences between this new interface and the one defined for tag strings. Doing this might seem convoluted, but the neat thing about it is that you'd be able to move the expressions from the string around to match their column offsets as needed. This would even work for multi-line strings:

my_tag @ f"""
my
extra {super}
{custom}
string
"""

Would become:

my_tag((

       super,
 custom), raw=("super", "custom"), conv=(None, None), formatspec=(None, None), strings=("\nmy\nextra ", "\n", "\nstring\n"))

rmorshea commented 1 year ago

Shoot! The evaluation of the expressions would no longer be lazy.
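To make the problem concrete, here is a sketch of what the merging decorator would have to do with that call shape (the function names are only illustrative). The wrapper receives values that have already been computed, so the thunks it builds can only hand them back.

```python
def tag(func):
    """Adapt the keyword-based call form to the positional tag-function interface."""

    def wrapper(values, *, raw, conv, formatspec, strings):
        args = []
        for index, text in enumerate(strings):
            if text:
                args.append(text)
            if index < len(values):
                value = values[index]
                # The value already exists here, so this "thunk" is not lazy.
                thunk = (lambda value=value: value, raw[index], conv[index], formatspec[index])
                args.append(thunk)
        return func(*args)

    return wrapper


@tag
def show(*args):
    # Resolve each thunk so the result is easy to inspect.
    return [a if isinstance(a, str) else (a[0](), *a[1:]) for a in args]


result = show(
    (1, 2),
    raw=("super", "custom"),
    conv=(None, None),
    formatspec=(None, None),
    strings=("my ", " ", " string"),
)
```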

rmorshea commented 1 year ago

The last way I can think of to preserve column offsets in tracebacks is by passing information about the location of expressions in the original source and using that to modify tracebacks which arise within the tag function itself.

my_tag("my ", (lambda: custom, "custom", None, None, 1, 16), " string")

Where the @tag decorator would modify my_tag by doing something like:

def tag(func):

    def wrapper(*args):
        new_args = []
        for a in args:
            match a:
                case str():
                    new_args.append(a)
                case getvalue, src, conv, spec, *src_info:
                    # src_info holds the optional (lineno, col_offset) pair
                    getvalue = modify_tracebacks(getvalue, *src_info)
                    new_args.append((getvalue, src, conv, spec))
        # pass on the rewritten thunks rather than the original args
        return func(*new_args)

    return wrapper

def modify_tracebacks(getvalue, lineno=None, col_offset=None):
    # compare against None so that a col_offset of 0 is still handled
    if lineno is None or col_offset is None:
        return getvalue

    def new_getvalue():
        try:
            return getvalue()
        except Exception as error:
            # modify the traceback with the appropriate lineno and col_offset somehow
            # (the ... is a placeholder for the rewritten traceback)
            raise error.with_traceback(...)

    return new_getvalue

If this worked, it'd mean that the transpiler wouldn't even need to worry about preserving line numbers in the areas of code it modified.

gvanrossum commented 1 year ago

IMO it's not worth worrying about column offsets for the initial prototype.

pauleveritt commented 1 year ago

@rmorshea If there's a way for me to join in with what you're doing and re-parent my stuff on your interim transpiler, let me know.

rmorshea commented 1 year ago

Will do. I might have time to create a repo for this tonight, but otherwise I won't be able to do much until next week.

jimbaker commented 1 year ago

@rmorshea while I like the syntax, it's problematic, as I mention here (https://github.com/jimbaker/tagstr/issues/3#issuecomment-1426611011): we need to preserve thunks because they give control over interpolation.

rmorshea commented 1 year ago

@jimbaker the intention here is to transpile the tag @ f'...' syntax such that it conforms to the tag string spec as explained here. I think this could be a useful tool for us as we work on this PEP, but also as a way to backport tag strings to older versions of Python.

rmorshea commented 1 year ago

So, I managed to create a custom tagstr encoding. Unfortunately though, this doesn't play well with Black since it decodes the file before reformatting it. Thus, the version that gets saved is the transformed version, not the one the user authored. Anyone have ideas on how this could be avoided?

rmorshea commented 1 year ago

Ok, the hack I came up with involves stuffing the original source at the end of the file. The tagstr encoder then searches for the original source and returns that. This solves the problem of Black saving the transformed text, but it doesn't allow Black to do its job. The only way I can think of to work around this is to allow users to set an environment variable TAGSTR=off before running Black.

It would be nice if there were a way to tell if the codec was running while formatting code so users didn't have to set the environment variable, but this works for now I suppose.
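Roughly, the round trip works like the sketch below. The sentinel comment, the base64 payload and the function names are illustrative choices for this example, not necessarily what the package does.

```python
import base64
import os

SENTINEL = "# tagstr-original: "  # hypothetical marker for the stashed source


def decode_source(original: str, transform) -> str:
    """Decoder side: transform the source and stash the original at the end."""
    if os.environ.get("TAGSTR") == "off":
        return original  # let formatters see (and rewrite) the real source
    payload = base64.b64encode(original.encode("utf-8")).decode("ascii")
    return f"{transform(original)}\n{SENTINEL}{payload}\n"


def encode_source(text: str) -> str:
    """Encoder side: if a stashed original is present, write that back instead."""
    _, found, payload = text.rpartition("\n" + SENTINEL)
    if not found:
        return text
    return base64.b64decode(payload.strip()).decode("utf-8")
```

With the stash in place, anything that decodes and re-encodes the file writes back the authored text, which is also why a formatter cannot change it unless TAGSTR=off is set.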

gvanrossum commented 1 year ago

Wow, the last time I used the encoding hack, things like Black weren't an issue...

I guess an import hook might be better.

rmorshea commented 1 year ago

Welp, it's published.

pip install tagstr

The hook expects there to be an import tagstr statement at the top of any file that should be transformed:

import tagstr

@tagstr.tagfunc
def func(*args):
    print(args)

name = "world"
func @ f"hello {name}!"

I'll work on adding an IPython cell magic so this can be used in Jupyter Notebooks/Lab. Not really sure if there's a similar way to inject the transformer into the standard Python REPL though.
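For anyone curious what an import hook of this kind can look like, here is a minimal sketch built on importlib. It is not the code in the package: install, TranspilingLoader, the transform argument and the marker default are all assumptions made for the example.

```python
import importlib
import importlib.machinery as machinery
import importlib.util
import sys


def install(transform, marker="# tagstr: on"):
    """Install a path hook whose source loader runs `transform` on marked files."""

    class TranspilingLoader(machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            # data is normally the raw bytes of the .py file
            source = importlib.util.decode_source(data) if isinstance(data, bytes) else data
            if marker in source:
                source = transform(source)
            return super().source_to_code(source, path, _optimize=_optimize)

    # Keep the usual extension and bytecode loaders so other imports still work.
    hook = machinery.FileFinder.path_hook(
        (machinery.ExtensionFileLoader, machinery.EXTENSION_SUFFIXES),
        (TranspilingLoader, machinery.SOURCE_SUFFIXES),
        (machinery.SourcelessFileLoader, machinery.BYTECODE_SUFFIXES),
    )
    sys.path_hooks.insert(0, hook)
    # Drop cached finders so directories already on sys.path pick up the hook.
    sys.path_importer_cache.clear()
    importlib.invalidate_caches()
```

One thing to keep in mind with this approach is that the compiled result is cached in __pycache__ like any other module, so the transform only runs when the source changes.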

rmorshea commented 1 year ago

I threw this together pretty quickly so there's definitely gonna be some bugs and rough edges.

pauleveritt commented 1 year ago

Quite interesting, @rmorshea. Any chance you're at PyCon? I'm sprinting the first day.

rmorshea commented 1 year ago

Unfortunately I am not. Would love to participate remotely if that's possible. Feel free to email me: ryan.morshead@gmail.com

jimbaker commented 1 year ago

I will also be in person at the sprints through Monday afternoon. This will be a chance for me to get back into this work - I have been very busy with other things. Fortunately I feel like discussing another issue has started to page back into my mind what we have been trying to do here 😁

rmorshea commented 1 year ago

Published another release of the tagstr transpiler. Includes a number of fixes/changes:

I still feel like I'm doing something wrong in the import hook so I suspect there are probably other latent issues to be fixed.

It's also worth noting that the tagfunc decorator I used in the earlier example is not technically required. Rather, it exists purely to satisfy type checkers:

# tagstr: on
name = "world"
print @ f"hello {name}!"

Which prints:

hello  (<function <lambda> at 0x7f7a8586b920>, 'name', None, None) !

pauleveritt commented 1 year ago

Now that @rmorshea has published the transpiler, can this ticket get closed?

rmorshea commented 1 year ago

I think so