jimbaker / tagstr

This repo contains an issue tracker, examples, and early work related to PEP 999: Tag Strings
51 stars 6 forks source link

Example tags #8

Closed jimbaker closed 1 year ago

jimbaker commented 2 years ago

Some example tags we can implement. Let's update this list over time with linked issues and any new tags that might be useful.

gvanrossum commented 2 years ago

We could also have one that removes indentation from a triple-quoted string, like textwrap.dedent(), so you can write

from textwrap import dd
def main():
    code = dd"""
        def f(x, y):
            print(x, x+y)
    """
    eval(code)

instead of having to place the text flush left (breaking the visual indentation) or call textwrap.dedent() manually on the code.

ericsnowcurrently commented 2 years ago

We could also have one that removes indentation from a triple-quoted string, like textwrap.dedent()

This could be especially helpful when you have multi-line text you inserting. Here's an example with complex regular expressions:

VERSION = textwrap.dedent(r'''
    (?:
        ( \d+ )  # <major>
        \.
        ( \d+ )  # <minor>
        (?:
            \.
            ( \d+ )  # <micro>
         )?
        (
            ( a | b | c | rc | f )  # <level>
            ( \d+ )  # <serial>
         )?
     )
'''
DATE = textwrap.dedent(r'''
    (?:
        ( \d{4} )  # <year>
        -
        ( \d\d )  # <month>
        -
        ( \d\d )  # <day>
     )
'''
REGEX = re.compile(rf'''
    ^
    (?:
        (?:
            ( v )?  # <prefix>
            {VERSION}
         )
        |
        (?:
            {DATE}
         )
     )
    $
''', re.VERBOSE)

I've had to deal with this in a number of projects. Without any extra effort, the resulting patterns are harder to read, which matters a lot when you are trying to debug a regex. A textwrap.dd could easily apply the indent to the interpolated multi-line string.

Another interesting variation would be a similar tag in the re module, for verbose patterns like the ones above.

jimbaker commented 2 years ago

There's potential utility for a built-in textwrap.dd that does interpolations, since that's often why we have quoted code fragments - we are building out code for eval. One possible gotcha in usage - in the above example, DATE is compiling a regex that uses \d{4} (common for years, perhaps the most common use of quantifiers to specify repetitions). On the other hand, as we see with REGEX, the use of format-style '{}' interpolations in regexes is very common in the stdlib, especially in tests. This distinction seems to work fine in practice.

This is the nature of having metacharacters - we have to distinguish them from the code that's using them (whether Python, regex minilanguage, or \LaTeX). Doubling the braces to get the behavior in the underlying language, as opposed to interpolation, seems reasonable and hopefully everyone is used to by now in using f-strings.

jimbaker commented 2 years ago

@gvanrossum See https://github.com/jimbaker/tagstr/blob/main/examples/code.py, which implements a fairly minimal Python code templating tag, code. However, so far I haven't figured out good ergonomics for this tag. In any event, it can be used like so:

from code import code

def main():
    my_code = code"""
        def f(x, y):
            print(x, x+y)
    """
    exec(str(my_code), globals())
    # f is now available in globals
    f(1, 2)

main()

(Note that exec is needed here, not eval.)

It does do the autodedent, plus supporting interpolation. Interestingly it can do stuff like this, where we have code inserted into code. I think it's sufficiently white-space aware, although clearly Python is not a Lisp with lots of parentheses.

from code import code

y = 7

def main():
    my_code = code"""
        def f(x, y):
            print(x, x+y)
    """
    exec(str(my_code), globals())
    f(1, 2)

    some_code = code"""
        def outer(x):
            {my_code}
            f(x, y)
    """

    print(some_code)
    exec(str(some_code), globals())    
    outer(5)

main()

It feels like there's something potentially useful here for templating code. I would like to cover examples like the ones seen in dataclasses, such as in its _create_fn - https://github.com/python/cpython/blob/3.10/Lib/dataclasses.py#L412

gvanrossum commented 2 years ago

Okay, so the weird thing is that you have to write exec(str(my_code)) rather than just exec(my_code), so that the second example, where {my_code} occurs inside another code string. Maybe we can make that smoother by making the tag return a subclass of str? Then exec(code"...") will just work, but the subclass can be special-cased for interpolations.

I wouldn't get too tied to dataclasses here, they're stdlib magic and we're more concerned about performance there than about readability of the implementation. :-)

jimbaker commented 2 years ago

@ericsnowcurrently It would be interesting to implement a minimal re tag that could provide some useful templating for regular expressions. Maybe this is just the equivalent of fr and textwrap.dedent. Another thought is that there might be a useful tag for a subset of regular expressions. So this could be something to support tags like this one:

version = globish"{major:\d+}.{minor:\d+}(.{micro:\d+})?({level:a|b|c|rc|f}{serial:\d+})?"

as we can see with the zeroth implementation of globish:

def globish(*args: str | Thunk):
    print(args)

which results in something like the following:

((<function <lambda> at 0x101b00520>, 'major', None, '\\d+'), '.', (<function <lambda> at 0x101b02780>, 'minor', None, '\\d+'), '(.', (<function <lambda> at 0x101b02af0>, 'micro', None, '\\d+'), ')?(', (<function <lambda> at 0x101b01c80>, 'level', None, 'a|b|c|rc|f'), (<function <lambda> at 0x101b8d860>, 'serial', None, '\\d+'), ')?')

In particular, we can ignore getvalue in the thunk, and instead use raw and formatspec for some very creative DSL construction. 😁 In particular, it might enable unapply schemes in structural pattern matching, as well as recursive construction of the globish matchers.

jimbaker commented 2 years ago

Okay, so the weird thing is that you have to write exec(str(my_code)) rather than just exec(my_code), so that the second example, where {my_code} occurs inside another code string. Maybe we can make that smoother by making the tag return a subclass of str? Then exec(code"...") will just work, but the subclass can be special-cased for interpolations.

Yes, that sounds like a very good nice simplification in usage! Most tags will have a natural string representation and corresponding usage. (Additional methods for the rest.) Note that I had a workaround for this in the shell example, where I used the fact subprocess.run would work with an __iter__, but still accept shell=True, but subclassing str would have been better. I'm sure it will come up in other tags too.

I wouldn't get too tied to dataclasses here, they're stdlib magic and we're more concerned about performance there than about readability of the implementation. :-)

Hah. Now arguably there might a suitable code tag that could have a C implementation, and be even faster than the tuned dataclasses implementation. But we are geting ahead of ourselves!

gvanrossum commented 2 years ago

Maybe there could be a built-in ‘re’ tag that compiles a regex. Somewhere could do re”a.*z”.search(line). (Hm, that might look like line noise to some. Then again that is in regex’s nature. :-)

-- --Guido (mobile)

rmorshea commented 2 years ago

One that could be interesting would be a tag that parses an RST or Markdown string and produces a tree of docutils nodes. This could be useful for writing sphinx extensions. I ran into some cases where I found it easier to just write RST template strings than trying to figure out to construct the nodes properly myself. This strategy of just using normal string formatting turned out to be problematic when I introduced the Myst parser extension to my Sphinx project since that changes the underlying rendering machinery. Not something for the standard library, but useful nonetheless.

jimbaker commented 2 years ago

One that could be interesting would be a tag that parses an RST or Markdown string and produces a tree of docutils nodes...

This makes sense to me - basically if we have a DSL that would benefit from interpolation and/or recursive construction, it seems like the tag string support is actually quite nice. Let's work through an example!

Not something for the standard library, but useful nonetheless.

So far I don't think we have any tags that would be candidates for the stdlib. I was initially thinking we might need taglib in the stdlib, but given the simplifications that we reached at PyCon, it is so-far small and probably should evolve separately as a third-party library in PyPI. (Some functionality that should be added to it would include 1) template compilation; 2) caching support; 3) reconstruction of the entire raw string, including taking into account conv.)

jimbaker commented 2 years ago

I updated the example shell and code tags to use the marker string approach (subclass str using the usual approach). So this enables this usage, no str(...) required to exec:

from code import code

y = 7

def main():
    my_code = code"""
        def f(x, y):
            print(x, x+y)
    """
    exec(my_code, globals())
    f(1, 2)

    some_code = code"""
        def outer(x):
            {my_code}
            f(x, y)
    """

    print(some_code)
    exec(some_code, globals())    
    outer(5)

main()
jimbaker commented 2 years ago

Maybe there could be a built-in ‘re’ tag that compiles a regex. Somewhere could do re”a.*z”.search(line). (Hm, that might look like line noise to some. Then again that is in regex’s nature. :-)

I implemented the re tag in https://github.com/jimbaker/tagstr/blob/main/examples/linenoise.py

It's possible this might actually be useful! Although calling it re here in linenoise.py is a bit much I think 😀

ericsnowcurrently commented 2 years ago

The catch with a tag for regular expressions is that the exiting "r" prefix is usually important for regular expressions.

rmorshea commented 2 years ago

If it's correct to assert "\n".encode("unicode_escape").decode() == r"\n" (this particular example works, but I'm not sure about others), then there could be an rer tag which would make it as if the string were raw.

gvanrossum commented 2 years ago

The catch with a tag for regular expressions is that the exiting "r" prefix is usually important for regular expressions.

That's why the current proposal always passes the "raw mode" string to the tag function, i.e. rer"-\n-" is the same as rer(r"-\n-").

rmorshea commented 2 years ago

Maybe a path tag for pathlibcould be interesting? The tagged version seems a bit more readable to me:

some_dir = "dir1"
list_of_dirs = ["dir3", "dir4"]
something = "special.txt"

assert (
    path"/{some_dir}/dir2/{list_of_dirs}/this_is_{something}.txt"
    == Path("/", some_dir, "dir2", *list_of_dirs, f"this_is_{something}.txt")
)
gvanrossum commented 2 years ago

That example is only somewhat compelling because pathlib's / operator doesn't support list_of_dirs. But it looks like you could write it this way:

Path("/") / some_dir / "dir2" / Path(*list_of_dirs) / f"this_is_{something}.txt"

It's longer but such complicated paths are uncommon.

Though maybe the example would be more compelling if it referred to a real use case? Spam and ham aside, abstract examples don't speak to the reader as much as examples that look like something you'd actually use? E.g. an example using a Point class with an x and y coordinate is more interesting than one that uses MyClass with attr1 and attr2 attributes. (This is why database examples always used to be written using Employee tables. :-)

jimbaker commented 2 years ago

I added an example sql tag in https://github.com/jimbaker/tagstr/blob/main/examples/sql.py

So given

    table_name = 'lang'
    name = 'C'
    date = 1972

    names = ['Fortran', 'Python', 'Go']
    dates = [1957, 1991, 2009]

    with sqlite3.connect(':memory:') as conn:
        sql = Qmarkify(conn)
        sql'create table {table_name} (name, first_appeared)'
        sql'insert into {table_name} values ({name}, {date})'
        sql'insert into {table_name} values ({names}, {dates})'

it does the following:

This particular tag would need more work for real usage, but it seems possibly useful.

gvanrossum commented 2 years ago

Cool. It looks like it would have to have complete knowledge of SQL grammar to be able to decide how to do the quoting, right?

I am a little hesitant about sql'blah blah' executing the code. And the way to construct the sql tag on the fly so it incorporates the connection is a little odd (all these things violate the initial impression "this is a string literal with some extra stuff").

How would you do a "prepare" style command? I guess you wouldn't.

ericvsmith commented 2 years ago

I think the only way you'd want to use this is to build a parameterized query. For example, I can't imagine wanting to allow

x = 'e t'
sql'creat{x}able {table_name} (name, first_appeared)'

Most if not all database engines cache parameterized queries, so I'm not sure how useful "prepare" is any more.

jimbaker commented 2 years ago

Cool. It looks like it would have to have complete knowledge of SQL grammar to be able to decide how to do the quoting, right?

I don't believe so. See for example @ericvsmith 's example:

    with sqlite3.connect(':memory:') as conn:
        x = 'e t'
        sql'creat{x}able {table_name} (name, first_appeared)'

results in this syntactically invalid SQL:

creat"e t"able "lang" (name, first_appeared)

which raises this error, sqlite3.OperationalError: near "creat": syntax error.

Arguably this is a correct interpolation.

I am a little hesitant about sql'blah blah' executing the code.

For sure. This tag really gives access to all of SQL. So there's no need for a standard Bobby Tables SQL injection attack, this tag is directly enabling:

drop table {table_name}

and having the table_name be interpolated in (even if quoted). Why this madness? 😀 Because we sometimes want to write data definititions (DDL), not just data modification (DML).

A real tag built around this should make it clear what is being enabled, and where any specific interpolations can happen. I'm pretty sure the placeholder rewrite is safe because it uses the SQL values keyword to identify this location. TBD.

And the way to construct the sql tag on the fly so it incorporates the connection is a little odd (all these things violate the initial impression "this is a string literal with some extra stuff").

For this example tag, it might make more sense to have it support __str__ (so render insert into "lang" values (?, ?), creat"e t"able "lang" (name, first_appeared), etc), and then tie it in with a cursor/connection model. Regardless it seemed like an interesting way to implement the tag.

How would you do a "prepare" style command? I guess you wouldn't.

This concern certainly does not apply to SQLite. I'm not certain about any other SQL dialects at this point (I would agree with @ericvsmith in general on this, other than to think that SQL is so varied in its implementations).

jimbaker commented 2 years ago

I updated the example sql tag so that one can write code like this now:

  1. There's no longer any implicit cursor management done. However, it does require knowing when to use execute or executemany.
  2. sql and sql_unsafe tag functions are created so that we can choose whether to use with identifier quoting (unsafe) or placeholder only (safe).
def demo():
    table_name = 'lang'
    name = 'C'
    date = 1972

    names = ['Fortran', 'Python', 'Go']
    dates = [1957, 1991, 2009]

    with sqlite3.connect(':memory:') as conn:
        cur = conn.cursor()
        cur.execute(*sql_unsafe'create table {table_name} (name, first_appeared)')
        cur.execute(*sql_unsafe'insert into {table_name} values ({name}, {date})')
        cur.executemany(*sql'insert into lang values ({names}, {dates})')

        # FIXME time to write proper unit tests!
        # NOTE assumes that SQLite maintains insertion order (as it apparently does)
        assert list(cur.execute('select * from lang')) == \
            [('C', 1972),  ('Fortran', 1957), ('Python', 1991), ('Go', 2009)]

        try:
            cur.execute(*sql'drop table {table_name}')
            assert 'Did not raise error'
        except ValueError:
            pass

I also identified some additional changes that can be done, such as supporting recursive construction of the statement.

jimbaker commented 2 years ago

One rather cool thing is being able to see this, given the support for the raw expression text:

ValueError: Cannot interpolate 'table_name' in safe mode
jimbaker commented 2 years ago

I completely revamped the sql tag example such that identifiers have to be explicitly marked.

So this looks like the following:

cur.execute(*sql'create table {Identifier(table_name)} (name, first_appeared)')

This was motivated in part because if one looks at the sqlite3 reference manual, it becomes apparent that there is no easy way to identify where a placeholder vs an identifier could go, without writing a parser for the SQL dialect. While there are a few SQL parsers on PyPI that potentially be used here, with suitable interpolation points being placed to work with them, I didn't want to rely on them for the PEP.

I also removed the support for executemany since is problematic with the approach I'm trying here to support subqueries (use temporary tables presumably for such things).

jimbaker commented 1 year ago

Closing this out. We have written some interesting examples. Future work can look at internationalization, Latex, etc, in separate issues.