Expression quoting - Githubissues

jimbaker / fl-string-pep

PEP to add fl-string support to Python 3.10

0 stars 0 forks source link

Expression quoting #6

Open gvanrossum opened 4 years ago

gvanrossum commented 4 years ago

Another radical idea.

Define a Quote class that represents a quoted expression. It gets passed the string representation of the expression and a lambda that evaluates it in the original context (though perhaps the walrus assigns to a variable in the containing scope). Like in #4, the lambda can be used to recover the cellvars referenced by the expression (and the globals).

Now if we write foo{a+1} this constructs q = Quote('a+1', lambda: a+1) and then calls foo.__qcall__(q). If we write foo{x+1, y-1} it would construct two Quotes (for x+1 and for y-1) and then call foo.__qcall__(q, r) with those two. Etc.

Next we could allow an alternate function definition, written as

def foo{x, y}:  # Curlies instead of parens!
    ...

This would just be a shorthand for defining a class foo with a __qcall__ method whose argument list is (self, x, y) and whose body is exactly the body of foo above.

Now we can write clever functions like

def qmap{qexpr, qarg}:
    arg = qarg()
    for x in arg:
        yield apply{qexpr, x}

# Call example:
print(list(qmap{x+1, range(10)}))
# Output: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

We need a builtin apply() that takes an expression and argument (both quoted) and somehow calls the expression on the argument. This would use the co_cellvars trick again. The expression would have one free variable that should correspond to the name of the argument. (So the x in for x in arg must correspond to the x in x+1.)

This would satisfy some of the desire to have functions that see their arguments as quoted expressions -- the problem has always been that there's no way that the parser can know that a function needs its arguments quoted, which we solve here by using {} instead of () for the call syntax. (Luckily Python uses a[...] and a(...) but not yet a{...}. :-)

That's as far as I got before family interrupted today.

jimbaker commented 4 years ago

(First, there's a typo in qmap's definition. Old habits!)

I do like the symmetry here. With this addition, we would have 3 calling schemes in Python:

Call-by-reference. The usual.

Call-by-index. Likely not normally thought of as such, but it does apply eager evaluation to the desired slice args, then made available as a slice object to __getitem__ - in other words, not so dissimilar to what is being proposed here. In any event, this support has seen some nice creative usage since slicing supported strides :). Example from NumPy, modestly modified from the docs:

>>> import numpy as np
>>> x = np.array([1., -1., -2., 3, 4, -4]) 
>>> x[x > 2]

results in

array([3., 4.])

given that x > 2 is array([False, False, False, True, True, False])

Unfortunately, one cannot then do some obvious things like

>>> x[1 <= x < 3]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

because that's an implicit use of and. Instead in NumPy, this has to be rewritten as x[(1 <= x) & (x < 3)], with additional parentheses due to operator precedence. In addition, both the binary arrays 1 <= x and x < 3 have to be computed, even if there were no values greater than 1 in x - there's no ability to support short-circuit evaluation (at least in this current immediate execution model of NumPy; but see also https://www.mathworks.com/help/matlab/ref/logicaloperatorsshortcircuit.html).

Call-by-name. So with the above, NumPy could instead define __qcall__ for array objects, and get the desired behavior with x{1 <= x < 3}. If this is running in PySpark, the expression can be parallelized across the cluster automatically. Etc. It would be a useful exercise to see what possible ergonomic simplifications are possible for such popular tools. Example: in https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html there are examples of code usage that possibly could be simplified in this fashion.

So at first glance, this concept works for me!

It would be interesting to think through a recursive qdef, and whether that's useful or not - recursive generators certainly are.

gvanrossum commented 4 years ago

Yeah, that's what I was going for. The big question is if having the free variable of the quoted expression be implicit will work.

I'm thinking about how to define apply{} and am coming up with the following (where apply___qcall__ is the __qcall__ that's generated by def apply{...}):

def apply___qcall__(qqfunc: Quote, qarg: Quote) -> Any:
    qfunc = qqfunc()  # Quote('x+1', lambda: x+1)
    var = qarg.raw  # 'x'
    f = ??  # extract free variables from func and construct a new lambda with one argument named x
    arg = qarg()  # 0
    return f(arg)  # 1

I don't think I can do the step f = ?? in general pure Python -- this will be a new primitive we'll have to add to the Python VM and make accessible through a builtin (maybe hidden in sys or someplace). Consider if the expression is x+y instead of x+1 -- from the argument-less lambda's POV x and y are both free variables, but we want to turn this into code that changes x into a parameter but preserves the scope of y (either a cell or a global, as the case may be).

Anyway the idea would be to combine this with something like PEP 501's i-string objects, but with each of the interpolations presented as a Quote object instead of a raw string and a lambda. (The format spec and conversion still need to be given separate somehow, and of course the fixed strings in between interpolations too.)

Translation-marked strings would have to reconstruct the raw string by combining the fixed strings, the raw attributes of the interpolations, and the format spec and conversion into the "original" string to be looked up. (Unfortunately there are some edge cases, e.g. I don't believe string.Formatter().parse() can distinguish between "foo{bar}" and "foo{bar:}", since it returns an empty string as the format_spec in both cases.)

Thinking about it more, the compiler may have to produce something slightly more complicated than a parameter-less lambda, in order to make the f = ?? step more tractable. (It would also help with the scope of walrus operators occurring there.)

jimbaker commented 4 years ago

Sounds good in terms of next steps exploring this idea - apply___qcall__ makes much clearer the intended mechanism. I did find it rather cool that in #4 a reasonable default implementation of a tagged string/expression was in fact perfectly adequate as a symtable - no changes needed there, and compatible with at least Jython and other alternatives that attempt to be highly compatible. However, we have added other attributes to the code object, and the recent support in 3.8 for the replace method at least provides backwards compatibility for programs doing this code introspection.

So let's see what is actually needed.

jimbaker commented 4 years ago

If I understand it correctly, f is an implementation of lambda lifting (https://en.wikipedia.org/wiki/Lambda_lifting). If we can figure out the semantics, it might make sense to have functools.lift as the inverse function (more or less) of functools.partial.

import types
import functools
from dataclasses import dataclass
from typing import Callable

@dataclass
class Quote:
    raw: str
    function: Callable

    def __call__(self, *args, **kwargs):
        return self.function(*args, **kwargs)

    # NOTE https://en.wikipedia.org/wiki/Lambda_lifting seems to be the prior
    # art here. However, "lift" instead of "lambda lift" also somewhat different
    # meaning in functional programming. By analogy to functools.partial, which
    # has similar concerns if we were to survey FP, we will just call it "lift".
    #
    # Another name might be "uncurry"
    def lift(self, *varnames: str) -> "Quote":
        """Lambda lifts varnames to the embedded lambda function, return a new Quote

        Any varname that is not used in the freevar is ignored, as it would be
        in any usual code.

        Because of the compilation, this is a relatively expensive operation in
        pure Python, so it should be cached (outside of this specific object of
        course). However, it should be possible to optimize this in a specific
        scenario, at least for CPython and Jython.
        """

        def param_list(names):
            return ", ".join(names)

        code = self.function.__code__

        # FIXME do more generic patching - it should be possible to do the
        # following for some function `f`:
        #
        # 1. extract co_freevars from f.__code__
        # 2. make a simple lambda that references these freevars
        # 3. use like below
        # 4. then patch back in the original code object
        #
        # Such a function could then be `functools.lift`, and then just used
        # here for the specific implementation in Quote.

        wrapped = f"""
def outer({param_list(code.co_freevars)}):
    def lifted({param_list(varnames)}):
        return (lambda: {self.raw})()
"""
        capture = {}
        exec(wrapped, self.function.__globals__, capture)

        # Essential for tracing out what is going :)
        # import dis; dis.dis(capture["outer"])

        lifted = types.FunctionType(
            capture["outer"].__code__.co_consts[1],
            self.function.__globals__)
        functools.update_wrapper(self.function, lifted)

        return Quote(self.raw, lifted)

def test_scope():
    x = 47
    q = Quote("x+1", lambda: x + 1)
    print(f"{q()=}")
    q_x = q.lift("x")  # lift x out of the free var
    print(f"{q_x(42)=}") 
    q_x2 = q_x.lift("x")  # ignore extra x in lifting again
    print(f"{q_x2(42)=}")
    try:
        # but we cannot lift the same variable twice in one lift, likely we want
        # some wrapper exception here
        q_xx = q_x.lift("x", "x")
    except SyntaxError:
        pass  # SyntaxError: duplicate argument 'x' in function definition
    q_xy = q_x.lift("x", "y")  # y is not used in the body, but that can be true of any function
    print(f"{q_xy(42, 99999)=}")

test_scope()

gvanrossum commented 4 years ago

Still wrapping my head around this. But don't you have the arguments to functools.update_wrapper() reversed?

jimbaker commented 4 years ago

Quite possibly, they don't actually change anything I believe in this more limited example!

I added this wrapping support, thinking I might attempt the more generic lift function, then decided to 1. Show intermediate work; 2. Take a long walk. :)

gvanrossum commented 4 years ago

Okay, you definitely reversed those arguments. I pushed a slightly altered version that also fixes this, as lifting.py.

Now I have a real question. Suppose we have x+y and y is a global. Should we consider y a free variable? (Globals include classes, functions, and anything imported.) What if y is a builtin?

I feel this is important once we start doing things like qmap{x+1, a}. How do we determine which is the free variable in qmap{n**x, a}, n or x? Or do we just make it x by convention? I would be okay with that if qmap was defined by some 3rd party package, but not if it was built in -- the use of x as the "unknown" is culturally determined and will look wrong in some contexts.

jimbaker commented 4 years ago

Now I have a real question. Suppose we have x+y and y is a global. Should we consider y a free variable? (Globals include classes, functions, and anything imported.) What if y is a builtin?

In the current example, we don't consider any globals (including builtins) as free variables. I believe we should keep that distinction. We could potentially add something like functools.pushdown, and do the same sort of symtable manipulation, but I don't know if that's really useful or not.

I feel this is important once we start doing things like qmap{x+1, a}. How do we determine which is the free variable in qmap{n**x, a}, n or x? Or do we just make it x by convention?

I see your point. I had assumed that qmap would be passed the desired variable to lift, but that's intended to be more of a convention/only one freevar in what you wrote earlier. It doesn't solve figuring out how to lift n**x.

Right now, I think the only real solution here is to be explicit and state the freevar(s) to lift:

qmap{x, n**x, a}

If we added some further syntactic sugar (here : is simply an alternative way of writing ,), maybe that could be

qmap{x: n**x, a}

In either case, qmap would be defined like so:

def qmap{qparams, qexpr, qarg}:
    arg = qarg()
    for x in arg:
        yield apply{qparms, qexpr, x}

I would be okay with that if qmap was defined by some 3rd party package, but not if it was built in -- the use of x as the "unknown" is culturally determined and will look wrong in some contexts.

Alternatively we could always use _. See anonymous functions in Scala for example: https://twitter.github.io/effectivescala/#Collections-Style (OK, maybe not, this is not the easiest part to understand about Scala - after 4 years of not actively using Scala, I have to think about it again myself). Still in this case it could work:

print(list(qmap{_+1, range(10)}))

Personally, I'm not so sure if that's such a good approach.

jimbaker commented 4 years ago

With respect to my earlier comment, this can be more straightforward and without extra syntactic sugaring, but with the same user syntax:

foo{x: x+1}

is the equivalent of

foo.__qcall__(Quote(lambda x: x+1, "x: x+1"))

With this modest modification, scope analysis works as expected:

x is a parameter of the quote, without needing lifting. Additional parameters are available, since this is just a normal lambda.
The raw string is still available for processing.
Free variables are analyzed correctly, with respect to global or nested scopes; local variables (including walrus operators), work as expected; function parameter variables providing a dividing line between the inner and outer scopes.

What I like about this approach is that it maintains the original idea of #4 of using existing Python scope analysis, and then using the resulting symbol table captured it in the function (and corresponding code object).

It is also still possible for someone to write qmap{x, n**x, a} (or permuted as desired). In current Python, I can write lambda: n**x and this will compile just fine without n or x being defined (and of course equivalently to a function defined with def):

>>> f = lambda: n**x
>>> dis.dis(f)
1           0 LOAD_GLOBAL              0 (n)
             2 LOAD_GLOBAL              1 (x)
             4 BINARY_POWER
             6 RETURN_VALUE

But calling f will fail if n and/or x are not defined as names.

jimbaker commented 4 years ago

With respect to tagged strings, it would be still desirable I think to have tag"{expr!conv:formatspec}" still be usable. My feeling is that the difference in syntax is different enough that this is fine. After all, : is rather overloaded in Python as it is, so we already use context to determine what it means.

gvanrossum commented 4 years ago

I agree we should also have tagged strings -- as a special case of quoted expressions.

I don't like foo{x: x+1} much -- if we're doing that we might as well try to come up with a shorter lambda notation. And : is really too overloaded.

The rest has to wait until my tendonitis calms down.

jimbaker commented 4 years ago

I agree we should also have tagged strings -- as a special case of quoted expressions.

+1 - I plan to go through tagged strings/numbers again, and re-express accordingly in terms of quoted exprs.

I don't like foo{x: x+1} much -- if we're doing that we might as well try to come up with a shorter lambda notation. And : is really too overloaded.

We could use arrow notation, somewhat similar to JavaScript, Scala, and other languages. So x -> x+1 is the same as lambda x: x+1. The idea here is that we use the analogy of a typed equivalent function with def, where we remove parens for the parameter list (as with lambda), and there is no need for : to introduce the function body suite (since inline). -> is therefore left to introduce the body in this scheme. (Alternatively we can use more of the syntax of double arrows as seen in JS.)

One advantage is that this notation can also support type annotations: x:int -> x+1. The return type can also be specified, if less elegantly: x:int -> (x + 1):int. However, many arrow functions can have their return types readily inferred from their parameter types, since branching logic tends to be rare in such code.

Such arrows could then be a more elegant way of typing anonymous functions (https://github.com/python/mypy/issues/4226) than what is necessary now, as we see with this workaround: https://stackoverflow.com/questions/33833881/is-it-possible-to-type-hint-a-lambda-function. In particular, such usage is not compatible with direct usage of the lambda, since it requires an assignment!

The rest has to wait until my tendonitis calms down.

Hope you feel better! I'm experiencing something similar in my right foot for the last 3 months. Slow, slow recovery.

gvanrossum commented 4 years ago

That stackoverflow item is ridiculous. The deleted/hidden answer actually got it right: the solution to typing a lambda is to not use a lambda but a def. The upvoted answers only work if you assign the lambda to a variable, making it not anonymous. IOW I don't think we should try to add types to anonymous functions.

jimbaker commented 4 years ago

That stackoverflow item is ridiculous.

Hah, that is true.

The deleted/hidden answer actually got it right: the solution to typing a lambda is to not use a lambda but a def. The upvoted answers only work if you assign the lambda to a variable, making it not anonymous.

Right.

IOW I don't think we should try to add types to anonymous functions.

Also, a more powerful type inferencer could do this work for the programmer, instead of requiring type annotations, given the fact that there is available context. (It's true of certain FP languages, but they are also very different than Python!)

So let's put aside type annotations for anonymous functions, and just focus on finding a possible better syntax for lambdas. Also it's possible I didn't solve it here, since do we need to consider noargs or arrows more generally outside of quoted usage? (If quoted, this can be ignored, it's just foo{a + b}, where a and b are of course freevars. Such is the advantage of defining new syntax!)

So let's assume we have arrows in general. Then we could write the following, ignoring equivalent construction with a list comprehension:

list(map(x -> x+1, range(20)))

which is the same as

list(map(lambda x: x + 1, range(20)))

But for noargs arrows, would we write it like so?

f(-> some_freevar+1)

Or the following, somewhat similar to JS?

f(() -> some_freevar+1)

(My preferred variant.)

Or just require lambda for this case?

f(lambda: some_freevar+1)

gvanrossum commented 4 years ago

Noargs functions seem somewhat important (in a sense every interpolation in an fl-string would be a noargs function).

I think we can make plain f(-> x+1) work, although it does look odd. Assuming a twoarg is written (x, y) -> x+y, maybe () -> 42 would be better.

But we seem to be getting off-topic. Why do we need a new syntax for lambdas again?

jimbaker commented 4 years ago

... Why do we need a new syntax for lambdas again?

We can avoid a new syntax as follows:

foo{a+1} is expanded to foo.__qcall__(Quote('a+1', lambda: a+1)), as above. This is a tagged expression, where foo is a tag.
fum{lambda x: x+1} is equivalent to fum.__qcall__(Quote('x+1', lambda x: x+1)). So the lambda keyword is only necessary when one wants to ensure that one or more names are parameters, instead of being free variables. Also the raw string is just the body of the lambda, since params are available from the code object. Alternatively this could be written as fum{x -> x+1} or bar{x, y -> x*y}, etc, if we do adopt arrow notation for at least quotes. This also avoids the awkwardness of foo{-> x}.
For tagged strings: knightly'My name is {name}' is equivalent to knightly.__qcall__(Quote(r'My name is {name}' , lambda: name)). Raw strings are always used for tagged strings, although they are unnecessary for tagged expressions/numbers (since they have to be valid Python!). We could alternatively have it be knightly.__qcall__(Quote(r'My name is {name}' , lambda: f'My name is {name}') to support likely the most common case, but lambda: name is sufficient to recover the freevars.
For tagged numbers: 1.1d is equivalent to d.__qcall__(Quote('1.1', None)).

For the above there are two other concerns, which should be addressed by separate issues:

We likely need some idea of a CallSite, which allows the binding to be one-time for a given function/module scope; this is what I dubbed thunks in #4. But I will put that idea in a separate issue. This is an optimization - it doesn't change the semantics of the above.
Tags should always be statically analyzable in terms of origin (defined in this module, or explicitly imported), as we discussed earlier with t-strings for i18n.

gvanrossum commented 4 years ago

I worry we've gone off the rails a bit. Let's look at use cases again and try to see the minimal feature set we'll need for them.

I'm aware of these use cases for tagged strings:

Delayed evaluation of f-like-strings for logging.info(fl"Label: {expensive_call()} etc."). Here we just need an object that renders a string when you call or str() it.
Translation, e.g. tagging strings so we can write translatable strings as _"Balance: {balance}, please pay by {date}". Here we need the full original string (for use as a key in the translation database), and a mapping from interpolation strings to values to be interpolated (but not yet converted to strings, in case the locale's conventions for formatting dates or money need to be followed).
Shell quoting, e.g. sh"git add {filename}" becomes ["git", "add", filename]. Here we need the string pieces and the interpolated values after fomatting. (I guess SQL quoting is similar.)

Now for quoted expressions the only use case so far that I've seen is a slightly more concise form of lambda, possibly with an implicit free variable. That seems hardly worth new syntax. But what makes this attractive would be if it was introspectable so that e.g. a numpy library could compile things like x + y if z > 0 else x - y into superfast machine code once it knows the types of arrays providing the values for x, y and z. (Or maybe it should be x[i] + y[i] if z[i] > 0 else x[i] - y[i].) But for that example, we probably don't need thunks or the original scope -- in fact this could be done today by just putting the expression in string quotes (the library could easily parse it into an AST and then walk that to compile it).

Apart from the tagged numbers (which are cool but also probably don't have many serious applications besides Decimal?), what problem are we trying to solve here? Is there really a unifying principle for tagged strings, tagged numbers and quoted expressions? The solutions we're looking at all look rather complicated and I'm not so keen to have a new anonymous function syntax that's just a few letters shorter than lambda.

Did I miss something exciting?

jimbaker commented 4 years ago

I worry we've gone off the rails a bit. Let's look at use cases again and try to see the minimal feature set we'll need for them.

I'm aware of these use cases for tagged strings:

Delayed evaluation of f-like-strings for logging.info(fl"Label: {expensive_call()} etc."). Here we just need an object that renders a string when you call or str() it.

Translation, e.g. tagging strings so we can write translatable strings as _"Balance: {balance}, please pay by {date}". Here we need the full original string (for use as a key in the translation database), and a mapping from interpolation strings to values to be interpolated (but not yet converted to strings, in case the locale's conventions for formatting dates or money need to be followed).

Shell quoting, e.g. sh"git add {filename}" becomes ["git", "add", filename]. Here we need the string pieces and the interpolated values after fomatting. (I guess SQL quoting is similar.)

Now for quoted expressions the only use case so far that I've seen is a slightly more concise form of lambda, possibly with an implicit free variable. That seems hardly worth new syntax. But what makes this attractive would be if it was introspectable so that e.g. a numpy library could compile things like x + y if z > 0 else x - y into superfast machine code once it knows the types of arrays providing the values for x, y and z. (Or maybe it should be x[i] + y[i] if z[i] > 0 else x[i] - y[i].)

+1, or for simplifying computations in general. It's can be challenging to avoid eagerness while still providing a nice API, and that can be costly if the intermediate computations are unneeded or can be simplified. If we can get it right, we can provide such libraries a straightforward and standard scheme to get laziness, while providing nice ergonomics.

What I like about this is that Python is widely used as a coordination language for scientific computing, data science, and ML. Quoted expressions can improve this support for coordination. Again, if we get it right :)

But for that example, we probably don't need thunks or the original scope -- in fact this could be done today by just putting the expression in string quotes (the library could easily parse it into an AST and then walk that to compile it).

So two concerns here:

Many popular libraries avoid using string quoting, preferring to work with expression trees, such as the indexing seen in Numpy: x[x > 5], where x is an array.
Even if the library were to support string quoting, it needs to do scope analysis to find the variables in the callee scope that is using the string quoting. From what I have seen, this means supporting dynamic scoping via sys._getframe, instead of lexical scoping.

I looked at the following popular libraries that work with complex expressions in some interesting way, generally by parsing and/or building a computation graph. I did this by looking for usage of sys._getframe (given the absolute size of these codebases!):

Numpy - I would have thought it possible for Numpy to implement its indexing support without sys._getframe, but that is not the case. See https://github.com/numpy/numpy/blob/master/numpy/lib/index_tricks.py#L316. It's also used in other places for similar integration needs, it seems.
Pandas - Uses scope analysis extensively, and with a fair amount of sophistication: https://github.com/pandas-dev/pandas/blob/master/pandas/core/computation/scope.py. But not without at least some issues: https://stackoverflow.com/questions/47138394/scipy-optimisation-crashes-with-pandas-error I would have to dive more into Pandas to better understand what it does here.
PySpark - used via Pandas
PyTorch - no use of the frame except from C for its "tape execution" model (rather core part of its autodifferentiation support). This is reasonable, and certainly doesn't have anything to do with scope analysis. I find the lack of scope analysis interesting because PySpark effectively clones the API of Numpy, but it does in such way that allows for significant "reflection" - so as to optimize running on CUDA, etc. These two aspects are no doubt related!
Sage - I don't quite understand the computation model that Sage is supporting, with respect to get_main_globals, however, this is confined to a specific package supporting working with the Simplex algorithm. Otherwise minor usage.
SqlAlchemy - not used

So if quoted expressions are going to be useful, the best targets would seem to be to look at Numpy and/or Pandas, ideally in some simplified means so that it can be discussed without saying "understand the source code of this large codebase."

Apart from the tagged numbers (which are cool but also probably don't have many serious applications besides Decimal?),

FWIW - the only usage in Python stdlib for decimal literals is in tests. But surely they are used elsewhere? Interestingly, when I looked at the top 4000 PyPI packages plus related libraries (strictly looking at package names, since metadata is not provided), I found numpy-financial, stripe-python (payment processing), yfinance (Yahoo finance API), and starkbank. Only numpy-financial supported decimals at all, and decimal literals were only used in tests. Decimals should be supported by such packages, but it might not be the highest priority either.

Most importantly - the CallSite stuff I mentioned seems only applicable to decimal support in tight loops. And local assignment fixes any performance gaps if that were the case (so it becomes LOAD_FAST instead of LOAD_CONSTANT). Right now, this is YAGNI.

what problem are we trying to solve here? Is there really a unifying principle for tagged strings, tagged numbers and quoted expressions? The solutions we're looking at all look rather complicated and I'm not so keen to have a new anonymous function syntax that's just a few letters shorter than lambda.

Let's not do arrows with this work, since they are not necessary. Quote does seem unifying for tagged strings and quoted expressions.

Did I miss something exciting?

Not yet, more work needs to be done here! I think it might be worthwhile to write the core of a simplify qfunc for array indexing, just to see if that would be useful or not with the proposed syntax and dunder protocol.

gvanrossum commented 4 years ago

Oooh, I realize I'm totally out of my depth here, never having looked at any of those libraries (well, a brief failed encounter with pandas excluded). If you want me to participate you're going to have to pick examples that don't use any of those libraries -- surely the same principles can be applied to other domains?

I'm imagining the places where quoted expressions are more useful than the typical approach that builds an expression tree using operator overloading in situations where operator overloading doesn't exist, e.g. x < 1 or x >= 2 (which has to be written as (x < 1) | (x > 2) and 1 <= x < 2 (similar). That seems pretty minor (people are used to those rewrites by now) but perhaps we can offer better ergonomics because there's no need to use a variable of a special type to start the overloading (e.g. if numpy supports x[x < 1] then somehow the numpy type of x must have a __lt__ method that returns an expression tree). We can do this for any variable, and we can include a function that retrieves its value. (However, there must be something I'm missing -- the numpy docs indicate that x < 1 returns another numpy array with the elementwise boolean result of the comparison.)

Hm, I know that my first draft of Quote just has a string and a lambda, e.g. Quote{x+1} becomes Quote('x+1', lambda: x+1). (Roughly.) But maybe we can include a dict mapping variables to lambdas, e.g. Quote{x+y+1} somehow incorporates {"x": lambda: x, "y": lambda: y} (it could still have something that calls lambda: x+y+1 as well). There could even be metadata pointing to the right frame? (Hm, probably not, though perhaps it could name the function or something.)

PS. I did look at the getframe usage in numpy you linked to and it actually looks like it's only used when the array is indexed with a string (see docs). This is implemented using numpy.bmat, which also uses getframe. FWIW I couldn't follow the pandas usage at all.

jimbaker commented 4 years ago

Oooh, I realize I'm totally out of my depth here, never having looked at any of those libraries (well, a brief failed encounter with pandas excluded). If you want me to participate you're going to have to pick examples that don't use any of those libraries -- surely the same principles can be applied to other domains?

Another thought: look at C# (expression trees, trying to find the best docs) and Julia (https://docs.julialang.org/en/v1/manual/metaprogramming/), which both have robust support for metaprogramming to do interesting things. This keeps it closer to the language level, vs diving into complex libraries. For me personally, I'm just an occasional user of some of these libraries myself, not a developer of them!

I'm imagining the places where quoted expressions are more useful than the typical approach that builds an expression tree using operator overloading in situations where operator overloading doesn't exist, e.g. x < 1 or x >= 2 (which has to be written as (x < 1) | (x > 2) and 1 <= x < 2 (similar). That seems pretty minor (people are used to those rewrites by now) but perhaps we can offer better ergonomics because there's no need to use a variable of a special type to start the overloading (e.g. if numpy supports x[x < 1] then somehow the numpy type of x must have a __lt__ method that returns an expression tree).

That would be the optimized way to do it...

We can do this for any variable, and we can include a function that retrieves its value. (However, there must be something I'm missing -- the numpy docs indicate that x < 1 returns another numpy array with the elementwise boolean result of the comparison.)

... but this is how it is actually done in Numpy, with eager evaluation. So it's less than ideal, and results in the behavior we see of requiring writing expressions like (1 < x) & (x < 10) - and evaluating the boolean arrays 1 < x, x < 10, and then their combination with &.

Hm, I know that my first draft of Quote just has a string and a lambda, e.g. Quote{x+1} becomes Quote('x+1', lambda: x+1). (Roughly.) But maybe we can include a dict mapping variables to lambdas, e.g. Quote{x+y+1} somehow incorporates {"x": lambda: x, "y": lambda: y} (it could still have something that calls lambda: x+y+1 as well). There could even be metadata pointing to the right frame? (Hm, probably not, though perhaps it could name the function or something.)

Maybe, I need to wrap my head around this. In general, I think we should assume a minimal Quote setup, then determine methods on it such as subbing in code around variables (what I was using callbacks for earlier). Right now everything seems to go back to: we need the raw text (because we don't want to do inspect), and we need the symtable.

PS. I did look at the getframe usage in numpy you linked to and it actually looks like it's only used when the array is indexed with a string (see docs). This is implemented using numpy.bmat, which also uses getframe. FWIW I couldn't follow the pandas usage at all.

Right, Pandas has some amazing capabilities, but it feels quite challenging to work through.

jimbaker commented 4 years ago

So a few more thoughts here:

Quote seems to be generally useful, and compares nicely to Julia's concept of quote and C# of Expr.
Call-by-name via implicit quoting with f{a, b} is powerful. But I don't see the need for def f{a, b} - we could just have a regular function that implements a __qcall__ method, then resolves any arguments as needed by calling them.
However, another crazy idea - what if we supported quoted functions?

So this might look like:

quoted def qf(a, b):
    return a*b + 1

So far this is just like saying Quote(r'...', qf), except I can do this for a regular function (or an async or generator), have it decorated (although such decorators would need to be Quote-aware), and get scope analysis applied to it. Let's continue this idea:

def g(left, right):
    quoted def qf():
        return {left} + {right}
    return qf

Now the thought here is that maybe I could call g{left_expr, right_expr}, and this could sub these expressions into qf (very carefully of course!) and build up a larger function. Maybe it should be tagged with some h function to describe the careful substitutions (or other transformations):

def g(left, right):
    quoted def qf():
        return h{left} + h{right}
    return qf

As an idea, it's not terribly worked out, other than to say it vaguely resembles quasiquotes in Lisp. Why wouldn't I just use exec on some f-string template, like is already done (either with examples like lifting or https://github.com/python/cpython/blob/master/Lib/dataclasses.py#L377). Maybe I need some sort of looping or recursion to build the template? But I feel like there's something here. TBD.

gvanrossum commented 4 years ago

I'm sorry, I lost track of what you mean with quoted def. And your example uses {left} and {right} which to me look like sets of one element. What operation do you exactly intend that to do? And what would the + do?

The resemblance with Lisp quasiquotes is probably intentional -- it's been forever since I looked at Lisp or Scheme but I recall something that passes an s-expression unchanged rather than treating it as an expression to evaluate. Maybe '(foo bar)?

I can't find much about Julia or C# (Julia seems to be focused on on defining functions that run while the compiler is compiling -- IIRC we (i.e., I :-) rejected that idea earlier.

gvanrossum commented 4 years ago

Okay, in order to make this more concrete, I started a branch where we can work on implementing f{args}.

https://github.com/gvanrossum/cpython/tree/quoting

So far it only supports calls with exactly one argument, and it translates that argument into the stringified expression (so no Quote object yet).

Proof it works:

>>> def foo(arg):
...   print("The argument is", repr(arg))
... 
>>> foo(2+2)
The argument is 4
>>> foo{2+2}
The argument is '2 + 2'
>>> foo{x+1}
The argument is 'x + 1'
>>>

The next challenge will actually be changing it so that instead of calling f('x+1') it calls f(Quote("x+1", lambda: x+1)). I'm not sure how to synthesize that lambda yet. But it's probably a simple matter of creating a new AST node that starts with lambda: and ends with the argument.

I figure extending this to support multiple positional arguments will be straightforward. I'm not sure what to do with f{*args} so for now that's an error. I think I'll leave keyword arguments for much later (say when we're writing a PEP :-).

jimbaker commented 4 years ago

I'm sorry, I lost track of what you mean with quoted def. And your example uses {left} and {right} which to me look like sets of one element.

Yikes, mixing in sets obviously wouldn't work so well! :) I was a bit too close to thinking about f-strings when I was writing down the example. And without {...} support, I need to do an explicit variable substitution.

What operation do you exactly intend that to do? And what would the + do?

Addition of two expressions.

So let's just assume it's like so (again, just trying out ideas here):

x = 2
y = 3

def make_sum(left_expr: Quote, right_expr: Quote) -> Callable:
    quoted def qf(factor: int) -> int:
        return (left + right) * factor
    # As used here, Quote.sub inserts in the expr so that it's like
    # ... left() + right() ..., as opposed to requiring
    # it to be called. There are probably some interesting alternatives
    # that could support recursion. TBD.
    return qf.sub(left=left_expr, right=right_expr)

f = make_sum{x+1, x*y}
print(f(2))

This is roughly equivalent to writing

x = 2
y = 3

def make_sum(left: str, right: str) -> Callable:
    code = f'def qf(factor):\n return (({left}) + ({right})) * factor'
    ns = {}
    exec(code, globals(), ns)
    return ns['qf']

f = make_sum("x+1", "x*y")
print(f(2))

except the following hold:

left and right are syntactically correct Python expressions once they are wrapped as Quote objects. Such quotes can be constructed statically, or dynamically with with eval or exec.
Scope analysis works with these expressions. So quoted functions could be used to build the outer scope part of the lambda trick. (I need to spend more time on this idea.)
As an implementation detail, a Quote object might retain the AST, to speed up the composition (assuming this is in fact the case). For example, if the expression is just a body, it can directly sub that in, no call required.

I need to work out how this can potentially help with some of the metaprogramming seen in the stdlib.

The resemblance with Lisp quasiquotes is probably intentional -- it's been forever since I looked at Lisp or Scheme but I recall something that passes an s-expression unchanged rather than treating it as an expression to evaluate. Maybe '(foo bar)?

'(foo bar) does in fact quote this expression, so the function foo is not evaluated with respect to its arg bar. Probably easier to read it this way: '(+ 1 2) does not perform the addition. So this more or less equivalent to saying lambda: 1+2 in Python.

A quasiquote mixes evaluation in by using ,:

`(+ 1 2 ,(* 3 4))

is equivalent to

'(+ 1 2 12)

So the quasiquote like quality here is not as much as I initially intended, because we don't have any special syntax to help here - we need the explicit substitution provided by Quote.sub (a new method).

I can't find much about Julia or C# (Julia seems to be focused on on defining functions that run while the compiler is compiling -- IIRC we (i.e., I :-) rejected that idea earlier.

:) Hah, I'm absolutely fine with that restriction. Julia and C# can give us some ideas, but neither are a dynamic language like Python.

jimbaker commented 4 years ago

Okay, in order to make this more concrete, I started a branch where we can work on implementing f{args}.

https://github.com/gvanrossum/cpython/tree/quoting

Nice, I will try it out!

So far it only supports calls with exactly one argument, and it translates that argument into the stringified expression (so no Quote object yet).

Proof it works:
>>> def foo(arg):
...   print("The argument is", repr(arg))
... 
>>> foo(2+2)
The argument is 4
>>> foo{2+2}
The argument is '2 + 2'
>>> foo{x+1}
The argument is 'x + 1'
>>> 
The next challenge will actually be changing it so that instead of calling f('x+1') it calls f(Quote("x+1", lambda: x+1)). I'm not sure how to synthesize that lambda yet. But it's probably a simple matter of creating a new AST node that starts with lambda: and ends with the argument.

I figure extending this to support multiple positional arguments will be straightforward. I'm not sure what to do with f{*args} so for now that's an error. I think I'll leave keyword arguments for much later (say when we're writing a PEP :-).

Yeah, got to get a few things figured out before then!

gvanrossum commented 4 years ago

New version (same branch). This translates the quoted arg into a tuple (unparsed, function). Example:

>>> def f(a):
...  u, f = a
...  print(u)
...  return f()
... 
>>> f{2+2}
2 + 2
4
>>> x = 40
>>> f{x+2}
x + 2
42
>>>

UPDATE: Now supporting multiple arguments:

>>> def f(*a):
...  for u, f in a: print(u, '=', f())
... 
>>> x = 40
>>> f{x+2, x*2, x**2}
x + 2 = 42
x * 2 = 80
x ** 2 = 1600
>>>

gvanrossum commented 4 years ago

This leaves out the Quote class; for now I think a tuple will work fine.

It also leaves out turning f{...} into f.__qcall__(...).

I think however that we can now build working prototypes, e.g. qmap{f(x), x}. Coming up next!

gvanrossum commented 4 years ago

So this works, sort of:

def qmap(func, seq):
    func_s, func_f = func. # E.g. 'x+1', lambda: x+1
    seq_s, seq_f = seq. # E.g. 'x', lambda: x
    new_lambda = f"f = lambda {seq_s}: {func_s}"
    # E.g. new_lambda = "f = lambda x: x+1"
    ns = {}
    exec(new_lambda, func_f.__globals__, ns)
    f = ns["f"]
    # E.g. f = lambda x: x+1
    for x in seq_f():
        yield f(x)

def main():
    x = [1, 2, 4, 8]
    print(list(qmap{x+1, x}))

main()

However it doesn't work if the first arg uses any locals from main, since it just ends up calling exec() using the __globals__ of the first lambda. This suggests that we need more transformations on AST.

Maybe the lambda I synthesize in the parser (see the branch, not this little example) should actually get arguments whose defaults capture the values of the corresponding variables? E.g. f(x) would not become lambda: f(x) but lambda f=f, x=x: f(x). So we can still call it as func() and get the original computation (though with values captured early!), but we could also call it as func(x=42) which would compute f(42).

But if we do this we lose the ability to use undefined variables, since defaults capture the values early. We could solve that problem by turning the defaults into lambdas themselves, so that f(x) would translate to the (unintelligible :-( ) expression lambda f=(lambda: f), x=(lambda: x): (f())(x()) and if we wanted to call it with an argument we'd have to lambdify that argument: func(x=lambda: 42).

That honestly looks horrible, so I think maybe we should be okay with early capture. What do you think?

jimbaker commented 4 years ago

.. That honestly looks horrible, so I think maybe we should be okay with early capture. What do you think?

I have to think about early capture. But in general, I think of the lambda as a very convenient symbolic table, where we don't have to modify anything deep about compilation to get this quoting to work. So I'm less concerned about what it looks like (this is just more nested) so long it's serving a purpose.

I quickly tried out a variant of selection for the Numpy scenario, in this case where Python does all the actual work of evaluating the select expression, logical short cutting included and nested scopes compatible. It's pleasingly simple, but I need to play with some more rewriting now that this is possible.

from functools import total_ordering

@total_ordering
class obj:
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if hasattr(other, 'value'):
            return self.value == other.value
        else:
            return self.value == other

    def __le__(self, other):
        if hasattr(other, 'value'):
            return self.value <= other.value
        else:
            return self.value <= other

    def __repr__(self):
        return f'<obj {self.value}>'

    def __call__(self, expr):
        print(f'{self=!r}, {expr=}')
        selected = expr[1]()
        return self if selected else None

def f(value):
    x = obj(value)
    def g(a, b):
        return x{a <= x < b}
    return g

g = f(10)
print(g(1, 11))

gvanrossum commented 4 years ago

Nice!

One thing that now worries me: curlies look too much like plain old parentheses, and I had a heck of a time finding the qcall in your example. :-(

jimbaker commented 4 years ago

One thing that now worries me: curlies look too much like plain old parentheses, and I had a heck of a time finding the qcall in your example. :-(

There's other possible syntax, but likely we will be able to better see the curlies once we are used to them being used. My thinking is that otherwise, it would be hard to distinguish {1, 2, 3, 4} from (1, 2, 3, 4) - or similar visible token differences between like set comprehensions and generator expressions. Pragmatically, this is not the case for developers.

I'm too tired today from being on-call this week, but a couple of things:

There's enough of a foundation now to implement a working example of #7, so I will look into that with respect to Numpy,
Likewise, I would like to see tagged strings support with quoting. and there's enough guidance I think with your branch that I can implement that part too. I wanted to touch the new PEG parser, so now there's a good excuse to do so.

gvanrossum commented 4 years ago

Warning: f-strings are a challenge even for the PEG parser.

Tagged strings could be prototyped as ‘NAME STRING+’ in the grammar (allowing a space between, but that’s fine in a prototype) but you’d still have to refactor the f-string code (a bit) to parse the interpolations.

jimbaker commented 4 years ago

Heads up: I continue to be swamped with work, but I completed some big chunks as of today. I also have an InDay I can dedicate to the quoting work on Friday and look at expression rewriting and PEG parsing per your note above. I will then be taking vacation the following week, and then hopefully a better cadence going forward.

gvanrossum commented 4 years ago

No worries, I am working on other things as well. Enjoy your vacation!