iris-hep / func_adl

Construct hierarchical data queries using SQL-like concepts in python
MIT License
7 stars 4 forks source link

Cannot use imported module names in lambdas #127

Closed masonproffitt closed 1 year ago

masonproffitt commented 1 year ago

These lines don't check that a variable actually represents a constant before replacing a Name node with a Constant node:

https://github.com/iris-hep/func_adl/blob/89713708d8d1dc7234a38d78a6961170f4d26dca/func_adl/util_ast.py#L12-L15

For Python versions before 3.8, some type checking is done:

https://github.com/iris-hep/func_adl/blob/89713708d8d1dc7234a38d78a6961170f4d26dca/func_adl/util_ast.py#L17-L29

but in either case, this behavior causes issues here:

https://github.com/iris-hep/func_adl/blob/89713708d8d1dc7234a38d78a6961170f4d26dca/func_adl/util_ast.py#L335-L336

when v does not represent a literal. This causes problems for the Uproot backend, where I allow references to modules like np and ak. For example, I might use something like ds.Select(lambda event: np.cos(event.MET_phi)). That works fine as long as you haven't imported the relevant module in the source where the lambda is (since there is no potential capture), but it breaks if that module has been imported. If the surrounding code needs that module, the only solution is to go back to passing the lambda as a string.

I would argue that the correct behavior here is for the unmodified Name node to be passed through if the captured variable is not a constant (maybe with a warning or even just a debug log message). That's consistent with what happens if the name hasn't been defined outside of the lambda. This leaves it up to the particular code generator how to handle any apparently undefined Name, which basically allows each backend to set its own reserved keywords.

masonproffitt commented 1 year ago

Just to add that I also get this warning (before the invalid ast.Constant node raises an error) for something like np.cos:

WARNING  func_adl.type_based_replacement:type_based_replacement.py:688 Method cos not found on object <class 'module'>

I guess func_adl is trying to find the method in the module class rather than in the actual module (numpy)?

gordonwatts commented 1 year ago

If you ask for a module that is not known in ServiceX, it should error out during code-generation and that should get the error back to the user.

gordonwatts commented 1 year ago

Hmmm... In python 3.7 I can see the error being raised in the else statement as you point out, @masonproffitt . However, if 3.8 on, I do not get an error. Though I would imagine the qastle is probably bad. Above here it seems to say that you would expect an exception thrown even if you are using 3.10 or 3.11. In short, I'm worried I am not reproducing your error. Could you take a look at the PR that is linked here and look at the added tests and tell me if they are covering this? But work in 3.10, 11, and 12.

Second question. What is the proper behavior when you see "np.cos(xxx)"? A name reference w.r.t. "np" or "numpy"? If the user has imported numpy as np, then we can see that. The reason I ask is I would think we'd not want to depend on something weird the user has done "import numpy as nnp" (or something). So when we have the import, perhaps we should do "numpy".

@masonproffitt

masonproffitt commented 1 year ago

Right, the error I get is actually from when I try to deepcopy the AST; I get TypeError: cannot pickle 'module' object. But the real problem is that the Python AST is not valid (a non-literal, non-ast object like a module should never be inside a Python AST).

As for the name, it should pass whatever name is used in the lambda (so np in the np.cos example). Basically it should be oblivious to what has been imported or not, because we have no way of actually passing a module to the backend (maybe that could be a future project, but that's major work).

gordonwatts commented 1 year ago

Ok - got it: A few things:

gordonwatts commented 1 year ago

@masonproffitt - Ok - I've updated the PR to just transmit the name. Please take a look. But, please also let me know what you think about the above proposal.

masonproffitt commented 1 year ago

Instead, I think the code that looks at the constant should note that it is a module (you can do that, I saw it in the debugger) and then substitute with a transformer approved module name. We could have a mapping or we could have the full name, no matter.

I think this is more than we should really do. I don't want to give the impression that arbitrary numpy or awkward functions are allowed in queries--many if not most will result in very confusing errors or completely wrong results. The intention is that there's a very limited set of functions that should ever appear (only allowing the math module might actually be less confusing, but I figured people are probably most used to using np or ak versions of math functions). I view something like np.cos as one whole reserved keyword. I believe this is similar to what func_adl_xAOD does with things like sqrt.

Basically, I think the right thing to do for now is just to document what functions are allowed. This already done in func_adl_xAOD's README; I can do something similar for func_adl_uproot.

gordonwatts commented 1 year ago

Ok - so leave the modification as it is now, then (modulo your comment, which I think until we have a real example, we might ignore).

SHould we white-list things? That can be a separate bug/feature.