nteract / papermill

📚 Parameterize, execute, and analyze notebooks
http://papermill.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
6k stars 429 forks source link

Parameter parsing fails for strings containing `=` character #772

Open nateharms opened 10 months ago

nateharms commented 10 months ago

Hello there! I'm using Papermill to automate some Jupyter notebook execution and I think I may have caught a potential bug: when running Papermill using strings containing a = character as the default parameter, the parser fails.

I get an error that looks like the following, but anonymized:

Unable to parse line X 'parameter='./all/date=20240102/ml_flow_run_id=Y/model/''.

After a little digging, I think this is because of the following code in translators.py:


if nequal > 0:
   grouped_variable.append(flatten_accumulator(accumulator))
   accumulator = []
   if nequal > 1:
       logger.warning(f"Unable to parse line {iline + 1} '{line}'.")
       continue ```

Since my parameter has = characters in it, nequal > 1 == True and we run into the warning.

Is this a desired behavior? i.e., is it expected that there should not be any = within default strong parameters? Or are there plans to enable parameterization of strings containing = as defaults?

Thank you!

MSeal commented 9 months ago

For now it's a constraint of the implementation it seems. We could improve that '=' check to check after doing quoted string capture groups to get around it. But someone would need to write a fairly good suite of unittests to ensure it works correctly in all cases (harder problem to get right than it may seem).

FlorianBracq commented 7 months ago

Hello,

would it be possible to use the built-in ast package to parse the python code? https://docs.python.org/3/library/ast.html

I saw some comments from ~2020 about this not being compatible with other languages, but I would argue it's a safer approach than the regex version and would catch corner cases that are not yet properly supported by the regex, such as the one shared here or if the type provided contains a "." (cf sample below)

Sample parameter cell

import datetime as dt

string_param: str = ""
equal_param = "./all/date=20240102/ml_flow_run_id=Y/model/"
multi_line_param: str = "test1" + "test2"
date_param: dt.datetime = dt.datetime(2024, 4, 17, 0, 0, 0)
a = b = 2
list_param = []
dict_param = {
    "key1": "value",
    "key2": [1, 2, 3],
    "key3": "=",
}

Sample parsing code:

# param_code is a variable containing the above cell's source code
tree: ast.Module = ast.parse(param_code)
for node in ast.walk(tree):
    if isinstance(ann_assign := node, ast.AnnAssign) and isinstance(
            name := ann_assign.target, ast.Name
        ):
            print(name.id)
    elif isinstance(assign := node, ast.Assign):
        for target in assign.targets:
            if isinstance(target, ast.Name):
                print(target.id)

I get the following output:

string_param
equal_param
multi_line_param
date_param
a
b
list_param
dict_param

@MSeal: if that would be of interest to you, I can propose a PR for this :)