vatlab / sos

SoS workflow system for daily data analysis
http://vatlab.github.io/sos-docs
BSD 3-Clause "New" or "Revised" License

Global Python functions no longer work? #49

Closed: gaow closed this issue 8 years ago

gaow commented 8 years ago

After updating to the current SoS, my previous code no longer works:

def get_md5(value):
    import sys, hashlib
    base, ext = value.rsplit('.', 1)
    res = hashlib.md5(base.encode('utf-8')).hexdigest() if sys.version_info[0] == 3 else hashlib.md5(base).hexdigest()
    return '{}.{}'.format(res, ext)

[simulate_1: alias = 'simulate']
n = [1000]
true_mean = [0, 1]
output_suffix = 'rds'
input: for_each = ['n', 'true_mean']
output: pattern = 'exec=rnorm.R::n=${_n}::true_mean=${_true_mean}.${output_suffix}'
_output = [get_md5(x) for x in _output]
run: ...

The error message is:

ERROR: Failed to execute subworkflow: Failed to assign [get_md5(x) for x in _output]
 to variable _output: name 'get_md5' is not defined

What is the proper way to use custom Python functions with the current SoS?

BoPeng commented 8 years ago

This is a namespace issue that I am trying to address. We now have SoS script locals (for SoS variables), SoS script globals, and SoS command locals and globals, and we should also handle the namespaces of spawned step processes. The first two belong to the SoS script and the last two to the SoS runtime. Previously these namespaces were mixed, which meant that malicious user-defined code could interfere with how SoS works (e.g. replace the run action). The updated code tries to separate them, but get_md5 is now defined in the SoS script locals while being called in the SoS runtime locals.


gaow commented 8 years ago

I see. Will this syntax remain the same (so I should wait for an SoS update), or is there an alternative syntax I can use to achieve my purpose? I assume moving the function inside an SoS step is not a good idea, because it is used in multiple steps.

BoPeng commented 8 years ago

I have also disabled writing to the _step object directly, made variables defined in the global definitions read-only, and I may have disallowed changing _input, _output, and _depends from the SoS script. As useful as it might be, allowing SoS scripts to change these variables can be really troublesome. For example,

[parameters]
var = a

uses a default value defined in the global section. If a can be changed, the default value of this variable can change with it. This is problematic when the workflow is used as a nested workflow, where another workflow might change a accidentally and cause unpredictable results.

The same logic applies to _step and its aliases. If we allowed changing alias.output, the same step might behave differently depending on the execution of another step (one that changes an existing alias), so theoretically speaking we could not execute any steps in parallel.

BoPeng commented 8 years ago

I am working on this. I have left it unfixed for now because I need to set up sensible rules about what can and cannot be accessed, and in what way, from within SoS steps... Your code should work, though.

More specifically, SoS currently executes these functions (e.g. run) in its own namespace because they are not available in the SoS namespace. What we need to do is inject only the ones needed by the SoS script into its namespace before the script is executed, and evaluate all function calls in the SoS script namespaces. That way, SoS itself will not be affected by malicious scripts.
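
For illustration, something along these lines (run_action and the dictionaries below are stand-ins, not actual SoS code):

# stand-in action; real SoS actions such as run live in the SoS runtime
def run_action(script):
    print('would execute:', script)

runtime_actions = {'run': run_action}

# inject only the whitelisted names into the script's namespace, then execute
script_ns = dict(runtime_actions)
exec("run('echo hello')", script_ns)

# a hostile rebinding only touches the script's copy of the name
exec("run = None", script_ns)
assert callable(runtime_actions['run'])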

gaow commented 8 years ago

These updates are reasonable once -j comes into the picture. Currently my code does have a line that changes _output directly, though. What if we need to adjust the names of an output within a step?

BoPeng commented 8 years ago

I believe updating _output yields a warning now. The idea is that we should not need to manipulate _output directly if the syntax of the step output is strong enough. We will see.

gaow commented 8 years ago

Indeed. Are you suggesting designing options that allow flexible post-processing of these file names at the end of the input: and output: specifications?

BoPeng commented 8 years ago

Yes. The easiest and most general solution is to allow the specification of a user-defined function that takes _output and returns the massaged filenames. Any suggestion on a parameter name, or another solution?

BoPeng commented 8 years ago

I have added a test case (TestExecute.testUserDefinedFunc), but it works. Could you add a test case that demonstrates the problem? Does it perhaps involve a subworkflow?

gaow commented 8 years ago

Using another named parameter might be straightforward, and a rename = might do if all we do is renaming. But that would restrict the renaming to one line, so perhaps users need to provide customized functions; e.g., my script could be written as:

output: pattern = 'exec=rnorm.R::n=${_n}::true_mean=${_true_mean}.${output_suffix}', rename = [get_md5(x) for x in _output]

and is it OK to use _output here?

BoPeng commented 8 years ago

By general I meant the function should also be able to remove some output files. And no, I do not think it is good to use _output here. The usage should be something like

rename=get_md5

or

rename=lambda x, **kwargs: get_md5(x[0])

if get_md5 cannot handle multiple names.
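
To illustrate the intended calling convention, here is a hypothetical sketch (apply_rename is made up, not SoS API): the callback receives the whole _output list and returns the, possibly shorter, renamed list, so it can also drop files.

# hypothetical helper, not SoS API: apply a user callback to a step's outputs
def apply_rename(rename, outputs, **kwargs):
    res = rename(outputs, **kwargs)
    return res if isinstance(res, list) else [res]

# per-name functions such as get_md5 are adapted with a lambda, as above
apply_rename(lambda x, **kwargs: [get_md5(f) for f in x],
             ['exec=rnorm.R::n=1000::true_mean=0.rds'])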

gaow commented 8 years ago

I just updated the test. It turns out to be an issue with list comprehensions: if I put [myfunc() for x in range(5)][0] instead of just myfunc() as in the test, it fails.

BoPeng commented 8 years ago

Oh, this is a nested namespace issue. myfunc needs to be in the globals namespace for it to be seen by the list comprehension, but it is currently imported into the locals namespace.
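
A minimal reproduction of the behavior (on the Python 3 of this era; list comprehensions only gained inlined scoping in Python 3.12):

gdict, ldict = {}, {'myfunc': lambda: 1}

# a plain call resolves myfunc through the locals dict
eval('myfunc()', gdict, ldict)                       # returns 1

# the comprehension body runs in its own scope, whose lookups fall through
# to globals and builtins only, skipping ldict
eval('[myfunc() for x in range(5)]', gdict, ldict)   # NameError: myfunc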

BoPeng commented 8 years ago

Just to use this thread to summarize the namespace rules:

  1. SoS will be executed in its own namespace. Everything will be evaluated or executed with its own locals and globals. This makes sure that SoS itself is not affected by the SoS script.
  2. The SoS script will have two namespaces: a global namespace and a local namespace that is used to execute statements.
  3. The SoS global namespace consists of
    • SoS actions and functions injected by SoS when the script is executed
    • All global definitions
    • Variables exposed by other steps (alias etc.)
  4. The SoS local namespace consists of step-specific variables (_step, _input, ...). The local namespace will be cleared after the execution of each step.
  5. Spawned processes will get a copy of the local namespace (with _input, _index, etc.) because these differ from process to process.

Pending problems:

  1. The most difficult part is the global namespace, because it contains modules and function objects that cannot be transferred. My suggestion is that
    • We re-run the global definitions in the spawned process to re-import modules and re-define functions.
    • We transfer pickleable objects to the spawned process (see the sketch after this list). Because we require global definitions to be read-only, there should be no conflict between these two steps. We could also transfer ONLY StepInfo objects (aliased step information), because right now a step only leaves such information in the global namespace.
  2. A potential problem is with functions defined in an SoS step. These should be kept in the step's own local namespace. However, when we evaluate a nested function in a step, it only sees the function's own namespace (locals), its parent function's namespace (nested namespace), and the global namespace. This means the locally defined function will not be seen. I am not sure how to handle this problem right now.
  3. Nested namespaces: there will be multiple global sections, and in theory we should either
    • merge global sections during parsing and let the result become the new global section of the new workflow, or
    • assign a global section to each section so that sections have their own global definitions. To make sure global sections from different sources are executed, these definitions are currently executed for each step (which is a bit wasteful).
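
For the pickling point above, a sketch of how the transferable subset could be computed (the helper name is made up):

import pickle

def pickleable_subset(ns):
    # keep only entries that survive pickling; modules and dynamically
    # defined functions typically fail and must be re-created by re-running
    # the global definitions in the spawned process
    out = {}
    for key, value in ns.items():
        try:
            pickle.dumps(value)
            out[key] = value
        except Exception:
            pass
    return out
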
gaow commented 8 years ago

Great summary! I can change my implementation to not use a list comprehension, but how do I make the function global instead? From reading the summary I still have not figured out how it would work in practice if I stuck to my current syntax (again, I can change it; I am just curious).

BoPeng commented 8 years ago

The problems are not resolved. We could use a single global namespace for everything, but the problem is that if step_15 can use a function or variable defined in step_10, it cannot be executed before it. Right now the DAG will be set up purely from file dependencies, so we have to make functions and variables defined in steps local; but then function calls within a step face the triple namespace problem and might have trouble accessing locally defined functions.

gdict = {'a': 1}
ldict = {'b': 2}

# this works: top-level name lookup checks the locals dict first
eval('a+b', gdict, ldict)

# a function body only sees 'a' (globals) and its own local namespace
exec('''
def func():
    print(locals().keys())
    print(globals().keys())
func()
''', gdict, ldict)

# so this fails with NameError: name 'b' is not defined
exec('''
def func():
    print(b)

func()
''', gdict, ldict)

A seemingly feasible way to address this problem is to insert ldict into a nested namespace of func so that it can be accessed, something like:

def add_names(extra_dict, func):
    def temp(*args, **kwargs):
        # this has no effect: writes to locals() inside a function are
        # not reflected back into the function's actual namespace
        locals().update(extra_dict)
        return func(*args, **kwargs)
    return temp

add_names(ldict, func)()

However, updating locals() does not work. Even the exec trick does not work here, because the dictionary might contain function objects that cannot be re-created by exec-ing source code. There are many discussions about this online, but I do not see a good solution.

In the end, we might have to manually merge locals into globals and remove the locals, but this will not work in general, because we allow arbitrary statements in steps.
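
For what it is worth, merging does make the failing example above work, because a single dictionary serving as both globals and locals behaves like module scope:

merged = {**gdict, **ldict}   # one dict used as both globals and locals

exec('''
def func():
    print(b)

func()
''', merged, merged)          # prints 2: func's global lookup now finds b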

BoPeng commented 8 years ago

I found a solution, which is as ugly as hell. Note that the name lookup order is

  1. local <- the function's locals; cannot change
  2. global <- SoS globals; difficult to change
  3. nested <- the decorator-style trick above; does not work
  4. builtins <- ???

So

gdict = {'a': 1}
ldict = {'b': 2}

# promote the step's local names into builtins, the last stop of name lookup
# (note: __builtins__ is the builtins module in __main__ but can be a plain
# dict in imported modules, where setattr would fail)
for k, v in ldict.items():
    setattr(__builtins__, k, v)

exec('''
def func():
    print(b)

func()
''', gdict, ldict)

works. As long as we do not overwrite anything that already exists in __builtins__, this should be safe...

Finally, a less ugly but resource-intensive solution would be to use a single globals dictionary but execute each step in a separate process; then any change to the globals would not affect the main process. This might be necessary anyway because of the potential to run steps in parallel.
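
A rough sketch of that process-per-step idea (all names here are hypothetical):

import multiprocessing as mp

def run_step(statements, shared_globals):
    ns = dict(shared_globals)      # each worker works on its own copy
    exec(statements, ns)
    return ns.get('result')        # only the needed result travels back

if __name__ == '__main__':
    steps = [('result = a + 1', {'a': 1}),
             ('result = a * 2', {'a': 1})]
    with mp.Pool(2) as pool:
        print(pool.starmap(run_step, steps))   # [2, 2]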

gaow commented 8 years ago

Oh... I do not think I understand the example. So here, b is set to 2 and is accessible even inside the exec? What does gdict do here? And the exec defines a function -- in which namespace?

BoPeng commented 8 years ago

Python's exec function accepts global and local dictionaries in which to execute the passed statements. I am demonstrating that b, passed in the local dictionary, is accessible from a top-level expression but not from within a function.
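
In compact form (same behavior as the longer example above):

gdict, ldict = {}, {'b': 2}
exec('print(b)', gdict, ldict)                  # works: top-level lookup checks ldict
exec('def f(): print(b)\nf()', gdict, ldict)    # NameError: f's body skips ldict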

BoPeng commented 8 years ago

OK, I have submitted a patch that merges the global and local namespaces of the SoS script. This is not ideal, but at least the SoS script namespace is now separated from the SoS runtime. The separation of local and global namespaces could still be done by either of the two methods proposed, but let us wait and see: the implementation of the DAG will require steps to be executed in parallel, so the problem resolves naturally if steps are executed in separate processes.

BoPeng commented 8 years ago

I have fixed this problem by running each step in its own process; only the aliased object is sent back to the master process (see TestExecute.testLocalNamespace). Now we are sure that variables in one step will not affect others.