This is a namespace issue that I am trying to address. We now have SoS script locals (for SoS variables), SoS script globals, SoS command locals and globals, and we should also handle the namespaces of spawned step processes. The first two belong to the SoS script and the last two to the SoS runtime. Previously we had these namespaces mixed, which means malicious user-defined code could interfere with how SoS works (e.g. replace the run action). The updated code tries to separate them, but get_md5 is now defined in the SoS script's local namespace while being called in the SoS runtime's local namespace.
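As a hypothetical illustration (not SoS code), this is the kind of interference a shared namespace allows: a script-level assignment can shadow a runtime action such as run.

# Hypothetical illustration: if user code executes in the same namespace
# as the runtime, a simple assignment can replace a runtime action.
runtime_env = {'run': lambda cmd: print('runtime run:', cmd)}
user_script = "run = lambda cmd: print('hijacked!')"
exec(user_script, runtime_env)       # user script runs in the shared namespace
runtime_env['run']('echo hello')     # prints 'hijacked!', not the real action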
On Fri, Mar 25, 2016 at 10:46 AM, gaow notifications@github.com wrote:
After updating to the current SoS my previous code does not work:
def get_md5(value):
    import sys, hashlib
    base, ext = value.rsplit('.', 1)
    res = hashlib.md5(base.encode('utf-8')).hexdigest() if sys.version_info[0] == 3 else hashlib.md5(base).hexdigest()
    return '{}.{}'.format(res, ext)
[simulate_1: alias = 'simulate']
n = [1000]
true_mean = [0, 1]
output_suffix = 'rds'
input: for_each = ['n', 'true_mean']
output: pattern = 'exec=rnorm.R::n=${_n}::true_mean=${_true_mean}.${output_suffix}'
_output = [get_md5(x) for x in _output]
run: ...
The error message is:
ERROR: Failed to execute subworkflow: Failed to assign [get_md5(x) for x in _output] to variable _output: name 'get_md5' is not defined
What is the proper way to use customized Python functions with current SoS?
I see. Will this syntax remain the same so that I should wait for an update of SoS, or is there an alternative syntax I can use to achieve my purpose? I assume moving it inside an SoS step is not a good idea because the function will be used in multiple steps.
I also disabled writing to the _step object directly, made variables defined in global definitions read-only, and I might have disallowed changing _input, _output and _depends from the SoS script. As useful as it might be, allowing SoS scripts to change these variables can be really troublesome. For example,
[parameters]
var = a
uses a default value defined in the global section. If a can be changed, the default value of this variable can be changed. This is problematic when the workflow is used as a nested workflow, where another workflow might change a accidentally and cause unpredictable results.
The same logic applies to _step and its aliases. If we allow changing alias.output, the same step might behave differently depending on the execution of another step (one that changes an existing alias), so theoretically speaking we cannot execute any steps in parallel.
I am working on this. I left this unfixed, but I need to set up sensible rules for what can and cannot be accessed, and in what way, from within SoS steps... Your code should work though.
More specifically, SoS currently executes these functions (e.g. run) in its own namespace because they are not available in the SoS script namespace. What we need to do is inject only the ones that are needed by the SoS script into its namespace before the execution of the script, and evaluate all function calls in the SoS script namespace. In this way SoS itself will not be affected by malicious scripts.
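A minimal sketch of this injection idea, with hypothetical names (run here is a stand-in, not the actual SoS action):

# Build the script's namespace from a whitelist of actions, so the
# runtime's own namespace is never exposed to user code.
def run(script):                        # stand-in for a real SoS action
    print('running:', script)

EXPOSED_ACTIONS = {'run': run}

script_env = dict(EXPOSED_ACTIONS)      # inject only what the script needs
exec("run('echo hello')", script_env)   # evaluated in the script's namespace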
The updates are reasonable when -j comes into the picture. Currently my code does have a line that changes _output directly. What if we need to adjust the names of an output within a step?
I believe updating _output yields a warning now. The idea is that we should not need to manipulate _output directly if we have a strong enough syntax in the step output. We will see.
Indeed. Are you suggesting designing some options that allow for flexible post-processing of these file names at the end of the input: and output: specifications?
Yes. The easiest and most general solution is to allow the specification of a user-defined function that takes _output and returns massaged filenames. Any suggestion on a parameter name, or another solution?
I have added a test case (TestExecute.testUserDefinedFunc) but it works. Could you add a test case that demonstrates the problem? Might it involve a subworkflow?
Using another named parameter might be straightforward, and a rename= option might do if all we do is renaming. But that would restrict the renaming to a single line, so perhaps users need to provide customized functions; e.g., my script can be written as:

output: pattern = 'exec=rnorm.R::n=${_n}::true_mean=${_true_mean}.${output_suffix}', rename = [get_md5(x) for x in _output]

and is it cool to have _output here?
By "general" I meant the function should be able to remove some output files. No, I do not think it is good to use _output here. The usage should be something like

rename=get_md5

or

rename=lambda x, **kwargs: get_md5(x[0])

if get_md5 cannot handle multiple names.
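A plain-Python sketch of how such a hook might behave; rename= is only a proposal at this point and apply_rename is a hypothetical helper:

import hashlib

def get_md5(value):
    # same function as at the top of this thread (Python 3 branch only)
    base, ext = value.rsplit('.', 1)
    return '{}.{}'.format(hashlib.md5(base.encode('utf-8')).hexdigest(), ext)

def apply_rename(rename, output_files, **kwargs):
    # the callable may rename files, or drop some of them entirely
    return list(rename(output_files, **kwargs))

_output = ['exec=rnorm.R::n=1000::true_mean=0.rds']
print(apply_rename(lambda x, **kwargs: [get_md5(f) for f in x], _output))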
I just updated the test. It turns out to be an issue with list comprehensions. If I put [myfunc() for x in range(5)][0] instead of just myfunc() as in the test, it will fail.
Oh, this is a nested namespace issue. myfunc needs to be in the globals namespace for it to be seen by the list comprehension, but it is currently imported into the locals namespace.
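A minimal demonstration of the effect (behavior as observed with the Python 3 versions of the time): a plain call resolves names through the locals passed to exec, but a list comprehension runs in its own scope and falls back to globals.

gdict = {}
ldict = {'myfunc': lambda: 1}
# a plain call finds myfunc in the exec locals
exec('print(myfunc())', gdict, ldict)
# the comprehension body is its own scope and skips the exec locals
try:
    exec('print([myfunc() for x in range(5)][0])', gdict, ldict)
except NameError as e:
    print(e)                  # name 'myfunc' is not defined
# moving the function into globals makes the comprehension work
gdict['myfunc'] = ldict['myfunc']
exec('print([myfunc() for x in range(5)][0])', gdict, ldict)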
Just to use this thread to summarize the namespace rules:

- SoS scripts will be executed by eval or exec with their own locals and globals. This will make sure SoS itself will not be affected by the SoS script.
- Step variables (_step, _input ...) live in the local namespace, which will be cleared after the execution of each step.
- Spawned step processes get their own copies of step variables (_input, _index etc) because these differ process by process.

Pending problems:

- How to pass back StepInfo objects (aliased step information), because right now a step only leaves such information in the global namespace.

Great summary! I can change my implementation to avoid using a list comprehension. But how do I make it global instead? From reading the summary I still haven't figured out how it would work in practice if I stick to my current syntax (again, I can change it, but I am just curious).
The problems are not resolved. We could use a single global namespace for everything, but the problem is that if step_15 can use a function or variable defined in step_10, it cannot be executed before it. Right now the DAG will be set up purely from file dependencies, so we have to make functions and variables defined in steps local; then function calls within a step will face the triple-namespace problem and might have trouble accessing functions defined locally.
gdict = {'a': 1}
ldict = {'b': 2}
# this works
eval('a+b', gdict, ldict)
# a function only sees 'a' (from globals) and its own local namespace
exec('''
def func():
    print(locals().keys())
    print(globals().keys())
func()
''', gdict, ldict)
# so this fails
exec('''
def func():
    print(b)
func()
''', gdict, ldict)
A seemingly feasible way to address this problem is to insert ldict into a nested namespace of func so that it can be accessed, something like:
def add_names(extra_dict, func):
    def temp(*args, **kwargs):
        # This does not work
        locals().update(extra_dict)
        return func(*args, **kwargs)
    return temp

add_names(ldict, func)()
However, updating locals() does not work. Even the exec trick does not work here, because the dictionary might contain functions, which cannot be exec-ed. There are many discussions about this online but I do not see a good solution.
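For reference, this is why updating locals() has no effect inside a function: in CPython, locals() returns a snapshot dictionary of the frame's variables, and writes to it are not copied back.

def demo():
    locals().update({'c': 3})  # modifies the snapshot, not the frame
    try:
        print(c)
    except NameError as e:
        print(e)               # name 'c' is not defined

demo()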
In the end, we might have to manually merge locals into globals and remove locals, but this will not work because we are allowing arbitrary statements in steps.
I found a solution, which is as ugly as hell. Note that the namespace lookup rule is locals first, then globals, then builtins. So
gdict = {'a': 1}
ldict = {'b': 2}
for k, v in ldict.items():
    setattr(__builtins__, k, v)
exec('''
def func():
    print(b)
func()
''', gdict, ldict)
works. As long as we do not overwrite anything that already exists in __builtins__, this should be safe...
Finally, a less-ugly but resource-intensive solution would be to use a single globals dictionary but execute each step in a separate process; then any change to globals would not affect the main process. This might be necessary anyway because of the potential to run steps in parallel.
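A minimal sketch of that idea, with hypothetical names (not the actual SoS implementation): the child process receives a copy of the environment, executes the step in it, and sends back only the selected result.

import multiprocessing

def execute_step(step_code, env, result_queue):
    # env is a copy in the child process; mutations stay local to it
    exec(step_code, env)
    # send back only what the master should see
    result_queue.put({k: v for k, v in env.items() if k == 'result'})

if __name__ == '__main__':
    env = {'a': 1}
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=execute_step,
                                args=('a = 100\nresult = a + 1', env, q))
    p.start()
    print(q.get())   # {'result': 101}
    p.join()
    print(env['a'])  # still 1 in the master process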
Oh ... I do not think I understand the example. So here, b is set to 2 and is accessible even inside the exec? What does gdict do here? And the exec action is for the definition of functions -- in what namespace?
Python's exec function accepts a global and a local dictionary in which to execute the passed statement. I am demonstrating that b, passed in the local dictionary, is accessible from an expression, but not from within a function.
Ok, I have submitted a patch to merge the global and local namespaces of the SoS script. This is not ideal, but at least the SoS script namespace is now separated from the SoS runtime. The separation of the local and global namespaces could be done by either of the two methods proposed, but let us wait and see, because the implementation of the DAG will require steps to be executed in parallel, so the problem resolves naturally if the steps are executed in separate processes.
I have fixed this problem by running each step in its own process, with only the aliased object sent back to the master process (see TestExecute.testLocalNamespace). Now we are sure that variables in each step will not affect other steps.