roryk / ipython-cluster-helper

Tool to easily start up an IPython cluster on different schedulers.

Modules in PYTHONPATH cannot be loaded because of -E python option in cluster.py #54

Open ofajardo opened 7 years ago

ofajardo commented 7 years ago

Hi,

First of all, I would like to congratulate you on this excellent module. It makes ipython-parallel very easy to use, and it is much easier than the other parallelization modules I have been testing. It works well on our cluster, where we have Slurm, and it even spares us the nuisance of writing the Slurm batch files!

But we have a problem. In our environment we use the module loader "Lmod". It makes it very easy to keep different libraries, library versions, or applications side by side (it works not only for Python, but for everything), to switch from one to another conveniently, and to share the same environment across multiple machines and users. It is a really robust system for production.

The issue is that when Lmod loads a Python module it, among other things, prepends the path of the module to the PYTHONPATH variable. When ipython-cluster-helper starts a Python interpreter in cluster.py, it uses the -E flag, which makes Python ignore the PYTHON* environment variables, PYTHONPATH among them. The consequence is that the processes that ipython-parallel launches do not see third-party libraries loaded with Lmod. Removing the "-E" flag from cluster.py cures the problem.
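The effect can be reproduced without a cluster. The snippet below is only a minimal sketch, not code from cluster.py: `child_sees_path` is a hypothetical helper that puts a directory on PYTHONPATH and asks a child interpreter, started with or without -E, whether that directory ended up on its sys.path.

```python
import os
import subprocess
import sys
import tempfile

# Code run by the child: report whether the directory passed as argv[1]
# is present on the child's sys.path.
CHILD_CODE = (
    "import os, sys; "
    "target = os.path.realpath(sys.argv[1]); "
    "print(any(os.path.realpath(p) == target for p in sys.path))"
)


def child_sees_path(extra_dir, ignore_env):
    """Return True if a child interpreter has extra_dir on its sys.path.

    Hypothetical helper for illustration; extra_dir is exported through
    PYTHONPATH, and ignore_env adds the -E flag to the child command line.
    """
    env = dict(os.environ, PYTHONPATH=extra_dir)
    argv = [sys.executable]
    if ignore_env:
        argv.append("-E")
    argv += ["-c", CHILD_CODE, extra_dir]
    out = subprocess.check_output(argv, env=env, universal_newlines=True)
    return out.strip() == "True"


if __name__ == "__main__":
    module_dir = tempfile.mkdtemp()  # stands in for an Lmod-provided directory
    print("without -E:", child_sees_path(module_dir, ignore_env=False))
    print("with -E:   ", child_sees_path(module_dir, ignore_env=True))
```

Here the temporary directory plays the role of a path that Lmod would prepend to PYTHONPATH.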

Would it be possible to have some kind of option to not use this -E flag?

Under which circumstances is it good to have the -E flag? Why did you introduce it? Normally I would say that you want the child processes to have the same environment as the master.

Thanks!

roryk commented 7 years ago

Thanks @ofajardo,

Thanks for the nice note. We set the -E flag to get around endless issues we were seeing with people who had different site libraries set up, with modules built against and linking to different libraries than the ones they were running ipython-cluster-helper with, which causes really hard-to-debug incompatibilities. We've found it pretty useful to install a private Python rather than use one loaded through a module-type system. https://conda.io/miniconda.html is really lightweight, only takes a few minutes to get up and running, and doesn't need root: it is your own installation.

If that isn't an option, let us know and we can add an option to remove that flag for you.

ofajardo commented 7 years ago

Roryk, thanks a lot for your quick answer.

Yes, I see your point. However, we solve this mess with modules using the opposite approach: a system-wide central place for modules that all users and machines can access. Users then pick the modules they need. This helps a lot with sharing between different users, and with using generic accounts instead of personal ones for production workflows.

So, yes it would be really awesome if you could add such an option to get rid of the -E flag!

I also have a couple of other questions: Considering your example code:

from yourmodule import long_running_function

1- If I print anything inside long_running_function, I don't see the output in either the console or the log files. Is that the way it is, or is there any chance to get the printing?

2- Let's say long_running_function uses a third-party library, for example biopython. If I do as you propose, it works. However, if I put long_running_function in the same script where I am calling the view, then long_running_function says there is no module biopython, and to solve that I have to import biopython again inside the function. Is that the way it should be, or am I again seeing some strange effect of my environment?

This is what I am talking about:

from cluster_helper.cluster import cluster_view
import Bio

##
# Parameters
num_jobs = 4 #number of jobs = number of CPUs
biopython_child = True #try to import biopython in the function, if True it produces the error

def sum_list(mylist, biopython_child):
    """
    Returns the sum of the list
    """
    if biopython_child:
        import Bio #this is necessary otherwise error
        print "biopython version from child:", Bio.__version__ # I cant see this printing
    return sum(mylist)

if __name__ == "__main__":

    lists = [range(0,100) for x in range(0,num_jobs)]
    flags = [biopython_child] * num_jobs

    with cluster_view(scheduler="slurm", queue="defq", num_jobs=num_jobs, cores_per_job=1) as view:
        print "biopython version from main:", Bio.__version__
        # map 
        map_result = view.map(sum_list, lists, flags)
        # reduce
        reduce_result = sum(map_result)
        assert reduce_result ==  4950 * num_jobs
        print "your result is:", reduce_result
        print "done!"

Error if Bio is not imported inside sum_list:

ipyparallel.error.CompositeError: one or more exceptions from call to method: sum_list [Engine Exception]NameError: global name 'Bio' is not defined

Thanks a lot in advance!

roryk commented 7 years ago

Hi @ofajardo,

Unfortunately, the imports need to be inside the function if you define the function in the main script like this. The reason is that the function gets pickled and sent to the engines, and that does not capture the Bio library loaded in the main script. If the function is part of a module, you don't need to do that.
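A rough, local analogy of what happens on an engine, using only the standard library. This is a simplified sketch and not how ipyparallel actually serializes functions: `rebuild_without_globals` is a hypothetical helper that rebuilds a function from its code object with an empty global namespace, and `math` stands in for Bio.

```python
import builtins
import math
import types


def uses_global_import(x):
    # Relies on the module-level "import math" of the main script.
    return math.sqrt(x)


def uses_local_import(x):
    # Imports what it needs itself, so it works in any namespace.
    import math
    return math.sqrt(x)


def rebuild_without_globals(func):
    """Hypothetical stand-in for an engine: same code, fresh globals.

    The rebuilt function keeps access to builtins but has none of the
    names imported at the top of the main script.
    """
    fresh_globals = {"__builtins__": builtins}
    return types.FunctionType(func.__code__, fresh_globals, func.__name__)


if __name__ == "__main__":
    print(rebuild_without_globals(uses_local_import)(4.0))
    try:
        rebuild_without_globals(uses_global_import)(4.0)
    except NameError as err:
        print("NameError:", err)
```

The second call fails with a NameError for the same reason sum_list does above: the name is looked up in globals that the rebuilt function no longer has.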

The output from the print statements should show up in the stderr and stdout of the engine jobs. To capture it you'd have to set up some custom logging. It is easiest if the functions you send for computation are pure, so they don't have side effects like printing or writing out a file; otherwise you end up having to debug parallel writing/printing issues and whatnot.
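One way to keep the function pure and still see diagnostics is to return them along with the result. A minimal sketch, assuming the caller is happy to print on the main side; the builtin map stands in for view.map so the snippet runs without a cluster, and `platform` is just an example of something worth reporting.

```python
def sum_list(mylist):
    """Pure worker: no printing, diagnostics travel back with the result."""
    # Imported inside the function so it is self-contained on an engine.
    import platform
    info = "python version from child: %s" % platform.python_version()
    return sum(mylist), info


num_jobs = 4
lists = [range(0, 100) for _ in range(num_jobs)]

# The builtin map is used here in place of view.map (an assumption made
# only so the example is runnable locally).
results = list(map(sum_list, lists))

totals = [total for total, _ in results]
messages = [info for _, info in results]

if __name__ == "__main__":
    for message in messages:
        print(message)  # printing happens in the main process only
    print("your result is:", sum(totals))
```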

I'll add an option to remove -E for you all.

ofajardo commented 7 years ago

Thanks roryk, those issues were minor, and what you say makes sense, so we can work like that. It is good, though, to know it is the expected behavior and not some incompatibility with our environment. Thanks in advance for the option to remove -E!

roryk commented 7 years ago

To get this working, did you just remove the -E from the cluster setup, or did you have to remove it from the engine/controller jobs as well?

ofajardo commented 7 years ago

I removed all occurrences of -E in cluster.py. I did not test what the minimum is that must be removed for it to still work ... I can do that if you like; let me know if I should, and I can do it after I am back in the office in the first week of September.

ofajardo commented 7 years ago

Hi again,

In order for it to work I had to take out the -E from the function create_throwaway_profile and from the variables cluster_cmd_argv, engine_cmd_argv, controller_cmd_argv.

The only place that did not need to be modified was the init of the ClusterView class, most likely because IPython itself is installed in site-packages, while many of our other modules are not in site-packages but come through this Lmod mechanism. From that point of view I would say it sounds safer to remove it from that init as well, as IPython may eventually not be in site-packages.
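For illustration, an option like the one requested could be threaded through every place that builds an interpreter command line. This is only a sketch under assumptions: `build_python_argv` and `ignore_env` are hypothetical names and not part of ipython-cluster-helper, and the real contents of cluster_cmd_argv, engine_cmd_argv and controller_cmd_argv are not reproduced here.

```python
import sys


def build_python_argv(code, ignore_env=True):
    """Build an interpreter command line, optionally with -E.

    Hypothetical helper: ignore_env=True keeps the current behavior
    (PYTHONPATH and the other PYTHON* variables are ignored), while
    ignore_env=False lets Lmod-provided paths through.
    """
    argv = [sys.executable]
    if ignore_env:
        argv.append("-E")
    argv.extend(["-c", code])
    return argv


if __name__ == "__main__":
    # Placeholder code string; the real launch commands differ.
    print(build_python_argv("print('engine')"))
    print(build_python_argv("print('engine')", ignore_env=False))
```

Using one helper for the profile, cluster, engine and controller commands would keep the flag consistent across all of them.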