scidash / neuronunit

A package for data-driven validation of neuron and ion channel models using SciUnit
http://neuronunit.scidash.org

Should we switch from Scoop to IPyParallel? #80

Closed rgerkin closed 7 years ago

rgerkin commented 7 years ago

@russelljjarvis

russelljjarvis commented 7 years ago

SCOOP vs. ipyparallel

Strengths, weaknesses, opportunities, threats:

Arguments in favor of SCOOP:

SCOOP drawbacks:

ipyparallel advantages:

ipyparallel disadvantages:

https://github.com/0x0L/jupyter-cluster
https://github.com/ActivisionGameScience/ipyparallel-mesos/
https://github.com/ipython/ipyparallel/issues/51

** It seems like the resulting notebook would not be user-friendly to run, and would therefore be in violation of notebook philosophy.

** Mapping a Docker container to a Singularity image would be clumsy, since the Singularity image would need a Docker daemon and two containers to execute. From experience, I know that running Docker inside Singularity is highly discouraged and does not work well, although this could change quickly.

Apache Spark is another parallel framework for Python. I don't really understand it, but there seems to be a lot of community support for it.

Future directions.

It could make sense to try to do all future analysis and plotting development work on the optimization outputs in parallel IPython notebooks.

But also to make an initial notebook that calls os.system('python -m scoop filename.py') via the ! shell magic, checkpoints the namespace variables created in those files, and pulls the checkpointed names back into the IPython notebook, perhaps using a query module.
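The shell-out-and-checkpoint idea above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the script body, file names, and the pickled variable are hypothetical, and a real run would invoke python -m scoop on the script instead of the plain interpreter.

```python
# Hedged sketch: run a worker script in a subprocess (standing in for
# "python -m scoop filename.py"), have it checkpoint its namespace
# variables with pickle, then pull them back into the notebook process.
import os
import pickle
import subprocess
import sys
import tempfile
import textwrap

# Stand-in worker script; a real one would do the SCOOP optimization run.
script = textwrap.dedent("""
    import pickle, sys
    results = {"best_score": 0.9}          # stand-in for optimization output
    with open(sys.argv[1], "wb") as f:
        pickle.dump(results, f)            # checkpoint the namespace variable
""")

with tempfile.TemporaryDirectory() as d:
    script_path = os.path.join(d, "run_opt.py")
    ckpt_path = os.path.join(d, "checkpoint.pkl")
    with open(script_path, "w") as f:
        f.write(script)
    # In a notebook cell this would be:  ! python -m scoop run_opt.py checkpoint.pkl
    subprocess.run([sys.executable, script_path, ckpt_path], check=True)
    with open(ckpt_path, "rb") as f:
        results = pickle.load(f)           # drag the checkpointed names back in
```

The notebook side only ever sees the pickle file, so the worker script stays runnable on its own under SCOOP.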

I can also turn this code into a notebook by placing it in markdown cells.

Needed to make better judgments:

This might be such a thing: https://github.com/vishnu2kmohan/ipyparallel-docker

** A method for translating IPython notebook code blocks into not just pure Python code, but actual pure Python instructions.

** A modification of the scoop.futures launcher's init constructor, such that it expects not a file of executable Python code (executable.py), but instead accepts a long string, or list of strings, of executable Python instructions; i.e., a rewrite of the launcher so that it can act on and satisfy exec(instructions) as opposed to execfile('file.py').

This might sound simple, but it's probably a deceptive amount of work.
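The exec-vs-execfile distinction itself is small; a minimal sketch, with hypothetical instruction strings and names, looks like this. The hard part the comment above refers to is wiring such a call into SCOOP's launcher, which this sketch does not attempt.

```python
# Minimal sketch: run a string of executable Python *instructions* in a
# fresh namespace, the way a rewritten launcher would, instead of a file.
instructions = "\n".join([
    "def fitness(x):",
    "    return x * x",
    "scores = [fitness(i) for i in range(4)]",
])

namespace = {}
# compile() gives the instructions a filename for tracebacks; exec() runs them.
exec(compile(instructions, "<instructions>", "exec"), namespace)
```

After the call, namespace["scores"] holds the result, so the caller can harvest variables exactly as it would from an executed file.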

ipyparallel can also work with numba and just-in-time-compiled numba code.

http://numba.pydata.org/numba-doc/0.11/prange.html
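A hedged sketch of what a numba prange loop looks like, following the page linked above. The fallback shim is only there so the sketch also runs where numba is not installed; with numba present, the loop is JIT-compiled and its iterations may run across threads.

```python
import numpy as np

try:
    from numba import njit, prange      # JIT-compile and parallelize the loop
except ImportError:                     # fallback shim if numba is absent
    def njit(parallel=False):
        def wrap(f):
            return f
        return wrap
    prange = range

@njit(parallel=True)
def parallel_sum(a):
    total = 0.0
    for i in prange(a.shape[0]):        # iterations may execute in parallel
        total += a[i]                   # numba recognizes this as a reduction
    return total

result = parallel_sum(np.arange(10.0))
```

Since the accumulation is a recognized reduction, the result is the same regardless of how the iterations are scheduled.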

russelljjarvis commented 7 years ago

After doing a lot of the work associated with switching, I now believe that the answer is yes, we should switch to ipyparallel.

The two main reasons are better integration with documentation via notebooks, and the fact that notebooks have checkpointing, which can save time on big runs.

I am getting the sense that ipyparallel might be slower than SCOOP; however, I am unsure whether things have to be this way, and whether it's possible to set the backend to ZeroMQ (0MQ), which might even be the default in future releases.

I also do not know exactly how to read the stack traces off workers that fail. I am guessing you can set a flag that creates a verbose log file with each run, but I am unsure how yet.

If stderr becomes accessible, then at least development (including documentation development) might become faster via ipyparallel, if only because of checkpointing.

rgerkin commented 7 years ago

Sounds good.

russelljjarvis commented 7 years ago

Dask, like ipyparallel, supports simple invocation from IPython, and it supports nested calls to map/reduce in a way that truly flattens out an otherwise sequential method-call flow.

ipyparallel supports integration with dask, including dask as a backend. Dask has more mature documentation, in the sense that it is very realistic about performance improvements and up-front about the risk that parallel code will not achieve good enough CPU utilization and efficiency, compared to serial execution that leverages numba JIT compilation of iteration routines and/or Cython for optimization.

https://gist.github.com/mrocklin/dc7fd426a6e405fef275

Dask also has better support for HDF5 and pandas, but possibly less syntactic sugar.

from ipyparallel import Client
c = Client()  # connect to IPyParallel cluster
e = c.become_dask()  # start dask on top of IPyParallel

Since ipyparallel seems to be dask-aware, hopefully some of dask's features will converge into ipyparallel.

russelljjarvis commented 6 years ago

A pure dask implementation might allow for nested calls to map_reduce.

Also, Intel's distributed Python (the Intel Distribution for Python) has since been released, but it comes as binaries only, so it is not really open source, I guess. https://software.intel.com/en-us/distribution-for-python/features

russelljjarvis commented 6 years ago

I actually think we should switch to dask sometime in the future. This is based on what the primary author of ipyparallel recommends on Stack Overflow.

Dask is just as IPython-friendly as ipyparallel; however, I think it's also a more versatile tool, with more extensive documentation.

https://stackoverflow.com/questions/37998484/how-to-efficiently-chain-ipyparallel-tasks-and-pass-intermediate-results-to-engi

Quote from that Stack Overflow thread: "Option 2: dask.distributed

Full disclosure: I'm the primary author of IPython Parallel, about to suggest that you use a different tool.

It is possible to pass results from one task to another via engine namespaces and DAG dependencies in IPython parallel, but honestly, if your workflow looks like this, you should consider using dask distributed, which is designed specifically for this kind of computation graph. If you are already comfortable and familiar with IPython parallel, getting started with dask should not be too much of a burden. "

rgerkin commented 6 years ago

I'm fine with it, but let's delay it until after the next release of NeuronUnit. We really need to focus on code coverage, passing tests, and readable, documented code. If IPyParallel passes simple tests but fails with very large optimizations, that is OK for the next release.

russelljjarvis commented 6 years ago

I accidentally switched first.

Reasons:

http://dask.pydata.org/en/latest/dataframe.html

Import statements such as import dask are sufficient for creating a local-cluster-like environment.

With the single-machine schedulers, use the rerun_exceptions_locally=True keyword:

x.compute(rerun_exceptions_locally=True)

On the distributed scheduler, use the recreate_error_locally method on anything that contains Futures:

From: https://dask.pydata.org/en/latest/debugging.html
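A rough sketch of the debugging pattern the page above describes. Here the synchronous scheduler is used instead of the rerun_exceptions_locally keyword: it runs the task in the calling process, so the exception surfaces as a plain traceback that pdb can step through. The shaky function is hypothetical, and the block is guarded in case dask is absent.

```python
def shaky(x):
    return 1 / x  # hypothetical task; fails when x == 0

try:
    import dask.bag as db
    bag = db.from_sequence([1, 2, 0])
    try:
        # The synchronous scheduler executes in this process, so the
        # ZeroDivisionError arrives as an ordinary, debuggable traceback.
        bag.map(shaky).compute(scheduler="synchronous")
        caught = False
    except ZeroDivisionError:
        caught = True
except ImportError:
    caught = True  # dask unavailable; nothing to demonstrate
```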

russelljjarvis commented 5 years ago

@HamiltonWang.

If you cast a list or numpy array to a dask bag:

import dask.bag as db
pop_bag = db.from_sequence(pop)

you can use:

pop = list(pop_bag.map(method_type).compute())
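A runnable version of the snippet above. Note that the constructor lives in dask.bag (imported here as db), not dask.db; method_type is a stand-in for whatever function is being mapped, and the block falls back to a plain list comprehension in case dask is absent.

```python
import numpy as np

def method_type(x):              # stand-in for the mapped method
    return 2 * x

pop = list(np.arange(5))
try:
    import dask.bag as db        # the bag API: from_sequence / map / compute
    pop_bag = db.from_sequence(pop)
    pop = list(pop_bag.map(method_type).compute(scheduler="synchronous"))
except ImportError:              # fallback so the sketch runs without dask
    pop = [method_type(x) for x in pop]
```

The synchronous scheduler is chosen here only to keep the sketch deterministic; dropping the keyword uses the bag's default multiprocessing scheduler.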