scidash / neuronunit

A package for data-driven validation of neuron and ion channel models using SciUnit
http://neuronunit.scidash.org

trying to pickle tests or judge does not work #94

Closed russelljjarvis closed 6 years ago

russelljjarvis commented 7 years ago
def plot_db(tests, judge, vms):
    # Select a non-interactive backend before pyplot is imported.
    import matplotlib
    matplotlib.use('Agg')
    import matplotlib.pyplot as plt
    plt.style.use('ggplot')
    plt.clf()
    obs = []
    pre = []
    obs.append(judge.observation)
    pre.append(judge.prediction)
    for k, t in enumerate(tests):
        print('observation {0} test {1}'.format(obs, tests[k]))
        print('prediction {0} test {1}'.format(pre, tests[k]))

Pickling is done by ipyparallel; it is necessitated because this is the way workers get method arguments in a call to

view.map_sync(evaluate, foo, bar)

Interestingly, it is only suite.judge that can't be passed between workers using serialization; suite.test is fine.
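A minimal way to isolate which object fails to serialize, independent of ipyparallel, is to round-trip each object directly (a sketch, assuming get_neab is importable and cloudpickle is installed, since ipyparallel serializes through it):

import cloudpickle
import get_neab

for name, obj in [('tests', get_neab.suite.tests), ('judge', get_neab.suite.judge)]:
    try:
        cloudpickle.dumps(obj)
        print(name, 'serializes OK')
    except Exception as e:
        print(name, 'fails:', type(e).__name__, e)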

plot_db(get_neab.suite.tests,get_neab.suite.judge,vms)
def evaluate(vms):  # This function must be pickle-able for ipyparallel to work.
    '''
    Inputs: an individual gene from the population that has compound parameters: a
    virtual model object containing an appropriate parameter set, zipped together
    with an appropriate rheobase value found in a previous rheobase search.

    Outputs: a tuple that is a compound error function that NSGA can act on.

    Assumes the rheobase for each virtual model object (vms) has already been
    found; there should be a check for vms.rheobase, and an error if it is absent.
    '''
    import os
    from neuronunit.models import backends
    from neuronunit.models.reduced import ReducedModel
    import quantities as pq
    import numpy as np
    import get_neab

    # Give each worker process its own copy of the LEMS file.
    new_file_path = str(get_neab.LEMS_MODEL_PATH) + str(os.getpid())
    model = ReducedModel(new_file_path, name=str(vms.attrs), backend='NEURON')
    model.load_model()
    assert vms.rheobase is not None
    model.update_run_params(vms.attrs)
    tests = get_neab.suite.tests
    for t in tests:
        print(t.observation, t.describe())
    prepare_tests(get_neab.suite.tests, vms)  # defined elsewhere in the script
    model.update_run_params(vms.attrs)
    score = get_neab.suite.judge(model, stop_on_error=False, deep_error=True)
    plot_db(get_neab.suite.tests, get_neab.suite.judge, vms)
    model.run_number += 1
    # Collect one error component per test, negated so NSGA can act on them.
    error = []
    vms.score = []
    for my_score in score.sort_key.values.tolist()[0]:
        assert not isinstance(my_score, dict)
        error.append(my_score)
    vms.evaluated = True
    error = [-1.0 * e for e in error]
    vms.error = error
    return tuple(error)  # eight error components, one per test
fitnesses = toolbox.map(toolbox.evaluate, copy.copy(vmpop))
Traceback (most recent call last):
  File "nsga_parallel.py", line 594, in <module>
    fitnesses = toolbox.map(toolbox.evaluate, copy.copy(vmpop))
  File "/opt/conda/lib/python3.5/site-packages/ipyparallel/client/view.py", line 342, in map_sync
    return self.map(f,*sequences,**kwargs)
  File "<decorator-gen-130>", line 2, in map
  File "/opt/conda/lib/python3.5/site-packages/ipyparallel/client/view.py", line 50, in sync_results
    ret = f(self, *args, **kwargs)
  File "/opt/conda/lib/python3.5/site-packages/ipyparallel/client/view.py", line 615, in map
    return pf.map(*sequences)
  File "/opt/conda/lib/python3.5/site-packages/ipyparallel/client/remotefunction.py", line 285, in map
    return self(*sequences, __ipp_mapping=True)
  File "<decorator-gen-120>", line 2, in __call__
  File "/opt/conda/lib/python3.5/site-packages/ipyparallel/client/remotefunction.py", line 76, in sync_view_results
    return f(self, *args, **kwargs)
  File "/opt/conda/lib/python3.5/site-packages/ipyparallel/client/remotefunction.py", line 259, in __call__
    ar = view.apply(f, *args)
  File "/opt/conda/lib/python3.5/site-packages/ipyparallel/client/view.py", line 211, in apply
    return self._really_apply(f, args, kwargs)
  File "<decorator-gen-129>", line 2, in _really_apply
  File "/opt/conda/lib/python3.5/site-packages/ipyparallel/client/view.py", line 50, in sync_results
    ret = f(self, *args, **kwargs)
  File "<decorator-gen-128>", line 2, in _really_apply
  File "/opt/conda/lib/python3.5/site-packages/ipyparallel/client/view.py", line 35, in save_ids
    ret = f(self, *args, **kwargs)
  File "/opt/conda/lib/python3.5/site-packages/ipyparallel/client/view.py", line 557, in _really_apply
    ident=ident)
  File "/opt/conda/lib/python3.5/site-packages/ipyparallel/client/client.py", line 1395, in send_apply_request
    item_threshold=self.session.item_threshold,
  File "/opt/conda/lib/python3.5/site-packages/ipyparallel/serialize/serialize.py", line 166, in pack_apply_message
    serialize_object(arg, buffer_threshold, item_threshold) for arg in args))
  File "/opt/conda/lib/python3.5/site-packages/ipyparallel/serialize/serialize.py", line 166, in <genexpr>
    serialize_object(arg, buffer_threshold, item_threshold) for arg in args))
  File "/opt/conda/lib/python3.5/site-packages/ipyparallel/serialize/serialize.py", line 112, in serialize_object
    buffers.insert(0, pickle.dumps(cobj, PICKLE_PROTOCOL))
  File "/opt/conda/lib/python3.5/site-packages/cloudpickle/cloudpickle.py", line 706, in dumps
    cp.dump(obj)
  File "/opt/conda/lib/python3.5/site-packages/cloudpickle/cloudpickle.py", line 146, in dump
    return Pickler.dump(self, obj)
  File "/opt/conda/lib/python3.5/pickle.py", line 408, in dump
    self.save(obj)
  File "/opt/conda/lib/python3.5/pickle.py", line 520, in save
    self.save_reduce(obj=obj, *rv)
  File "/opt/conda/lib/python3.5/site-packages/cloudpickle/cloudpickle.py", line 586, in save_reduce
    save(args)
  File "/opt/conda/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.5/pickle.py", line 725, in save_tuple
    save(element)
  File "/opt/conda/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.5/site-packages/cloudpickle/cloudpickle.py", line 264, in save_function
    self.save_function_tuple(obj)
  File "/opt/conda/lib/python3.5/site-packages/cloudpickle/cloudpickle.py", line 317, in save_function_tuple
    save(f_globals)
  File "/opt/conda/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.5/pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/conda/lib/python3.5/pickle.py", line 836, in _batch_setitems
    save(v)
  File "/opt/conda/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.5/site-packages/cloudpickle/cloudpickle.py", line 264, in save_function
    self.save_function_tuple(obj)
  File "/opt/conda/lib/python3.5/site-packages/cloudpickle/cloudpickle.py", line 317, in save_function_tuple
    save(f_globals)
  File "/opt/conda/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.5/pickle.py", line 810, in save_dict
    self._batch_setitems(obj.items())
  File "/opt/conda/lib/python3.5/pickle.py", line 836, in _batch_setitems
    save(v)
  File "/opt/conda/lib/python3.5/pickle.py", line 475, in save
    f(self, obj) # Call unbound method with explicit self
  File "/opt/conda/lib/python3.5/pickle.py", line 770, in save_list
    self._batch_appends(obj)
  File "/opt/conda/lib/python3.5/pickle.py", line 794, in _batch_appends
    save(x)
  File "/opt/conda/lib/python3.5/pickle.py", line 495, in save
    rv = reduce(self.proto)
  File "/opt/conda/lib/python3.5/site-packages/sciunit-0.1.5.8-py3.5.egg/sciunit/__init__.py", line 54, in __getstate__
    for key in self.unpicklable:
AttributeError: 'RheobaseTest' object has no attribute 'unpicklable'
rgerkin commented 7 years ago

@russelljjarvis Could you make a minimal recipe that raises this error? There is a lot going on here, so I don't know how to reproduce it on my end.

rgerkin commented 7 years ago

@russelljjarvis I've added a check in sciunit for this attribute, so that the whole unpicklable check will be skipped if that attribute doesn't exist. But really, it should always exist, so I haven't solved the underlying problem (although your code might work now). That's why I'd like a recipe, so I can tackle this problem once and for all.
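For reference, a defensive __getstate__ of the kind described would look like this (a sketch of the pattern, not the exact sciunit source):

def __getstate__(self):
    # Copy instance state, dropping any attributes flagged as unpicklable;
    # getattr with a default makes the loop a no-op when the attribute is absent.
    state = self.__dict__.copy()
    for key in getattr(self, 'unpicklable', []):
        state.pop(key, None)
    return state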

rgerkin commented 7 years ago

@russelljjarvis I pushed some new changes to scidash/neuronunit/dev that might help. They help ensure that all objects can be found again on unpickling.

However, I realize there is another issue I need a better grasp of: which parts of model initialization (with the NEURON backend) should happen once on the host or one worker and just be pushed passively to the others (e.g. simple attributes like a resting potential value), and which parts need to be done independently on each worker. That will determine the structure and content of set_backend, load_model, etc. The nature of pickling requires that we put the first category of things in __init__ (and functions it calls) and the second category in __new__ (and functions it calls). This is because both __new__ and __init__ are called on model instantiation, but only __new__ is called on unpickling. And __new__ cannot simply call __init__, for various reasons. So in order to separate these things, it would help to understand which acts must occur separately for each worker (e.g. maybe self.ns = nrn.NeuronSimulation(self.tstop, dt=0.0025)), so I can make sure those things get walled off from the things that can simply be copied via pickle/unpickle from one worker to the next. Can you help with that?
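A minimal sketch of that asymmetry, using nothing beyond the standard library: __new__ fires on both construction and unpickling, while __init__ fires only on construction.

import pickle

class Model:
    def __new__(cls, *args, **kwargs):
        # Runs on construction AND on unpickling: per-worker setup
        # (e.g. starting a simulator) would belong here.
        obj = super().__new__(cls)
        print('__new__ called')
        return obj

    def __init__(self, name):
        # Runs only on construction, never on unpickling: state that
        # pickle can copy (simple attributes) belongs here.
        print('__init__ called')
        self.name = name

m = Model('a')                       # prints __new__ then __init__
m2 = pickle.loads(pickle.dumps(m))   # prints only __new__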

russelljjarvis commented 7 years ago

This seems like a very complicated issue. I will need to read over it several times.

rgerkin commented 7 years ago

@russelljjarvis I think I finally have a solution to all of this. At least I have gotten past all of the pickling errors, and my computer is running hot working on the optimization. I don't know if it is entirely fixed, but if it doesn't finish, or results in an error, it doesn't seem to be related to simple pickling problems.

1) Rebuild your stacks from scidash/docker-stacks/master (I largely did not use the ParallelPyNeuron version but will eventually need to add your changes in; you might be able to do it just from the ParallelPyNeuron version, but I haven't tested that). This rebuild should pick up the latest changes to the sciunit and neuronunit dev branches. Or you can mount them to /home/jovyan/mnt/sciunit and /home/jovyan/mnt/neuronunit if you want to work with the (updated) versions on your local machine.
2) Use the scidash branch of ParallelPyNeuron, where I have updated reproduce_error/Dockerfile to include a few things that are needed.
3) Check out the dev branch of rgerkin/informatics_poster and place it inside ParallelPyNeuron/reproduce_error next to the Dockerfile. This will contain the updated nsga_parallel_no_vm.py file.
4) Do docker build /path/to/ParallelPyNEURON/reproduce_error -t error
5) Do docker run -it error

Optionally, add -v /path/to/sciunit:/home/jovyan/work/sciunit and -v /path/to/neuronunit:/home/jovyan/work/neuronunit to the lines above to use your local copies of sciunit and neuronunit if they are updated.

This will at least get you past many of the annoying errors that have been plaguing us, but I don't know if it fully works yet because I don't really know what to look for other than the processor doing a lot of work and not getting any errors about attributes not being found.

rgerkin commented 7 years ago

@russelljjarvis Just to update: the nsga_parallel_no_vm.py file did run to completion, and models and pop do contain values that look like the kind of values a successful run would return, but I'm not entirely sure how to check this. What would be a good line to add to the end of the script to check that the optimization in that file ran successfully?
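One possibility is a hypothetical sanity check appended to the script (assuming DEAP-style individuals in pop, and the script's convention of negated errors, so larger summed fitness is better):

assert len(pop) > 0, 'population is empty'
assert all(ind.fitness.valid for ind in pop), 'some individuals were never evaluated'
print('best summed error:', max(sum(ind.fitness.values) for ind in pop))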

rgerkin commented 7 years ago

@russelljjarvis Check that this solution (in the updated sciunit and neuronunit) works for you.

russelljjarvis commented 7 years ago

I think I have a working recipe, but I have not rigorously tested it for the above workflow.

That's funny; I didn't intend to edit your comment below.

I noticed that now, once you select issues assigned to you, you can sort by the highest number of replies and counter-replies. That was a good way to find this thread, which scored second.

rgerkin commented 7 years ago

Working on it.

rgerkin commented 7 years ago

Simplifications after you verify that the above formula works:

russelljjarvis commented 7 years ago

@rgerkin

While working on this issue, I have realized that the reproduce_error Docker container, and the associated code above, are using one version of backends on the hub/controller and a different version on the engines.

This is fine for pickling models and communicating them back to rank 0, the hub/controller.

The result is a list where:

models[0]

is derived from the source /home/jovyan/mnt/neuronunit/backends.py

and

models[1:-1]

is derived from source: /home/jovyan/neuronunit/neuronunit/models/backends.py

Only models[1:-1] are transported from the engines/clients back to the hub/controller. models[0] does not cause a problem at this stage because it does not go anywhere.

The way I investigated this issue was to use the code file you referred to for viewing stdout/stderr on the engines; see below for a reference back to that code.

It's in a file called stdout_worker.py (contents of the file are at the bottom of this comment). I run it in the background with python stdout_worker.py & before running ipython -i nsga_parallel_no_vm.py.

With the code that's executed inside dview.map_sync, after the statement from neuronunit import models, a subsequent print(models.__file__) shows /home/jovyan/neuronunit/neuronunit/models/__init__.py on the engines/clients,

and

/home/jovyan/mnt/neuronunit/models/__init__.py or /home/jovyan/informatics_poster/neuronunit/models/__init__.py

on the hub/controller, depending on which code I run from.
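A minimal sketch that makes the mismatch visible in one place (same ipyparallel client pattern as the path_check.py script later in this thread):

import ipyparallel as ipp
rc = ipp.Client(profile='default')
dview = rc[:]

def model_source():
    # Report where the neuronunit.models package was imported from.
    from neuronunit import models
    return models.__file__

print('hub/controller:', model_source())
print('engines:       ', set(dview.apply_sync(model_source)))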

The upshot of all of this is that getting consistency between the hub's paths and the clients' paths remains an unsolved problem.

I have attempted to circumvent that problem by only making the model on the hub (where the paths are what I intend/expect), and then communicating the models to the workers/clients.
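A sketch of that hub-side strategy, assuming dview as above and a hypothetical model_paths list; the models are built once where the import paths are known-good, then pickled out to the engines:

from neuronunit.models.reduced import ReducedModel

models = [ReducedModel(p, backend='NEURON') for p in model_paths]  # hub only
dview.scatter('models', models)  # each engine receives a pickled slice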

This approach works okay; however, I am getting a strange error message at the moment that I am trying to work through:

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    /opt/conda/lib/python3.5/site-packages/ipyparallel/serialize/serialize.py in unpack_apply_message(bufs, g, copy)
    199     kwargs = {}
    200     for key in info['kw_keys']:
--> 201         kwarg, kwarg_bufs = deserialize_object(kwarg_bufs, g)
    202         kwargs[key] = kwarg
    203     assert not kwarg_bufs, "Shouldn't be any kwarg bufs left over"

    /opt/conda/lib/python3.5/site-packages/ipyparallel/serialize/serialize.py in deserialize_object(buffers, g)
    130     bufs = list(buffers)
    131     pobj = buffer_to_bytes_py2(bufs.pop(0))
--> 132     canned = pickle.loads(pobj)
    133     if istype(canned, sequence_types) and len(canned) < MAX_ITEMS:
    134         for c in canned:

    /opt/conda/lib/python3.5/copyreg.py in __newobj_ex__(cls, args, kwargs)
     92     keyword-only arguments to be pickled correctly.
     93     """
---> 94     return cls.__new__(cls, *args, **kwargs)
     95 
     96 def _slotnames(cls):

    /home/jovyan/neuronunit/neuronunit/models/__init__.py in __new__(cls, *args, **kwargs)
     44         """
     45         LEMS_file_path: Path to LEMS file (an xml file).
---> 46         name: Optional model name.
     47         """
     48         #print("Calling new")

    /home/jovyan/neuronunit/neuronunit/models/__init__.py in set_backend(self, backend)
    109 
    110         elif name is None:
--> 111             # The base class should not be called.
    112             raise Exception(("A backend (e.g. 'jNeuroML' or 'NEURON') "
    113                              "must be selected"))

    /home/jovyan/neuronunit/neuronunit/models/backends.py in init_backend(self, attrs)
    130         self.h.load_file("stdgui.hoc")
    131         #self.h.cvode.active(1)
--> 132         self.orig_lems_file_path = None
    133         #pdb.set_trace()
    134         #self.h.cvode.active

    AttributeError: 'ReducedModel with NEURON backend with NEURON backe' object has no attribute 'orig_lems_file_path'
engine.5.error:
"""A script for watching all traffic on the IOPub channel (stdout/stderr/pyerr) of engines.

This connects to the default cluster, or you can pass the path to your ipcontroller-client.json

Try running this script, and then running a few jobs that print (and call sys.stdout.flush),
and you will see the print statements as they arrive, notably not waiting for the results
to finish.

You can use the zeromq SUBSCRIBE mechanism to only receive information from specific engines,
and easily filter by message type.

Authors
-------
* MinRK
"""

import sys
import json
import zmq

from jupyter_client.session import Session
from ipykernel.connect import find_connection_file

def main(connection_file):
    """watch iopub channel, and print messages"""

    ctx = zmq.Context.instance()

    with open(connection_file) as f:
        cfg = json.loads(f.read())

    reg_url = cfg['interface']
    iopub_port = cfg['iopub']
    iopub_url = "%s:%s"%(reg_url, iopub_port)

    session = Session(key=cfg['key'].encode('ascii'))
    sub = ctx.socket(zmq.SUB)

    # This will subscribe to all messages:
    sub.SUBSCRIBE = b''
    # replace b'' with b'engine.1.stdout' to subscribe only to engine 1's stdout
    # 0MQ subscriptions are simple 'foo*' matches, so 'engine.1.' subscribes
    # to everything from engine 1, but there is no way to subscribe to
    # just stdout from everyone.
    # multiple calls to subscribe will add subscriptions, e.g. to subscribe to
    # engine 1's stderr and engine 2's stdout:
    # sub.SUBSCRIBE = b'engine.1.stderr'
    # sub.SUBSCRIBE = b'engine.2.stdout'
    sub.connect(iopub_url)
    while True:
        try:
            idents,msg = session.recv(sub, mode=0)
        except KeyboardInterrupt:
            return
        # ident always length 1 here
        topic = idents[0].decode('utf8', 'replace')
        if msg['msg_type'] == 'stream':
            # stdout/stderr
            # stream names are in msg['content']['name'], if you want to handle
            # them differently
            print("%s: %s" % (topic, msg['content']['text']))
        elif msg['msg_type'] == 'error':
            # Python traceback
            c = msg['content']
            print(topic + ':')
            for line in c['traceback']:
                # indent lines
                print('    ' + line)

if __name__ == '__main__':
    if len(sys.argv) > 1:
        pattern = sys.argv[1]
    else:
        # This gets the security file for the default profile:
        pattern = 'ipcontroller-client.json'
    cf = find_connection_file(pattern)
    print("Using connection file %s" % cf)
    main(cf)
    #import nsga_parallel
russelljjarvis commented 7 years ago

Since last updating this thread, I may have found the source of some, but not all, of these problems.

The way I established that the problem is an engine/controller inconsistency in the default working directory was to write a very basic program, as follows:

path_check.py contents below:

import os
import ipyparallel as ipp
from ipyparallel import require

rc = ipp.Client(profile='default')
dview = rc[:]

@require('os')
def path_consistency():
    return str(os.getcwd())

C_local = path_consistency()
C_remote = dview.apply(path_consistency).get()
assert C_local == C_remote[0]

It's possible that the line os.system('export IPYTHONDIR={0}'.format(os.getcwd())) might make the launch paths more robust (though note that os.system runs the export in a subshell, so os.environ['IPYTHONDIR'] = os.getcwd() is what would actually affect the current process). The path C_local will be the directory this Python program was launched from. The path C_remote[0] is the directory the command ipcluster start -n 8 --profile=default was launched from, which could be an entirely different directory.

I had set the last line of the issue94 dockerfile to be:

ENTRYPOINT ipcluster start -n 8 --profile=default & sleep 5 & /bin/bash \n 
& ipython stdout_worker.py &

The critical part is ipcluster start -n 8 --profile=default.

Its default behavior is to set the clients'/engines' working directory then and there, as if by --workdir=`pwd`.

Where the value of pwd is, of course, the last value of WORKDIR in the Dockerfile.

I have decided this is not explicit enough, so rather than defining an entry point, I have instead created a bash alias that can be run once you have navigated to your intended directory.

RUN echo "alias sc='export IPYTHONDIR=`pwd`; \
ipcluster start -n 8 --profile=default --workdir=`pwd` & \
sleep 5; python stdout_worker.py &'" >> ~/.bashrc

However, none of this explains why the workers' directories cannot be rewritten by editing sys.path inside a dview.apply_sync or a dview.sync_imports() call. My sense is that you just cannot update sys.path on the clients/engines in a persistent way, and the directory you launch ipcluster from is the one the engines are stuck with, which should be workable anyway.
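For reference, the kind of per-engine edit being described looks roughly like this (a sketch; dview is a DirectView as elsewhere in this thread, and the path is illustrative):

def prepend_path(p):
    import sys
    if p not in sys.path:
        sys.path.insert(0, p)
    return sys.path[0]

print(dview.apply_sync(prepend_path, '/home/jovyan/mnt/neuronunit'))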

The worst part about the situation is that you can of course update sys.path on the controller/hub, so models instantiated on the controller and models instantiated on the engines can be different types of objects (due to having different sources).

For different reasons, it might be desirable to create a view into the clients that excludes the controller anyway. The motivation for this is to leave the controller free of work, so it can better handle interprocess communication.
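Strictly, the ipyparallel controller is a separate process from the engines, so a view never includes it; but excluding, say, an engine co-located with the controller would look like this sketch:

rc = ipp.Client(profile='default')
workers = rc[1:]  # every engine except engine 0
fitnesses = workers.map_sync(evaluate, vmpop)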

rgerkin commented 6 years ago

Closing this because it is either fixed or is obsolete with the current version of the Backend implementation.