@russelljjarvis Could you put together a minimal recipe that raises this error? There is a lot going on here, and I don't know how to reproduce it on my end.
@russelljjarvis I've added a check in sciunit for this attribute, so the whole unpicklable check will be skipped if that attribute doesn't exist. But really, it should always exist, so I haven't solved the underlying problem (although your code might work now). That's why I'd like a recipe, so I can tackle this problem once and for all.
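A rough sketch of the kind of guard being described (purely illustrative; the attribute name `unpicklable` and the surrounding logic are assumptions on my part, not the actual sciunit code):

```python
def prune_unpicklable(model, state):
    """Hypothetical guard: skip the unpicklable handling entirely when the
    attribute is missing, instead of raising AttributeError."""
    for attr in getattr(model, 'unpicklable', []):
        state.pop(attr, None)  # drop attributes known not to survive pickling
    return state
```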
@russelljjarvis I pushed some new changes to scidash/neuronunit/dev that might help; they ensure that all objects can be found again on unpickling.
However, I realize there is another issue I need a better grasp of, which is: what kinds of things in model initialization (with the NEURON backend) should happen once, on the host or on one worker, and just be pushed passively to the others (e.g. simple attributes like a resting potential value), and what kinds of things need to be done independently on each worker? That will determine the structure and content of `set_backend`, `load_model`, etc. The nature of pickling requires that we put the first category of things in `__init__` (and functions it calls) and the second category in `__new__` (and functions it calls). This is because both `__new__` and `__init__` are called on model instantiation, but only `__new__` is called on unpickling, and `__new__` cannot simply call `__init__`, for various reasons. So in order to separate these things, it would help to understand which acts must occur separately for each worker (e.g. maybe `self.ns = nrn.NeuronSimulation(self.tstop, dt=0.0025)`), so I can make sure those things get walled off from the other things that can simply be copied via pickle/unpickle from one worker to the next. Can you help with that?
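For reference, here is a minimal, self-contained sketch of that split (a toy class, not the actual neuronunit code; all attribute names are invented) demonstrating that unpickling re-runs `__new__` but skips `__init__`:

```python
import pickle


class Model:
    """Toy stand-in for a neuronunit model; attribute names are invented."""

    def __new__(cls, *args, **kwargs):
        # Called on instantiation AND again on unpickling (protocol 2+),
        # so per-worker, unpicklable resources belong here.
        obj = super().__new__(cls)
        obj.ns = object()  # stand-in for e.g. nrn.NeuronSimulation(...)
        return obj

    def __init__(self, v_rest=-65.0):
        # Called ONLY on instantiation, never on unpickling, so anything set
        # here must travel inside the pickled state instead.
        self.v_rest = v_rest

    def __getstate__(self):
        # Exclude the per-worker resource from the pickled state.
        state = self.__dict__.copy()
        del state['ns']
        return state


m = pickle.loads(pickle.dumps(Model()))
assert m.v_rest == -65.0   # simple attribute copied via pickle
assert hasattr(m, 'ns')    # per-worker resource rebuilt by __new__
```

Under the default pickle protocol, reconstruction calls `cls.__new__(cls)` and then restores `__dict__` from the state, which is why the per-worker resource has to be rebuilt in `__new__` and excluded from the state.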
This seems like a very complicated issue. I will need to read over it several times.
@russelljjarvis I think I finally have a solution to all of this. At least I have gotten past all of the pickling errors, and my computer is running hot working on the optimization. I don't know if it is entirely fixed, but if it doesn't finish, or results in an error, that error doesn't seem to be related to simple pickling problems.
1) Rebuild your stacks from scidash/docker-stacks/master (I largely did not use the ParallelPyNeuron version, but will eventually need to add your changes in; you might be able to do it just from the ParallelPyNeuron version, but I haven't tested that). This rebuild should pull in the latest changes to the sciunit and neuronunit dev branches. Or you can mount them to `/home/jovyan/mnt/sciunit` and `/home/jovyan/mnt/neuronunit` if you want to work with the (updated) versions on your local machine.
2) Use the scidash branch of ParallelPyNeuron, where I have updated `reproduce_error/Dockerfile` to include a few things that are needed.
3) Check out the dev branch of rgerkin/informatics_poster and place it inside `ParallelPyNeuron/reproduce_error`, next to the Dockerfile. This will contain the updated `nsga_parallel_no_vm.py` file.
4) Do `docker build /path/to/ParallelPyNEURON/reproduce_error -t error`
5) Do `docker run -it error`
Optionally, add `-v /path/to/sciunit:/home/jovyan/work/sciunit` and `-v /path/to/neuronunit:/home/jovyan/work/neuronunit` to the lines above to use your local copies of sciunit and neuronunit, if they are updated.
This will at least get you past many of the annoying errors that have been plaguing us, but I don't know if it fully works yet, because I don't really know what to look for, other than the processor doing a lot of work and no errors about attributes not being found.
@russelljjarvis Just to update: the `nsga_parallel_no_vm.py` file did run to completion, and `models` and `pop` do contain values of the kind that a successful run would return, but I'm not entirely sure how to check this. What would be a good line to add to the end of the script to check that the optimization in that file ran successfully?
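One possibility, sketched here under the assumption that `pop` holds DEAP-style individuals with evaluated fitnesses (the name `pop` comes from the script; everything else is hypothetical):

```python
# Hypothetical end-of-script check: every individual should carry a valid,
# finite fitness if the NSGA optimization actually completed.
import math

assert len(pop) > 0, "optimization returned an empty population"
assert all(ind.fitness.valid for ind in pop), "some individuals were never evaluated"
best = min(pop, key=lambda ind: sum(ind.fitness.values))  # assumes minimization
assert all(math.isfinite(v) for v in best.fitness.values)
print("best summed error:", sum(best.fitness.values))
```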
@russelljjarvis Check that this solution (in the updated sciunit and neuronunit) works for you.
I think I have a working recipe, but I have not rigorously tested it for the above workflow.
That's funny, I didn't intend to edit your comment below.
I noticed that once you select issues assigned to you, you can now sort by precedence, i.e. by the highest number of replies and counter-replies. That was a good way to filter for this thread, which scored second.
Working on it.
Simplifications, after you verify that the above formula works:

- In scidash/neuronunit, make a fresh branch called ghissue94 from russelljjarvis/neuronunit@dev.
- Move russelljjarvis/parallelpyneuron/reproduce_error/Dockerfile to e.g. scidash/docker-stacks/ghissue94/Dockerfile, then optionally delete russelljjarvis/parallelpyneuron.
- Move nsga_parallel_no_vm.py and its local dependencies (get_neab.py and model_parameters.py and any others you find you need) from rgerkin/informatics_poster/neuronunit/optimization@dev to scidash/neuronunit/optimization@ghissue94.
- Edit scidash/docker-stacks/ghissue94/Dockerfile so that lines 10, 11, 12, and 15 reflect nsga_parallel_no_vm.py and its dependencies being in scidash/neuronunit/optimization rather than in rgerkin/informatics_poster/neuronunit/optimization.
- Rebuild from scidash/docker-stacks/ghissue94/Dockerfile.
- Merge russelljjarvis/neuronunit@dev into scidash/neuronunit@ghissue94 and see if it still works. For this you will probably want to mount your local copy of scidash/neuronunit@ghissue94 into /home/jovyan/work/neuronunit in Docker, so you can just edit that instead of rebuilding the Docker image every time.

russelljjarvis/neuronunit@dev then becomes obsolete, and we can just make another pull request from scidash/neuronunit@ghissue94 to integrate all the relevant parts of your code.
@rgerkin On this issue, I have realized that the reproduce_error Docker container and the associated code above are using one version of `backends.py` on the hub/controller and a different version on the engines. This is fine for pickling models and communicating them back to rank 0 (the hub/controller). The result is a list where `models[0]` is derived from the source `/home/jovyan/mnt/neuronunit/backends.py`, and `models[1:-1]` is derived from the source `/home/jovyan/neuronunit/neuronunit/models/backends.py`. Only `models[1:-1]` are transported from the engines/clients back to the hub/controller; `models[0]` does not cause a problem at this stage, because it does not go anywhere.
The way I investigated this issue was to use the code file you referred to for viewing stdout/stderr on the engines; see the bottom of this comment for that code. It's in a file called `stdout_worker.py`. I run it in the background with `python stdout_worker.py &` before running `ipython -i nsga_parallel_no_vm.py`.
With the code that's executed inside `dview.map_sync`: after the statement `from neuronunit import models`, if the next statement is `print(models.__file__)`, then it prints `/home/jovyan/neuronunit/neuronunit/models/__init__.py` on the engines/clients, and `/home/jovyan/mnt/neuronunit/models/__init__.py` or `/home/jovyan/informatics_poster/neuronunit/models/__init__.py` on the hub/controller, depending on which code I run from.
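Condensed into a standalone check (a sketch; it assumes an already-running ipcluster on the default profile), the mismatch can be seen like this:

```python
import ipyparallel as ipp

rc = ipp.Client(profile='default')
dview = rc[:]

def where_is_models():
    # Imported inside the function so each engine resolves it independently.
    import neuronunit.models as models
    return models.__file__

# Path seen by the hub/controller:
import neuronunit.models as models
print('controller:', models.__file__)

# Paths seen by each engine/client:
for engine_id, path in zip(rc.ids, dview.apply_sync(where_is_models)):
    print('engine %s: %s' % (engine_id, path))
```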
The upshot of all of this is that getting consistency between the hub's paths and the clients' paths remains an unsolved problem. I have attempted to circumvent it by only making the model on the hub (where the paths are what I intend/expect), and then communicating the models to the workers/clients. This approach works okay; however, I am getting a strange error message at the moment that I am trying to work through:
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/opt/conda/lib/python3.5/site-packages/ipyparallel/serialize/serialize.py in unpack_apply_message(bufs, g, copy)
    199         kwargs = {}
    200         for key in info['kw_keys']:
--> 201             kwarg, kwarg_bufs = deserialize_object(kwarg_bufs, g)
    202             kwargs[key] = kwarg
    203         assert not kwarg_bufs, "Shouldn't be any kwarg bufs left over"

/opt/conda/lib/python3.5/site-packages/ipyparallel/serialize/serialize.py in deserialize_object(buffers, g)
    130     bufs = list(buffers)
    131     pobj = buffer_to_bytes_py2(bufs.pop(0))
--> 132     canned = pickle.loads(pobj)
    133     if istype(canned, sequence_types) and len(canned) < MAX_ITEMS:
    134         for c in canned:

/opt/conda/lib/python3.5/copyreg.py in __newobj_ex__(cls, args, kwargs)
     92     keyword-only arguments to be pickled correctly.
     93     """
---> 94     return cls.__new__(cls, *args, **kwargs)
     95
     96 def _slotnames(cls):

/home/jovyan/neuronunit/neuronunit/models/__init__.py in __new__(cls, *args, **kwargs)
     44         """
     45         LEMS_file_path: Path to LEMS file (an xml file).
---> 46         name: Optional model name.
     47         """
     48         #print("Calling new")

/home/jovyan/neuronunit/neuronunit/models/__init__.py in set_backend(self, backend)
    109
    110         elif name is None:
--> 111             # The base class should not be called.
    112             raise Exception(("A backend (e.g. 'jNeuroML' or 'NEURON') "
    113                              "must be selected"))

/home/jovyan/neuronunit/neuronunit/models/backends.py in init_backend(self, attrs)
    130         self.h.load_file("stdgui.hoc")
    131         #self.h.cvode.active(1)
--> 132         self.orig_lems_file_path = None
    133         #pdb.set_trace()
    134         #self.h.cvode.active

AttributeError: 'ReducedModel with NEURON backend with NEURON backe' object has no attribute 'orig_lems_file_path'
engine.5.error:
```
"""A script for watching all traffic on the IOPub channel (stdout/stderr/pyerr) of engines.
This connects to the default cluster, or you can pass the path to your ipcontroller-client.json
Try running this script, and then running a few jobs that print (and call sys.stdout.flush),
and you will see the print statements as they arrive, notably not waiting for the results
to finish.
You can use the zeromq SUBSCRIBE mechanism to only receive information from specific engines,
and easily filter by message type.
Authors
-------
* MinRK
"""
import sys
import json
import zmq
from jupyter_client.session import Session
from ipykernel.connect import find_connection_file
def main(connection_file):
"""watch iopub channel, and print messages"""
ctx = zmq.Context.instance()
with open(connection_file) as f:
cfg = json.loads(f.read())
reg_url = cfg['interface']
iopub_port = cfg['iopub']
iopub_url = "%s:%s"%(reg_url, iopub_port)
session = Session(key=cfg['key'].encode('ascii'))
sub = ctx.socket(zmq.SUB)
# This will subscribe to all messages:
sub.SUBSCRIBE = b''
# replace with b'' with b'engine.1.stdout' to subscribe only to engine 1's stdout
# 0MQ subscriptions are simple 'foo*' matches, so 'engine.1.' subscribes
# to everything from engine 1, but there is no way to subscribe to
# just stdout from everyone.
# multiple calls to subscribe will add subscriptions, e.g. to subscribe to
# engine 1's stderr and engine 2's stdout:
# sub.SUBSCRIBE = b'engine.1.stderr'
# sub.SUBSCRIBE = b'engine.2.stdout'
sub.connect(iopub_url)
while True:
try:
idents,msg = session.recv(sub, mode=0)
except KeyboardInterrupt:
return
# ident always length 1 here
topic = idents[0].decode('utf8', 'replace')
if msg['msg_type'] == 'stream':
# stdout/stderr
# stream names are in msg['content']['name'], if you want to handle
# them differently
print("%s: %s" % (topic, msg['content']['text']))
elif msg['msg_type'] == 'error':
# Python traceback
c = msg['content']
print(topic + ':')
for line in c['traceback']:
# indent lines
print(' ' + line)
elif msg['msg_type'] == 'error':
# Python traceback
c = msg['content']
print(topic + ':')
for line in c['traceback']:
# indent lines
print(' ' + line)
if __name__ == '__main__':
if len(sys.argv) > 1:
pattern = sys.argv[1]
else:
# This gets the security file for the default profile:
pattern = 'ipcontroller-client.json'
cf = find_connection_file(pattern)
print("Using connection file %s" % cf)
main(cf)
#import nsga_parallel
Since last updating this thread, I may have encountered the source of some, but not all, of these problems. The way I established that the problem is an engine/controller inconsistency in the default directory was to make a very basic program, `path_check.py`:

```python
import os
import ipyparallel as ipp
from ipyparallel import require

rc = ipp.Client(profile='default')
dview = rc[:]

@require('os')
def path_consistency():
    return str(os.getcwd())

C_local = path_consistency()
C_remote = dview.apply(path_consistency).get()
assert C_local == C_remote[0]
```
It's possible that the line `os.system('export IPYTHONDIR={0}'.format(os.getcwd()))` might make the launch paths more robust.
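One caveat: `os.system` runs its command in a throwaway subshell, so an `export` there does not persist into the current process. A sketch of a variant that does persist, by setting the variable on the process environment so that children such as `ipcluster` inherit it:

```python
import os

# Set IPYTHONDIR for this process and any children it spawns;
# os.system('export ...') would only affect a short-lived subshell.
os.environ['IPYTHONDIR'] = os.getcwd()
```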
The path `C_local` will be the directory this Python program was launched from. The path `C_remote[0]` is the directory that the command `ipcluster start -n 8 --profile=default` was launched from, which could be an entirely different directory.
I had set the last line of the issue94 Dockerfile to be:

```
ENTRYPOINT ipcluster start -n 8 --profile=default & sleep 5 & /bin/bash & ipython stdout_worker.py &
```

The critical part is `ipcluster start -n 8 --profile=default`. Its default behavior is to configure the clients'/engines' working directory then and there, i.e. ``--workdir=`pwd` ``, where the value of `pwd` is of course the last value of WORKDIR in the Dockerfile.
I have decided this is not explicit enough, so rather than defining an entry point, I have instead created a bash alias that can be launched once you have navigated to your intended directory:

```
RUN echo "alias sc='export IPYTHONDIR=`pwd`; \
ipcluster start -n 8 --profile=default --workdir=`pwd` & \
sleep 5; python stdout_worker.py &'" >> ~/.bashrc
```
However, none of this explains why the workers' directories cannot be rewritten by editing `sys.path` inside a `dview.apply_sync` or a `dview.sync_imports()` call. My sense is that you just cannot update `sys.path` on the clients/engines in a persistent way, and the directory you launch ipcluster from is the one the engines are stuck with, which should be workable anyway.
The worst part of the situation is that you can, of course, update `sys.path` on the controller/hub, so models instantiated on the controller and models instantiated on the engines can be different types of objects (due to having different sources).
For different reasons, it might be desirable to create a view into the clients that excludes the controller anyway; the motivation is to leave the controller free from work, so it can better deal with interprocess communication.
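A sketch of that idea, assuming the process on engine 0 is the one standing in for the hub/controller in the scheme above:

```python
import ipyparallel as ipp

rc = ipp.Client(profile='default')
workers = rc[1:]  # a view over every engine except engine 0, leaving it
                  # free for coordination and interprocess communication
result = workers.apply_sync(lambda: 'heavy work happens here')
```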
Closing this because it is either fixed or obsolete with the current version of the Backend implementation.
Pickling is done by ipyparallel; it's necessitated because this is the way workers get method arguments in a call to
Interestingly, it's only `suite.judge` that can't be passed between workers using serialization; `suite.test` is fine.
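A quick way to verify that kind of asymmetry (a sketch; `suite` here is whatever sciunit TestSuite object is in play, and the results in the comments are the ones reported above, not re-verified):

```python
import pickle

def survives_transport(obj):
    """Return True if obj would survive the pickle round-trip ipyparallel relies on."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False

# Per the observation above (hypothetical usage):
# survives_transport(suite.test)   -> True
# survives_transport(suite.judge)  -> False
```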