Closed by rgerkin 6 years ago
Note also that this might not currently work, due to passive tests failing because of neo/python3.5 interaction problems.
When we tried it out together, it appeared that the parallel rheobase test was working fine, but the other parallel part (running the other tests) was hanging. It seemed to work fine when that part was made serial. Can you follow up on this and figure out why it isn't working? I think that everything may work after this is figured out.
I currently have a branch that is working some of the time. This Travis build shows a particular error in reading from get_neab.py; it occurs sporadically on different CPUs. https://travis-ci.org/russelljjarvis/neuronunit#L3460-L3709 I am getting the sense that the CPU running the job just waits there while the rest of the CPUs continue on. Eventually the list dtcpop will be missing dtc objects when all the CPUs collate/gather their values back to CPU 0.
One workaround might involve storing the tests inside the dtc objects. This is problematic when the tests contain HOC variables, but it may be possible to do as soon as the test suite object has been initialized, because at that point it should not contain related_data, models, or predictions derived from HOC vectors.
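The workaround above can be sketched as a picklability gate: attach the tests to the transport container only while they can still cross process boundaries. This is a minimal illustration, not the project's actual API; `DataTC` and `attach_tests` are stand-in names, and `pickle.dumps` is used as a proxy for "free of HOC handles".

```python
import pickle

class DataTC(object):
    """Minimal stand-in for the data-transport container (dtc)."""
    pass

def attach_tests(dtc, tests):
    """Attach tests to a dtc only if they can be shipped between CPUs.

    A test suite that already holds HOC vectors (related_data, models,
    predictions) will fail to pickle, so we check before attaching.
    """
    try:
        pickle.dumps(tests)
    except Exception:
        raise ValueError("tests hold unpicklable (e.g. HOC) state")
    dtc.tests = tests
    return dtc
```

The check is cheap relative to a simulation run, and failing loudly here is preferable to one CPU silently hanging during a gather.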
I believe I now have a build that runs reliably in the cloud. My solution was to eradicate all sources of parallel file writes (basically any dump(s) statement, even json dump(s)).
https://travis-ci.org/russelljjarvis/neuronunit
This meant editing neuronunit/neuroelectro.py such that dump and dumps were eradicated, and I also did git add neuroelectro.* so that neuroelectro.dir and neuroelectro.bak are available to the version of neuronunit that Travis installs.
The problem occurred only sporadically, since not every parallel file write leads to a deadlock, but it is a common enough occurrence that it renders the optimizer unreliable.
To test, you can just edit nparams and npoints in test_exhaustive_search https://github.com/russelljjarvis/neuronunit/edit/test_branch/unit_test/test_exhaustive_search.py#L24-L25 and then check back at https://travis-ci.org/russelljjarvis/neuronunit later. I have noticed that the Travis implementation is not as fast as running locally on native Linux. I should also point out that the last job I configured through GH/Travis used:
```python
npoints = 3
nparams = 3
```
so there is a real possibility that travis will time out trying to complete such a big job.
Maybe we can wrap all file-writing commands in a special sciunit version (e.g. use a new sciunit.json.dump instead of json.dump) which can be context-aware, using a global flag that indicates whether the write should be skipped; we can set this flag whenever we do any parallel jobs.
Also, you referred to both dump and dumps as sources of problems, but dumps just writes to a string, not to a file. Are you sure that dumps is also a problem?
I am not actually sure that it is the dumps lines rather than just dump; I will check more thoroughly tomorrow. I know the problem is with the writing operations in the neuroelectro.py file generally, but I agree with you that dumps should just be a memory operation. I think the last thing I changed in that file before it worked was json.dump. I feel like there is a possibility that somehow a shared memory location is being corrupted by simultaneous writing.
I agree that there needs to be a context-aware way of doing these file writes. Also, it's not that parallel code can't do any file writing; there are different ways of handling file writing in a distributed context, but they all pretty much involve getting all the CPUs to coordinate as if they were in a serial sequence.
It's tricky that get_neab.tests can't (in its current form) be transported between CPUs, but also that instantiating new get_neab.tests objects involves some very minor file writes, or some kind of shared-memory complication as alluded to above.
I have confirmed that Travis runs a bigger nparams=3, npoints=3 job. Although the job status was errored, it actually completed the script and then waited idle; I just neglected to end the script with exit(). I think I have probably been experiencing this bug for a long time, but I was mis-attributing it to other causes.
It is just re-running again here: https://travis-ci.org/russelljjarvis/neuronunit/builds/298314871?utm_source=email&utm_medium=notification#L7814
Also, I wrote the above late at night; there is another way to make parallel writing safe. You can use the approach described here: https://github.com/scidash/neuronunit/issues/135
I believe the way to do parallel-safe file writing is to make temp files whose names depend on unique CPU labels; since each CPU writes to a different temp file, none of the CPUs can corrupt the others' files. The only catch is that the file reads would have to depend on the same naming scheme.
I have recently changed the subset code such that you can give a list of parameter keys, instead of being forced into using specific keys due to a default pattern of enumerating over the model_params dictionary.
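In spirit, the changed subset code behaves like the helper below: callers pass an explicit list of keys rather than getting whatever a fixed enumeration of model_params yields. This is a hedged reconstruction; `create_subset` and its signature are illustrative, not necessarily the names in neuronunit.optimization.

```python
def create_subset(model_params, keys=None):
    """Return only the requested parameter ranges.

    model_params: dict mapping parameter name -> range of values.
    keys: optional list of parameter names; defaults to all of them,
    preserving the old enumerate-everything behaviour.
    """
    if keys is None:
        keys = list(model_params.keys())
    return {k: model_params[k] for k in keys}
```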
@russelljjarvis What is the status of this goal?
All of this is working, I believe. My strategy is to read and write files as few times as possible.
This is the exhaustive search script we were discussing on Thursday.
It might be a good idea to edit line 77 to be consistent with the following two lines.
npoints = 1
nparams = 1
Just to verify on your machine that the script produces the expected output on a 1-parameter, 1-sample search. Also, I have recently added the line below, which filters out models where the rheobase current injection value is not greater than 0 pA. I used the built-in filter function so the code would be more explicit.
```python
filtered_dtcpop = list(filter(lambda dtc: dtc.rheobase['value'] > 0.0, dtcpop))
```
All of the functionality is in the script now: you can choose how many sample points are used per parameter (with the npoints parameter), and you can choose the number of parameters that are searched with nparams. For example:
These are the entry points I use with docker to run the script: https://github.com/russelljjarvis/ParallelPyNEURON/blob/master/dev/on_entry_point.sh
```python
setup(
    name='neuronunit',
    version='0.1.8.9',
    author='Rick Gerkin',
    author_email='rgerkin@asu.edu',
    packages=[
        'neuronunit',
        ...
        'neuronunit.optimization',
    ],
)
```