pybamm-team / PyBaMM

Fast and flexible physics-based battery models in Python
https://www.pybamm.org/
BSD 3-Clause "New" or "Revised" License

[Bug]: Casadi solver hangs on infeasible parameter combinations with 'surface form = differential' #2173

Open bardsleypt opened 2 years ago

bardsleypt commented 2 years ago

PyBaMM Version

22.6

Python Version

3.9.12

Describe the bug

Upon instantiating a DFN model with the 'surface form': 'differential' option, the Casadi solver can occasionally get stuck during setup if a strange parameter set is passed in to the simulation. Moreover, this hanging also seems to lock up control of the computational thread: I can't elegantly time the process out (e.g., by wrapping it in a thread+queue+timeout routine). The scikits solver does not encounter this issue and 'correctly' errors out without locking up the thread. The hang also only occurs when the surface form option is set; for the standard DFN the Casadi solver doesn't encounter the issue. Is this a known limitation of Casadi?

I've attached the MWE below; it seems pretty easy to reproduce. I've tested this on both Mac and Linux, and both encounter the same hanging issue.

Steps to Reproduce

import pybamm
pybamm.set_logging_level('INFO')

model = pybamm.lithium_ion.DFN(options={'surface form': 'differential'})
# model = pybamm.lithium_ion.DFN()
params = pybamm.ParameterValues('Ramadass2004')
params.update({'Negative electrode active material volume fraction': 0.6, 
               'Negative electrode porosity': 0.1})
solver = pybamm.CasadiSolver()
# solver = pybamm.ScikitsDaeSolver()

exp = pybamm.Experiment(['Charge at C/2 until 4V'])

sol = pybamm.Simulation(model, experiment=exp, parameter_values=params).solve(solver=solver, initial_soc=0)

A zipped version of the ipynb is attached here as well: stalling_behavior.ipynb.zip

Relevant log output

Following is a copy/paste of my logging output after ~10 minutes.

2022-07-14 15:38:29.316 - [INFO] parameter_values.process_model(380): Start setting parameters for Electrode-specific SOH model
2022-07-14 15:38:29.398 - [INFO] parameter_values.process_model(483): Finish setting parameters for Electrode-specific SOH model
2022-07-14 15:38:29.399 - [INFO] discretisation.process_model(137): Start discretising Electrode-specific SOH model
2022-07-14 15:38:29.408 - [INFO] discretisation.process_model(254): Finish discretising Electrode-specific SOH model
2022-07-14 15:38:29.409 - [INFO] base_solver.solve(880): Start solving Electrode-specific SOH model with Algebraic solver (lm)
2022-07-14 15:38:29.409 - [INFO] base_solver.set_up(109): Start solver set-up
2022-07-14 15:38:29.419 - [INFO] base_solver.set_up(730): Finish solver set-up
2022-07-14 15:38:29.422 - [INFO] base_solver.solve(1153): Finish solving Electrode-specific SOH model (success)
2022-07-14 15:38:29.422 - [INFO] base_solver.solve(1154): Set-up time: 10.416 ms, Solve time: 2.183 ms (of which integration time: 1.841 ms), Total time: 12.599 ms
2022-07-14 15:38:29.424 - [INFO] callbacks.on_experiment_start(166): Start running experiment
2022-07-14 15:38:29.425 - [INFO] parameter_values.process_model(380): Start setting parameters for Doyle-Fuller-Newman model
2022-07-14 15:38:29.493 - [INFO] parameter_values.process_model(483): Finish setting parameters for Doyle-Fuller-Newman model
2022-07-14 15:38:29.493 - [INFO] discretisation.process_model(137): Start discretising Doyle-Fuller-Newman model
2022-07-14 15:38:29.789 - [INFO] discretisation.process_model(254): Finish discretising Doyle-Fuller-Newman model
2022-07-14 15:38:29.809 - [NOTICE] callbacks.on_cycle_start(174): Cycle 1/1 (19.930 ms elapsed) --------------------
2022-07-14 15:38:29.809 - [NOTICE] callbacks.on_step_start(182): Cycle 1/1, step 1/1: Charge at C/2 until 4V
2022-07-14 15:38:29.810 - [INFO] base_solver.set_up(109): Start solver set-up
valentinsulzer commented 2 years ago

I don't know how to stop the casadi solver once it's stuck, but what you could do to avoid it getting stuck is reduce the period of the experiment (this will come at the cost of performance)

bardsleypt commented 2 years ago

Thank you for the suggestion. Unfortunately I'm running simulations (optimizations) that require certain precisions to be maintained, so I am unable to modify the sampling period much.

I have found a workaround for my purposes: if the solver hangs I don't want to continue with that parameter set anyway, so I simply time out the simulation. For anyone who may require something similar, I can provide details (just ping me), but the basic idea is to wrap the solve call in a Process, send its outputs through a queue, and time out the queue read after a desired duration. Crucially, threads did not work in this case, as the hanging Casadi solver seems to entirely lock up the computational thread, so timing the thread out did not work. There is always the possibility I did something wrong here, though.

Also, some interesting behavior I noticed: this hanging bug only appeared for me when I enabled the double-layer option {"surface form": "differential"}. The same parameter set simply causes a run-time error for the standard DFN model. However, enabling another sub-model also seemed to help avoid the hanging issue. For example, with SEI enabled as well, {"surface form": "differential", "SEI": "constant"}, the solver raises the run-time error instead of hanging.

I will leave this bug report open as I don't know how to correct this behavior (it seems more Casadi-related than PyBaMM-related), but for my purposes I have found an appropriate workaround. If the moderators deem it appropriate, we can also close this report. As always, happy to discuss and confer further.

Best, Patrick

valentinsulzer commented 2 years ago

Linking https://github.com/pybamm-team/PyBaMM/discussions/2468

valentinsulzer commented 2 years ago

@bardsleypt , can you share your workaround for timeout?

I think reducing dt_max should also help with this problem. Also, different solvers (scipy, scikit-ode) should work, but they will be slower.

The issue is internal to CasADi, so it's hard to fix on our end other than with these workarounds. We're in touch with the CasADi developers, though, so it might get fixed properly in the future.
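For reference, both mitigations mentioned in this thread (a smaller dt_max, a shorter experiment period) can be set directly when constructing the solver and experiment. A configuration sketch; the numeric values are illustrative, not tuned:

```python
import pybamm

# A smaller dt_max forces the CasADi solver to take shorter internal steps,
# so an infeasible problem tends to fail fast rather than hang (at some
# performance cost).
solver = pybamm.CasadiSolver(dt_max=30)  # seconds; illustrative value

# Reducing the experiment period has a similar effect at the experiment level.
exp = pybamm.Experiment(['Charge at C/2 until 4V'], period='10 seconds')
```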

bardsleypt commented 2 years ago

Sure, I'll upload my solution tomorrow. But as a warning, it's not really a workaround for this hanging problem in general, just a workaround for my particular case. Essentially I time out the solve and return np.nan after a certain period of time. I'll try to trim everything down to a minimal reproducible example and share it here.

bardsleypt commented 2 years ago

Alright, as promised, here is my timeout workaround. There are probably packages that could be leveraged to do something similar (e.g., thread-based ones), but this at least accomplishes the timeout, as long as the processes are 'forked' so they get copies of the local variables.

I've also had to wrap this type of functionality within parallel wrappers (in Dask, for example). When I did this, I encountered problems pickling/serializing some of the PyBaMM objects (SwigPy objects, to be more specific). Another workaround for that issue was to use the multiprocess package instead of the multiprocessing package in the definition of the TimeoutFunc class. I prefer the multiprocessing package as it's more common, which is why I've included it in the attached code. In case the pickling workaround is needed, commenting/uncommenting the respective imports should do the trick (along with installing the non-built-in multiprocess package).

Please let me know if you need any clarification on anything here. I'm happy to assist on a PR for something like this, but honestly I don't know the best place to inject something like this in the PyBaMM codebase.

Best, Patrick

import pybamm
from multiprocessing import Queue, Process, set_start_method
from queue import Empty
# from multiprocess import Queue, Process, set_start_method
# from multiprocess.queues import Empty
pybamm.set_logging_level('INFO')

# Define TimeoutFunc wrapper class to timeout threads,
# can also create a decorator function for this wrapping
class TimeoutFunc(object):
    """
    TimeoutFunc is a callable class to wrap a function and provide timeout functionality

    Usage:
    def foo(x):
        time.sleep(100)

    timed_foo = TimeoutFunc(foo, timeout=10)
    timed_foo(2)
    """
    def __init__(self, func, timeout=None, timeout_val=None):
        """
        Instantiate callable class, essentially a wrapped function providing timeout
        functionality
        :param func: callable method
        :param timeout: None or time in seconds (float) until the wrapped
            function is aborted
        :param timeout_val: value returned when the call times out
        """
        assert callable(func), 'Positional argument 1 must be a callable method'
        self.func = func
        self.timeout = timeout
        self.timeout_val = timeout_val

    def _queued_f(self, *args, queue, **kwargs):
        """
        Private function, executes function call and places results in a FIFO queue
        :param args: target function positional arguments
        :param queue: FIFO queue
        :param kwargs: target function keyword arguments
        :return: None, results placed in FIFO queue
        """
        result = self.func(*args, **kwargs)
        try:
            queue.put(result)
        except BrokenPipeError:
            # The parent may have already terminated us and closed the queue
            pass

    def __call__(self, *args, **kwargs):
        """
        Provides calling functionality of wrapped method
        :param args: target function positional arguments
        :param kwargs: target function keyword arguments
        :return: target function return
        """
        q = Queue(1)
        p = Process(target=self._queued_f, args=args,
                    kwargs={**kwargs, 'queue': q})
        p.daemon = True
        p.start()
        try:
            # Block for specified timeout or function return,
            # if q still empty after timeout then cancel execution and return None
            result = q.get(timeout=self.timeout)
        except Empty:
            p.terminate()
            result = self.timeout_val
        return result

if __name__ == '__main__':
    set_start_method('fork')

    # Define function for running the sim, which can be wrapped in TimeoutFunc
    def run_sim():
        # Set up model, parameters, experiment which causes Casadi solver to hang
        model = pybamm.lithium_ion.DFN(options={'surface form': 'differential'})
        params = pybamm.ParameterValues('Ramadass2004')
        params.update({'Negative electrode active material volume fraction': 0.6,
                       'Negative electrode porosity': 0.1})
        exp = pybamm.Experiment(['Charge at C/2 until 4.2V'])
        solver = pybamm.CasadiSolver()

        return pybamm.Simulation(model=model, experiment=exp, parameter_values=params).\
            solve(solver=solver, initial_soc=0)

    # Define timeout version of sim function
    timeout_sim = TimeoutFunc(run_sim, timeout=2, timeout_val='I timed out')

    # Uncomment to observe hanging behavior
    # value = run_sim()

    # Run timeout version
    value = timeout_sim()
    print(f'Value after timeout: {value}')
MichaPhilipp commented 2 years ago

I'm experiencing a very similar bug where the Casadi solver hangs on an infeasible parameter combination. However, it doesn't depend on the 'surface form' in my case, in which I simulate SEI growth in a cell with very low initial SoC. I realised that Casadi hangs once the SEI would consume more Li than is initially stored in the anode (negative electrode SOC drops close to/equal to/below 0). Maybe this also happens for another quantity in your case.

I tried to create a workaround by introducing a termination event to stop once the concentration in the anode drops below a critical value. This may not be ideal because the critical value needed (to stop the simulation before Casadi crashes) is relatively high, depending on the amount of Li converted into SEI each timestep. For more complex cycles one additionally has to delete the event so that it does not terminate regular discharging protocols too early.
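A rough sketch of such a termination event, assuming PyBaMM's `model.events` list and `pybamm.EventType.TERMINATION`; the variable name and the 0.05 cutoff below are illustrative, not tuned values:

```python
import pybamm

model = pybamm.lithium_ion.DFN()
# Illustrative guard: terminate once the (dimensionless) surface
# concentration in the negative particle approaches zero. A termination
# event fires when its expression crosses zero from above.
c_surf = model.variables["X-averaged negative particle surface concentration"]
model.events.append(
    pybamm.Event(
        "Anode concentration cutoff [illustrative]",
        pybamm.min(c_surf) - 0.05,
        pybamm.EventType.TERMINATION,
    )
)
```

The event has to be appended before the model is handed to `pybamm.Simulation`, and (as noted above) removed again for protocols where low anode concentration is expected.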

@bardsleypt thanks for sharing your timeout approach! Sadly it does not work in my case. I'm also not able to reproduce a correct timeout from your MWE. I observe the hanging behavior, but using value = timeout_sim() will not run the simulation or the run_sim() function at all. This may be due to the fact that I'm working on Windows and therefore can't use set_start_method('fork'). Trying set_start_method('spawn') instead results in a timeout after the desired number of seconds, again without running the simulation/function.

bardsleypt commented 2 years ago

@MichaPhilipp, sorry to hear the timeout didn't work for you. I'm not all that familiar with running Python in a Windows environment. You might have luck using the multiprocess package instead of multiprocessing; I actually don't need to explicitly set the start method when using multiprocess. However, I think under the hood it is setting the start method to 'fork' anyway.

But it might be worth a shot in your case: you'll need to install the non-default multiprocess package, and then switch the imports from my MWE above.

RuiheLi commented 1 year ago

@bardsleypt Hi Patrick, I also encounter similar problems when doing parameter sweeps in degradation simulations. The script you shared is a wonderful workaround, thanks so much! I understand that the key part of your script is the try...except in the __call__ function, which tries to get the result of the simulation within a predefined time. I wonder how I can also handle other possibilities such as pybamm.expression_tree.exceptions.ModelError and pybamm.expression_tree.exceptions.SolverError, as they are also likely to happen. Can I just write the __call__ function like except (pybamm.expression_tree.exceptions.ModelError, pybamm.expression_tree.exceptions.SolverError, Empty) as exc:, or is there some other way around this?

Big thanks in advance!

RuiheLi commented 1 year ago

I think I have found some solutions for my previous questions.

Now another question: based on Patrick's workaround, one process is created per simulation (apologies if I describe this wrong, I am not very familiar with coding). Now I want to do parameter sweeping, so I wrap Patrick's workaround scripts into my main function Run_model, and then call the main function in parallel:

if __name__ == "__main__":
    pool = multiprocessing.Pool(6)
    processes = [pool.apply_async(Run_model, args=(Para_dict_i,))
                 for Para_dict_i in Para_dict_list]
    result = [p.get() for p in processes]

This gives the error AssertionError: daemonic processes are not allowed to have children. I think this is because multiprocessing doesn't allow creating another set of processes inside one of its worker processes.

Does anyone have any further ideas or suggestions on this? Many thanks!

bardsleypt commented 1 year ago

@RuiheLi sorry for the slow response; I'm glad to hear you were able to get it working for the other Exception classes you mentioned above. In the past, I have used nested try/except solutions: I would create a function handle that executed the simulation and call that function within a try/except, where I would catch ONLY the pybamm exceptions, similar to the ones you mentioned. Then I would wrap this try/except itself in a new function handle and inject that into the TimeoutFunc class, which handles the try/except for the timeout error. It all depends on what you are trying to accomplish, but this worked for me. You could always just catch bare Exception (though most pythoneers, myself included, will say this isn't best practice).
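A minimal sketch of that nested try/except pattern. `FakeSolverError` is a stand-in for pybamm's exception classes (e.g. pybamm.SolverError); nothing here is PyBaMM-specific:

```python
import math

class FakeSolverError(Exception):
    """Stand-in for pybamm.SolverError / pybamm.ModelError."""
    pass

def catch_known_errors(func, error_val=math.nan):
    """Wrap func so that only the expected solver/model errors are
    converted into a sentinel value; any other exception still propagates,
    so real bugs are not silently swallowed."""
    def wrapped(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except FakeSolverError:
            return error_val
    return wrapped

# The wrapped callable can then itself be handed to TimeoutFunc, which
# separately converts a timeout (queue.Empty) into its own sentinel, e.g.:
# timed = TimeoutFunc(catch_known_errors(run_sim), timeout=600,
#                     timeout_val=math.nan)
```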

As for your second question, I've run into those same issues. My solution has been to either use the multiprocess package rather than multiprocessing (you don't get the same error... usually), and/or to use dask for the parallel wrapping of the parameter sweep. I can probably help more if you can share a small script that reproduces the error.
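For anyone hitting the same AssertionError: the rule behind it is that a daemonic process (which is what multiprocessing.Pool workers are) may not start children, while a non-daemonic process may. A self-contained sketch, assuming a POSIX system where the default start method is 'fork'; `inner_job` and `outer_job` are illustrative stand-ins, not PyBaMM functions:

```python
from multiprocessing import Process, Queue

def inner_job(q):
    # Stands in for one timed-out simulation
    q.put(42)

def outer_job(q):
    # Stands in for Run_model: it starts its own child process. This is
    # only legal because outer_job runs in a NON-daemonic process; a
    # daemonic Pool worker raises "daemonic processes are not allowed to
    # have children" at child.start().
    inner_q = Queue(1)
    child = Process(target=inner_job, args=(inner_q,))
    child.start()
    q.put(inner_q.get(timeout=30))
    child.join()

def run_nested():
    q = Queue(1)
    outer = Process(target=outer_job, args=(q,))
    outer.daemon = False  # the default; shown explicitly for contrast
    outer.start()
    result = q.get(timeout=30)
    outer.join()
    return result

if __name__ == '__main__':
    print(run_nested())
```

This is also why swapping the outer Pool for something whose workers are non-daemonic (or driving the sweep with dask, as suggested above) avoids the error.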