stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Allow some chains in multi chain samplers to fail during initialization #3275

Closed SteveBronder closed 3 months ago

SteveBronder commented 3 months ago

Summary:

Currently, if any chain fails to initialize in the multi-chain samplers, the entire program fails and nothing is returned. It would be nicer to move the initialization code inside the parallel_for loop so that, even if some chains fail, the remaining chains can still sample. Setting good initial values for a model is sometimes difficult, so it can be useful to allow some chains to fail initialization while others succeed.

If we do this, we should think of a way to write to the output file to indicate that the chain failed to initialize.

On the other hand, forcing good inits means that users' models start in a nicer place and will most likely sample better. But that could also backfire: users could set all their chains to start at the same location. So it may be better to let users spawn many chains and let some fail.

I think doing this would be pretty easy. We would need to modify the code below in all of the samplers from

    for (int i = 0; i < num_chains; ++i) {
      rngs.emplace_back(util::create_rng(random_seed, init_chain_id + i));
      cont_vectors.emplace_back(util::initialize(
          model, *init[i], rngs[i], init_radius, true, logger, init_writer[i]));
     // ...

to

    for (int i = 0; i < num_chains; ++i) {
      auto rng = util::create_rng(random_seed, init_chain_id + i);
      try {
        cont_vectors.emplace_back(util::initialize(
            model, *init[i], rng, init_radius, true, logger, init_writer[i]));
      } catch (const std::exception& e) {
        // Warn rather than rethrow so the remaining chains can proceed.
        logger.warn("information about failed chain init here");
        continue;
      }
      rngs.push_back(std::move(rng));
      successful_chain_ids.push_back(i);
      // ...

Then for the parallel_for we do

    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, successful_chain_ids.size(), 1),
        [stuff, &successful_chain_ids](const tbb::blocked_range<size_t>& r) {
          for (size_t i = r.begin(); i != r.end(); ++i) {
            auto idx = successful_chain_ids[i];
            // rest of the code uses idx instead of i
          }
        });

Current Output:

Program terminates if any chains in multi chain sampling fail

Expected Output:

If any chains succeed we continue and allow the successful chains to sample.


Current Version:

v2.34.1

WardBrian commented 3 months ago

This, alongside your comment in CmdStanR describing this as a disadvantage of the single-process multi-chain feature, made me look up what we do in cmdstanpy: we do not return samples if any chain fails (even in the legacy one-chain-per-process setup).

Assuming we keep that behavior, it makes no difference to us what the underlying implementation does, besides perhaps wasting time if some chains still run.

One question, I suppose: what is the return code of the process when only some chains fail to init?

bob-carpenter commented 3 months ago

First, this is usually a bug in the Stan code. It's really common to miss the constraints required for a parameter, like forgetting to give a scale parameter a lower bound of zero. That gives each random init a roughly 50-50 chance of being valid. In those situations, we want to signal that the Stan code needs to be fixed.
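For illustration, a hedged sketch of the kind of bug Bob describes (the model itself is made up). With the unconstrained `sigma`, random inits drawn uniformly on (-2, 2) are negative about half the time and the density is rejected; with the lower bound, Stan samples `log(sigma)` and every init is valid:

```stan
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real sigma;              // bug: no lower bound, so ~50% of random inits fail
  // real<lower=0> sigma;  // fix: sampled on the log scale, always positive
}
model {
  y ~ normal(mu, sigma);
}
```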

Second, I don't think it'll be good for users to run a fraction of the chains they request. Maybe if some chains initialize, keep trying the other ones? But if a user wants 4 chains and they only get 3, they're going to have to run again.

In general, I think we should warn on bugs rather than try to cover them up.

The one place where init can be genuinely hard is in something like a high-dimensional regression. In those situations, I've found that narrowing the random initialization interval from (-2, 2) gets rid of the problem.

SteveBronder commented 3 months ago

I'm leaning toward 'no' now that I'm thinking about this more. Bob, I agree: usually if I'm getting inits that bad, I don't think running chains over and over is going to fix it, and I probably need to think more about the model. Going to close this.