Lock less aggressively slot_map_mtx_, so sending a new goal when canceling doesn't freeze MBF

magazino / move_base_flex

Move Base Flex: a backwards-compatible replacement for move_base

https://uos.github.io/mbf_docs/

BSD 3-Clause "New" or "Revised" License

434 stars 154 forks source link

Lock less aggressively slot_map_mtx_, so sending a new goal when canceling doesn't freeze MBF #273

Closed corot closed 3 years ago

corot commented 3 years ago

We are locking the access to concurrencyslots map when doing any operation, as start and cancel. That's not needed. Interators to map don't get invalidated by adding new entries to the map, and we are not removing concurrency slots other than in shutdown.

With the current master, if we receive a new goal while performing a lengthy cancel (e.g. because we want to handle a smooth stop, as requested on this PR), MBF will freeze, as we lock twice slot_mapmtx, here and here.

With this PR, we accept the new goal, wait for the previous cancel to complete, and then execute the new goal.

dorezyuk commented 3 years ago

I like the idea, but the mutex also protected us from concurrent access to the members of the iterator.

From the quick look we might for example run into problems when calling start and cancel "concurrently" (assuming we first call start).

// inside AbtractActionBase::start
// if slot_it is newly created the execution is nullptr
      slot_it->second.execution = execution_ptr;

// inside AbtractActionBase::cancel
    concurrency_slots_[slot].execution->cancel();

would be a nullptr deref.

If its not a nullptr, we still might call cancel on a execution, which hasn't started and would then miss the cancel

corot commented 3 years ago

If its not a nullptr, we still might call cancel on a execution, which hasn't started and would then miss the cancel

check added. we can take care of the possible corner cases; but getting frozen as now is not an option! :skull:

dorezyuk commented 3 years ago

@corot Do you mean with "freeze" as in deadlock (the node then just sits there and never recovers)? If yes, can you explain how this is happening?

corot commented 3 years ago

@corot Do you mean with "freeze" as in deadlock (the node then just sits there and never recovers)? If yes, can you explain how this is happening?

Yep, the node freezes and don't even react to rosnode kill signal. To be honest, I cannot explain how this happens, because even when we return the cancel, we don't recover (I would expect that the new goal starts then, as the lock is released,,, but somehow this doesn't happen! we never release the lock)

Reproducing is easy, though. Implement the cancel method of your controller, e.g.

  bool cancel() {
    for (int i = 0; i < 50; ++i) {
      ROS_INFO("cancel %d...",i+1/5);
      ros::spinOnce();
      ros::Duration(0.1).sleep();

    }
    ROS_INFO("cancel DONE!");
    return true;
  };

The spinOnce is important, because otherwise MBF won't call start and cancel simultaneously. This sounds like solving the problem, but then your controller won't get ANY info, e.g. velocity feedback.

Then u send any path to exe_path, cancel it and send it again within 5 seconds

corot commented 3 years ago

After reconsidering this PR, I have another proposal for handling smooth cancel, without the need of blocking the cancel callback #274. Early feedback more than welcome!! :pray:

corot commented 3 years ago

Closed in favor of #274