ros-industrial / ros2_canopen

CANopen driver framework for ROS2
https://ros-industrial.github.io/ros2_canopen/manual/rolling/

SDO requests called from event callback result in a deadlock #156

Open bjsowa opened 1 year ago

bjsowa commented 1 year ago

I mentioned this in #146, but I think it deserves a separate issue.

In event mode, SDO requests (using the universal_set_value and universal_get_value methods) called from poll_timer_callback do not work. In my case, the first request results in a timeout, and the second one hangs on this line: https://github.com/ros-industrial/ros2_canopen/blob/master/canopen_base_driver/include/canopen_base_driver/lely_driver_bridge.hpp#L443, waiting for the sdo_cond condition variable to be notified.

hellantos commented 1 year ago

This is not a bug but expected behavior.

Event mode requires the poll timer callback to be non-blocking and to return immediately (real-time behavior). The poll timer callback is executed in the CANopen event loop and triggered by the sync event. Because SDO is a handshake protocol, every SDO call blocks; this blocks the event loop and thus hangs the whole thing.

Event mode is more of an advanced feature for when you need high performance (it leads to ~60% less CPU usage on our PCs). However, it requires the user to make sure that only process data objects are exchanged in the poll timer callback. For high-performance setups this should be the case anyway, to avoid the non-determinism introduced by SDOs.

We'll probably need to add some documentation about which objects have to be mapped to PDOs to make this event mode work.

bjsowa commented 1 year ago

Event mode requires the poll timer callback to be non-blocking and to return immediately (real-time behavior). The poll timer callback is executed in the CANopen event loop and triggered by the sync event. Because SDO is a handshake protocol, every SDO call blocks; this blocks the event loop and thus hangs the whole thing.

Thanks for clarifying that. Submitting SDO requests from the event loop indeed blocks the thread and prevents the SDO request from being processed. I still don't believe this should be expected behavior.

As a proof of concept, I wrote a synchronous SDO write method that can be used from the fiber driver's executor task:

  // Synchronous SDO write usable from the fiber driver's executor task.
  // Instead of blocking on a condition variable (which would stall the
  // event loop), it submits the write asynchronously and yields the
  // calling fiber until the result is available.
  template <typename T>
  bool sync_sdo_write_typed_from_task(
    uint16_t idx, uint8_t subidx, T value, std::chrono::milliseconds timeout)
  {
    lely::COSub * sub = this->dictionary_->find(idx, subidx);

    if (sub == nullptr)
    {
      std::cout << "sync_sdo_write_typed_from_task: id=" << (unsigned int)this->get_id()
                << " index=0x" << std::hex << (unsigned int)idx
                << " subindex=" << (unsigned int)subidx
                << " object does not exist" << std::endl;
      return false;
    }

    // Submit the SDO request without blocking ...
    lely::canopen::SdoFuture<void> fut = this->AsyncWrite(idx, subidx, value, timeout);

    // ... and suspend the current fiber until the future completes.
    std::error_code ec;
    this->Wait(fut, ec);

    if (ec)
    {
      std::cout << "sync_sdo_write_typed_from_task: id=" << (unsigned int)this->get_id()
                << " index=0x" << std::hex << (unsigned int)idx
                << " subindex=" << (unsigned int)subidx
                << " timed out." << std::endl;
      return false;
    }

    // Mirror the successfully written value into the local dictionary cache.
    {
      std::scoped_lock<std::mutex> lck(this->dictionary_mutex_);
      this->dictionary_->setVal<T>(idx, subidx, value);
    }

    return true;
  }

Using this method doesn't seem to hang the event loop. The problem is that I haven't found a way to check whether something is executed from the executor task or from a separate thread.

hellantos commented 1 year ago

Hi, that is a good idea; honestly, I had not thought about it. This will yield the fiber that sync_sdo_write_typed_from_task is called from until the SDO has returned, and thus not block the event loop. Great idea!

Does this fail if we don't execute it in a task? Why do you need to know whether it is run in a separate thread? AsyncWrite and Wait should be thread-safe. We could simply use this instead of sync_sdo_write_typed.

Still, from a performance point of view, it is not a good idea to use SDO in the real-time update loop at high frequency. For that, one should use PDO.

bjsowa commented 1 year ago

Does this fail if we don't execute it in a task? Why do you need to know whether it is run in a separate thread? AsyncWrite and Wait should be thread-safe. We could simply use this instead of sync_sdo_write_typed.

The documentation for the Wait method says:

This function MUST only be called from tasks submitted to the executor associated with this driver.

That's why I think we have to distinguish between calling the function from an executor task and calling it from another thread.

Still, from a performance point of view, it is not a good idea to use SDO in the real-time update loop at high frequency. For that, one should use PDO.

I agree. However, missing a PDO mapping in a slave configuration is a pretty easy mistake to make, and I think the user should be warned in that case. Execution should either be stopped instead of hanging, or, preferably, continue using SDO instead of PDO via the method I proposed.

hellantos commented 1 year ago

That's why I think we have to distinguish between calling the function from an executor task and calling it from another thread.

We might be able to use FiberThread. It probably requires some experimentation, but it seems promising. I would suppose that if it already returns true, we know we are in the executor. It would probably need to be called with ev::FiberFlag::SAVE_ERROR.

I agree. However, missing a PDO mapping in a slave configuration is a pretty easy mistake to make, and I think the user should be warned in that case. Execution should either be stopped instead of hanging, or, preferably, continue using SDO instead of PDO via the method I proposed.

I'm not 100% sure how to achieve this. An easy way would be a warning when using event mode.

hellantos commented 1 year ago

We might be able to use FiberThread. It probably requires some experimentation, but it seems promising. I would suppose that if it already returns true, we know we are in the executor. It would probably need to be called with ev::FiberFlag::SAVE_ERROR.

I tried this now; it does not work. It seems we'd need to patch lely_core_libraries for it. The other option is to pass down whether it is polling or event mode as a parameter in poll_timer_callback.