ros-drivers / pointgrey_camera_driver

ROS driver for Pt. Grey cameras, based on the official FlyCapture2 SDK.
128 stars 180 forks source link

Reconnect on error #79

Closed efernandez closed 7 years ago

efernandez commented 7 years ago

This implements the recovery strategy that was commented out, when an exception is thrown.

The strategy does: disconnect + connect + start. From my tests, there's no need to call stop before disconnect.

If any of the three steps fail, it's tried again after one second. From my tests, none of them fails.

I need this because after many hours running the camera, if fails to grab images or query the temperature property in the devicePoll loop: https://github.com/ros-drivers/pointgrey_camera_driver/blob/master/pointgrey_camera_driver/src/nodelet.cpp#L315

Since it's very difficult to reproduce, I've forced an error to happen every 10 iterations, when the TEMPERATURE property is read, with this code diff:

diff --git a/pointgrey_camera_driver/src/PointGreyCamera.cpp b/pointgrey_camera_driver/src/PointGreyCamera.cpp
index a40a415..eea8f3c 100644
--- a/pointgrey_camera_driver/src/PointGreyCamera.cpp
+++ b/pointgrey_camera_driver/src/PointGreyCamera.cpp
@@ -596,8 +596,15 @@ void PointGreyCamera::setTimeout(const double &timeout)

 float PointGreyCamera::getCameraTemperature()
 {
+  static int i = 0;
+  ++i;
   Property tProp;
   tProp.type = TEMPERATURE;
+  if ((i % 10) == 0)
+  {
+    std::cout << "Using bad property on purpose!" << std::endl;
+    tProp.type = static_cast<FlyCapture2::PropertyType>(999999); // @todo force error to check recovery
+  }
   Error error = cam_.GetProperty(&tProp);
   PointGreyCamera::handleError("PointGreyCamera::getCameraTemperature Could not get property.", error);
   return tProp.valueA / 10.0f - 273.15f;  // It returns values of 10 * K

After applying that change, we can reproduce/force an error to happen and see that the camera driver recovers gracefully, running this in different terminals:

$ roscore

$ rosrun nodelet nodelet manager

$ rosrun nodelet nodelet load pointgrey_camera_driver/PointGreyCameraNodelet manager

$ rosrun image_view image_view image:=/image_raw

The output would be like this:

Using bad property on purpose!
[ERROR] [1472676083.831936506]: PointGreyCamera::getCameraTemperature Could not get property. | FlyCapture2::ErrorType 1Feature lookup failed.
[INFO] [1472676083.835239900]: Disconnected from camera.
[INFO] [1472676083.852287185]: Connected to camera.
[ INFO] [1472676083.858863462]: Started camera.

:exclamation: If we don't handle the errors, as in the original code: https://github.com/ros-drivers/pointgrey_camera_driver/blob/master/pointgrey_camera_driver/src/nodelet.cpp#L371 the loop iterates very fast spamming the logs, with catastrophic consequences for the system (out of memory issues because of flooding /rosout mainly). It's important to note that the loop doesn't have any blocking functions other than grabImage, so when it fails, it returns immediately and we can have up to 1000 log errors per second.

@mikepurvis

efernandez commented 7 years ago

FYI, if I unplug the camera while the driver is running and I plug it again, it recovers gracefully with this PR. This is the output it'll show on the terminal:

[ERROR] [1472677524.876930403]: PointGreyCamera::getCameraTemperature Could not get property. | FlyCapture2::ErrorType 28Feature register read failed.
[ERROR] [1472677524.878818351]: Failed to disconnect with error: PointGreyCamera::disconnect Failed to disconnect camera | FlyCapture2::ErrorType 35Error stopping stream.
[ INFO] [1472677525.879065059]: Disconnected from camera.
[ERROR] [1472677528.121071034]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677529.121485640]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677530.121742309]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677531.122088810]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677532.122483199]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677533.122902717]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677534.123325582]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677535.123758305]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677536.124192551]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677537.124654832]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677538.125079421]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677539.125474241]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677540.125748367]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677541.126126094]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677542.126536094]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677543.126926990]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ERROR] [1472677544.127358681]: Failed to connect with error: PointGreyCamera::connect Failed to get first connected camera | FlyCapture2::ErrorType 19Failed to get the camera from the index.
[ INFO] [1472677548.174125225]: Connected to camera.
[ INFO] [1472677548.222552638]: Started camera.

Note this was run exactly with this PR, ie. w/o the hack to force an error when reading the TEMPERATURE.

mikepurvis commented 7 years ago

Looks good, thanks for the thorough testing.

efernandez commented 7 years ago

@mikepurvis I've updated the PR:

  1. Now connectCb() simply starts the devicePoll() loop in DISCONNECTED state. This has the benefit that we have error handling and recovery if something wrong happens in the callback. Before this PR, any error there blocks the callback.
  2. Adds states to the devicePoll(): ERROR -> [ STOPPED -> ] DISCONNECTED -> CONNECTED -> STARTED. This implements the different transitions and still allows the thread running devicePoll() to get interrupted. The diagnostics are also updated on every iteration.
  3. Cache the last configuration, so we keep the configuration in the error recovery. This seems to be also important to send the camera to idle when no subscriber is connected. I've tested that the LED intensity goes down when no subscriber is connected.
  4. The loop detects if the state changed, so we don't repeat error messages more than once.

I've unplug and plug as fast as I could, even during a recovery sequence, and the driver always comes back; note that it's not very easy to do this fast because the USB connector plugs in quite tight.

I've also try to start the driver before plugging in the camera. This still can leave the camera and the nodelet in a bad state. It needs the whole nodelet manager to be restarted. However, this is out of the scope of this PR and it's a very uncommon case.

If you're OK with the new implementation, please give me a :+1: . Thanks for your comments.

mikepurvis commented 7 years ago

LGTM 👍

Roboticom314 commented 7 years ago

Does the fix for this issue include cases like:

I ran 0.12.2 tag and then disconnected the camera while looking at the image_color topic in RVIZ. I was expecting the driver to reconnect and start publishing the image_color topic again but it didn't. I do see from the camera_nodelet_manager an Info message "Connected to camera." sent every second or so to the ROS console.

efernandez commented 7 years ago

@Roboticom314 I tested the "temporary power loss to the camera but ROS driver still running" use case and it worked for me. Do you mind opening an issue with more details?

  1. Camera model you're using
  2. ROS version
  3. Ubuntu (or Linux) distro

It'd be difficult to reproduce if we don't have the same camera, but at least we could try with the same software, or similar.

Roboticom314 commented 7 years ago

@efernandez Sorry for taking so long to reply.

  1. Bumblebee2
  2. Indigo
  3. Ubuntu 14.04

If you need more info, let me know.

efernandez commented 7 years ago

I haven't tried with a Bumblebee2, and I'm afraid I don't have any available to me ATM. Could you please create an issue a provide those details and also the steps to reproduce it?

This is a merged PR, so it's not really the place to discuss this. Simply reference it from the issue. :smiley:

Hopefully someone else can try with a Bumblebee2. Or I can figure out what's wrong from the terminal output, but from your comment it could be anything on the FlyCapture reconnection. Maybe try with newer versions of FlyCapture.