ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/

amcl failing with 'assert' errors in pf.c #1129

Closed mkhansenbot closed 4 years ago

mkhansenbot commented 5 years ago

Bug report

Steps to reproduce issue

Run the ctest_loop script:

cd build/nav2_system_tests
/home/mkhansen/ros2_dev/navigation2_ws/src/navigation2/nav2_system_tests/scripts/ctest_loop.bash -c $count \
-o /home/mkhansen/ctest_data/cyclone/$folder/ctest_loop_results.txt \
-l /home/mkhansen/ctest_data/cyclone/$folder/ctest_loop_results.log \
-d rmw_cyclonedds_cpp

Expected behavior

Test should pass.

Actual behavior

Test fails: 4/100 times with this assertion error:

12: [amcl-6] [WARN] [amcl]: Waiting for map....
12: [amcl-6] [INFO] [amcl]: Received a 384 X 384 map @ 0.050 m/pix
12: [amcl-6] [INFO] [amcl]: Setting pose (11.271000): -2.000 -0.500 -0.000
12: [amcl-6] [INFO] [amcl]: createLaserObject
12: [test_system_node-12] [INFO] [nav2_tester]: Setting initial pose
12: [test_system_node-12] [INFO] [nav2_tester]: Publishing Initial Pose
12: [amcl-6] [INFO] [amcl]: initialPoseReceived
12: [amcl-6] [INFO] [amcl]: Setting pose (11.371000): -2.000 -0.500 -0.000
12: [test_system_node-12] [INFO] [nav2_tester]: Waiting for amcl_pose to be received
12: [amcl-6] amcl: /home/mkhansen/ros2_dev/navigation2_ws/src/navigation2/nav2_amcl/src/pf/pf.c:379: pf_update_resample: Assertion `i < set_a->sample_count' failed.
12: [ERROR] [amcl-6]: process has died [pid 4188, exit code -6, cmd '/home/mkhansen/ros2_dev/navigation2_ws/install/lib/nav2_amcl/amcl --ros-args -r __node:=amcl --params-file /home/mkhansen/ros2_dev/navigation2_ws/install/share/nav2_bringup/launch/nav2_params.yaml'].

Additional information

Looking at the code here: https://github.com/ros-planning/navigation2/blob/16fda6af269c4e8f209f9ff953e8c009b2d67350/nav2_amcl/src/pf/pf.c#L379

      for (i = 0; i < set_a->sample_count; i++) {
        if ((c[i] <= r) && (r < c[i + 1])) {
          break;
        }
      }
      assert(i < set_a->sample_count);

It looks like the loop above will increment the "i" variable until it equals set_a->sample_count in the case where the 'break' condition is never hit.

I'm not familiar with this pf_update_resample() logic but this seems like a boundary condition and it should be less than or equal:

assert(i <= set_a->sample_count);
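
To make the fall-through concrete, here is a standalone sketch of the search (find_bucket is a hypothetical name, not a pf.c function): when r is at or beyond the last cumulative value, i exits the loop equal to sample_count, so relaxing the assert to <= would only wave an out-of-range index through to whatever uses i next.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical recreation of the pf.c search (find_bucket is a made-up
 * name): look for the bucket with c[i] <= r < c[i+1].  If r falls past the
 * last cumulative value, the loop falls through and i == n on exit --
 * exactly the case the failing assert is meant to catch. */
static size_t find_bucket(const double *c, size_t n, double r) {
  size_t i;
  for (i = 0; i < n; i++) {
    if ((c[i] <= r) && (r < c[i + 1])) {
      break;
    }
  }
  return i;
}
```
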
SteveMacenski commented 5 years ago

I would be curious to know whether it would still fail if you chose another r, i.e. place that for loop inside another for loop of 2 attempts, and see if the failures drop to 0. If they do, that's concerning, and maybe that for loop should be wrapped in a while. If not, could it be some interference between runs? It is important that it finds some non-last value.

mkhansenbot commented 5 years ago

@SteveMacenski - I'm not familiar with this logic, so I don't know exactly what it's trying to do here. Why is it important that it finds some value before it hits the end of the loop? Should it generate a new random r and then keep retrying? I could try that.

mkhansenbot commented 5 years ago

BTW - I tried changing the condition to <= : assert(i <= set_a->sample_count);

That gets past that assertion but then fails on the next one:

      assert(sample_a->weight > 0);

Changing that to be >= seems to 'fix' these failures, based on my testing so far (50+ runs without a failure). However, if those are only working around the symptoms, not the cause, they may not be the right fixes. I'm open to better fixes. As I stated above, I don't know these algorithms / this code.

SteveMacenski commented 5 years ago

I'd recommend trying the change below a few times and seeing if you can get 4/100 to drop to 0/100 for a few consecutive tries. I'm more concerned if that fixes it, and less concerned if it doesn't.

What this is doing is randomly sampling the particle space: it selects a number and tries to find some sample particle out of the set (obviously) within the distribution of probability values. This logic can probably be cleaned up at another time. Since c is an estimation of the PDF, when you sample randomly in (0,1] you're mostly within the center of the distribution of probabilities, with some outliers. What's happening is that some random value is selected that is so high, there currently aren't any particles that high in the PDF, which surprises me is possible after a new particle is born. I've personally never seen this in production; AMCL's been a champ for me in the past.

    bool done = false;
    for (int tries = 0; tries != 2; tries++) {
      if (done) {break;}
      r = drand48();  // draw a fresh value on each attempt
      for (i = 0; i < set_a->sample_count; i++) {
        if ((c[i] <= r) && (r < c[i + 1])) {
          done = true;
          break;
        }
      }
    }

    assert(...)
mkhansenbot commented 5 years ago

Don't you need to get a new random number? Otherwise, you're retrying twice with the same data, won't you get the same result?

SteveMacenski commented 5 years ago

Yes, you are correct. I just wrote some pseudo-code in the ticket; updating it to show that. If that reduces the test failure rate to 1/1000 (run the 100-trial suite 10x and have 1 failure in the mix), then we should just stick a while loop there and call it a day. However, this is indicative of potentially other problems, given that I have run AMCL for months straight on fleets of robots and never had this occur. If this test is actually indicative of real performance, that would mean on average a robot would seg fault every 100 meters of travel or so.

mkhansenbot commented 5 years ago

If this test is actually indicative of real performance that would mean on average a robot would seg fault every 100 meters of travel or so

I don't think that's true. I suspect this may be an initialization issue, but not sure how to prove it. It seems to always fail early in the test before the robot has actually moved. It could be a bug somewhere else in the initialization code that only gets hit under some conditions. Have you never seen AMCL crash when first powering up / starting a robot? Either way, I'm happy to try this suggestion. I'll post back later with results.

mkhansenbot commented 5 years ago

Update: I just ran the localization test 100x and see this exact failure 7/100 times.

I will try the fix suggested and re-run that test.

SteveMacenski commented 5 years ago

I can't say I've ever seen that with the ROS1 version, no.

mkhansenbot commented 5 years ago

Re-run with the fix suggested, I still see this same failure 6/100 times.

SteveMacenski commented 5 years ago

Huh. Even with resampled r values? I can't really understand how that could be. If you have ~4/100 failures, then on a resample each of those 4 has a ~4/100 chance of failing a second time, which should be 0.16/100, or ~2/1000 failures.

I'd expect on the order of 0-1/100 failures, or 0-10/1000 failures, with resampling.

mkhansenbot commented 5 years ago

I added a printf to the code when this condition is hit, and here is what I see:

11: [amcl-5] Error: i = 2000, r = 0.779539, prev_r = 0.917729, c[i - 1] = 0.738366, c[i] = 0.738954

The problem appears to me to be that the r values can be greater than the last value in the c[i] array, which is set in a for loop above the failing code:

  c = (double *)malloc(sizeof(double) * (set_a->sample_count + 1));
  c[0] = 0.0;
  for (i = 0; i < set_a->sample_count; i++) {
    c[i + 1] = c[i] + set_a->samples[i].weight;
  }

There's nothing here that guarantees that the last c[i + 1] value will be larger than r = drand48();

As you can see in this case, the last value of c[i] is 0.738954, which means there's a >26% chance that r will be greater than that value. And since these c[i] values vary each run, that % will vary also.

I think the only way to 'guarantee' that this assert is never hit is to make this keep retrying values of r until it succeeds.

    bool found = false;
    while (!found) {
      r = drand48();
      for (i = 0; i < set_a->sample_count; i++) {
        if ((c[i] <= r) && (r < c[i + 1])) {
          found = true;
          break;
        }
      }
    }
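
A possible alternative that avoids looping at all (my assumption, not something pf.c does; draw_in_range is a made-up name): since drand48() is uniform on [0.0, 1.0), scaling the draw by the cumulative total guarantees it always lands strictly below the last entry.

```c
#define _XOPEN_SOURCE 700  /* for drand48/srand48 */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical alternative to retrying (draw_in_range is a made-up name):
 * drand48() returns values in [0.0, 1.0), so scaling by the cumulative
 * total c[n] yields a draw that is always >= 0 and strictly below c[n],
 * no matter how far short of 1.0 the weights sum. */
static double draw_in_range(const double *c, size_t n) {
  return drand48() * c[n];
}
```

With r drawn this way, the search loop always finds an index, so the while/retry wrapper becomes unnecessary (assuming the total c[n] is positive).
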
mkhansenbot commented 5 years ago

I just ran 200x with that change and never see this failure. I'm going to submit a PR with this fix. I think the algorithm could be improved and fixed in a better way if someone with more knowledge than me on how particle filters work wants to take on that challenge. I'll note that in the PR.

SteveMacenski commented 5 years ago

That would definitely fix it. I'm also slightly concerned that you're seeing this on the order of 10 times out of 100 trials. A typical robot configuration updates AMCL, let's say, every 1 meter; that would mean you couldn't go 100 meters before failing.

At least for AMCL in ROS1, I can say that absolutely never happened, in 100 meters or 10 km. For AMCL in Nav2, I don't have hardware to test. If you don't see it failing, then there's some other issue at hand here, and we shouldn't introduce code to make tests pass without understanding "why".

mkhansenbot commented 5 years ago

I see nothing different at least in the pf_update_resample code vs. what is there in ROS1.

We've also driven the robot around for quite a long time in simulation, as far as I know we haven't seen this issue before.

I noticed that the call into the pf_update_resample is modulated by this code in AMCL: (Edit: fixed link) https://github.com/ros-planning/navigation2/blob/f2a9fb7a32f2b913a6438f149f94cccfdfd08760/nav2_amcl/src/amcl_node.cpp#L622

    if(!(++resample_count_ % resample_interval_))
    {
      pf_update_resample(pf_);
      resampled = true;
    }

In our code, we're initializing resample_interval to 1, while in ROS1 it is initialized to 2.
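
As a sanity check on that gating arithmetic, here is a minimal sketch (should_resample is a made-up helper mirroring the if statement above): with resample_interval=1 the gate fires on every update; with 2, it fires on every other update, starting with the second.

```c
#include <assert.h>

/* Sketch of the gate in amcl_node.cpp (should_resample is a made-up
 * helper): resampling fires only when the pre-incremented counter is an
 * exact multiple of the interval. */
static int should_resample(int *resample_count, int resample_interval) {
  return !(++(*resample_count) % resample_interval);
}
```
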

So I made this change:

diff --git a/nav2_system_tests/src/localization/test_localization_launch.py b/nav2_system_tests/src/localization/test_localization_launch.py
index 004b6ba6..d76c32d9 100755
--- a/nav2_system_tests/src/localization/test_localization_launch.py
+++ b/nav2_system_tests/src/localization/test_localization_launch.py
@@ -52,7 +52,8 @@ def main(argv=sys.argv[1:]):
     run_amcl = launch_ros.actions.Node(
         package='nav2_amcl',
         node_executable='amcl',
-        output='screen')
+        output='screen',
+        parameters=[{'resample_interval': 2}])
     run_lifecycle_manager = launch_ros.actions.Node(
         package='nav2_lifecycle_manager',
         node_executable='lifecycle_manager',

And when I ran the test 200x, I don't see the failure anymore.

So I'm starting to believe pf_update_resample may not be called at all in the case where resample_interval=2; maybe resample_count_ is being reset when it shouldn't be. If that is also the case in ROS1, it should be possible to test both cases.

SteveMacenski commented 5 years ago

I looked at my home robot's AMCL config; I've used 1 for a while, but I do know I have assets that run >1 as well: <param name="resample_interval" value="1"/>

Did you have a printout/something telling you it never gets into that loop or just a feeling?

SteveMacenski commented 5 years ago

I don't see how that really fixes the behavior. If we resample every 2 updates rather than 1, I don't see how the distribution would drop from a 6% failure rate to 0% reliably. Just the act of updating the motion model once between resamples doesn't affect the distribution in any reliable way that would make it stable. And resampling numbers in the for loop I suggested for debugging should also have given a proportionate x^2 drop in failure rates.

I agree we should try your suggestion; something feels off about it, but if it works, it works.

yathartha3 commented 5 years ago

@SteveMacenski

Since c is an estimation of the PDF

I think 'c' is the CDF, so any random sample between [0,1] should be valid.

@mkhansen-intel Maybe if ((c[i] <= r) && (r < c[i + 1])) should be ((c[i] <= r) && (r <= c[i + 1])), in case the random number generated by drand48() is 1.0, which is possible since drand48() returns values uniformly distributed over the interval [0.0, 1.0].

SteveMacenski commented 5 years ago

I got my terminology wrong, yes.

I'm saying that your solution looks like such a trivial change that I can't believe the original authors overlooked it. That's why my mentality is trust, but verify: try it out, and let's see what happens. You're also making a big assumption that this is actually a true CDF, which, given other things I've found in AMCL, is questionable.

SteveMacenski commented 5 years ago

To your edit, drand48 won't return [0, 1.0], it's [0, 1.0): https://linux.die.net/man/3/drand48

What I'm guessing is happening is that it picks a sample like 0.99 or something, but the sum of the PDF (AKA the CDF) is actually only 0.98 or something from rounding errors, though I'd have to see a lot of logs to make that declaration. It doesn't explain to me why running it with one model update between resamples magically fixes it.
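
If rounding drift really is the cause, one hypothetical guard (build_cdf is a made-up helper; pf.c does not do this) is to pin the final cumulative entry to exactly 1.0 after building it, assuming the weights were already normalized to sum to approximately 1, so no draw in [0, 1) can fall past the end.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical guard against rounding drift (build_cdf is a made-up
 * helper, not what pf.c does): build the cumulative array the same way
 * pf.c does, then pin the final entry to exactly 1.0.  This assumes the
 * weights were already normalized to sum to (approximately) 1. */
static void build_cdf(const double *weights, double *c, size_t n) {
  c[0] = 0.0;
  for (size_t i = 0; i < n; i++) {
    c[i + 1] = c[i] + weights[i];
  }
  c[n] = 1.0;  /* absorb any floating-point shortfall into the last bucket */
}
```
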

mkhansenbot commented 5 years ago

Did you have a printout/something telling you it never gets into that loop or just a feeling?

I added an RCLCPP_INFO and when resample_interval=2 I don't see this printout, but when it is =1 I do. So I suspect in the first case, it is never resampling.

     if (!(++resample_count_ % resample_interval_)) {
+      RCLCPP_INFO(get_logger(), "***************************************************** RESAMPLING ****");
       pf_update_resample(pf_);
       resampled = true;
     }

Just to clarify - I'm not sure what the problem is exactly, just relaying information. I haven't submitted a PR with the fix I mentioned above as I'm still trying to get to root cause. At this point I'd rather not submit that PR at all if we can debug why this is happening.

If this is a CDF, then the last value in the array, in this case c[2000] is supposed to be 1.0 right? That's not what I see, as I mentioned in my comment above:

11: [amcl-5] Error: i = 2000, r = 0.779539, prev_r = 0.917729, c[i - 1] = 0.738366, c[i] = 0.738954

So I think it is not a true CDF. If that's the case, I don't know what has changed since ROS1, as I don't think this function has really changed. Something may have changed in the underlying pf_init however; I'll have to take a deeper look next week.

SteveMacenski commented 5 years ago

I would definitely agree with the mentality of understanding the why before just patching it. At some point it may be worth just deleting AMCL and rewriting it but that's a time suck

mkhansenbot commented 5 years ago

This is becoming a time suck already

SteveMacenski commented 5 years ago

Well, I'm working on getting the resources required to make a 3D localizer/mapping stack; at that point my intention is to replace AMCL entirely with a 3D-enabled version (which you can still run in 2D as the special case of only 1 layer). Even then, I much prefer an optimization-based localizer, but if someone really cares about the kidnapped robot problem, the PF is kind of necessary. As a pro-user, AMCL gets tossed to the wayside very quickly, but there is certainly a place and a time for that type of solution. Perhaps when I start getting around to that, if it still aligns with your interests, we can collaborate on that work.

yathartha3 commented 5 years ago

To your edit, drand48 won't return [0, 1.0] its [0, 1.0): https://linux.die.net/man/3/drand48

Hmm, the first Google result (https://pubs.opengroup.org/onlinepubs/007908799/xsh/drand48.html) says it's [0,1]. Probably outdated. Also, I was referring to the CDF as an approximation. At this point I am more interested in optimization-based localization, or just something better than AMCL.

SteveMacenski commented 5 years ago

I've built a couple of graph-based localizers, and they're good for pro-users who know where their robot is to start with, but if you get genuinely delocalized or don't know your starting point within some reasonable range, then I haven't seen many ways in graph-based methods of fixing the kidnapped robot problem that don't involve keeping huge buffers of data.

PF-based methods (or filter-based, for that matter) have their place, and I think that's really what most people want. The optimizer-based methods are there to squeeze out those last drops of accuracy at the cost of some of that flexibility, because you don't want to maintain too large a graph or you start losing the ability to run in real time at >30 Hz. I could be wrong here, but there's not a great amount of literature on it, so this is mostly from my experience and from chatting with colleagues at other companies. I'd love to hear alternative solutions to fix that; I'm all for dropping filter-based localizers altogether.

yathartha3 commented 4 years ago

I added a printf to the code when this condition is hit, and here is what I see:

11: [amcl-5] Error: i = 2000, r = 0.779539, prev_r = 0.917729, c[i - 1] = 0.738366, c[i] = 0.738954

The problem appears to me to be that the r values can be greater than the last value in the c[i] array, which is set in a for loop above the failing code:

@mkhansen-intel This seems weird to me. At iteration 2000, the c values should sum up to 1, or at least be very close to 1, since they are calculated by adding normalized weights, I believe. When I ran it once, the final 'c' summed up to 1. It is going to fail if the last c value is ~0.74 like you are seeing. However, why that last c value is so small is what I am trying to understand. Saving the array of c values when it fails, along with the random number generated, might give us a clue.
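
A minimal sketch of that kind of diagnostic (dump_cdf is a made-up name; the stream and format are assumptions): when the search falls through, write r and the whole cumulative array out so the failing distribution can be inspected offline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical diagnostic (dump_cdf is a made-up name): dump the random
 * draw r, the final cumulative value, and every entry of the cumulative
 * array c[0..n] to the given stream. */
static void dump_cdf(FILE *out, const double *c, size_t n, double r) {
  fprintf(out, "r = %.9f, c[%zu] = %.9f\n", r, n, c[n]);
  for (size_t i = 0; i <= n; i++) {
    fprintf(out, "c[%zu] = %.9f\n", i, c[i]);
  }
}
```
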

mkhansenbot commented 4 years ago

OK, I have no idea why this was happening before, but it's not happening anymore; I ran the localization test 300x yesterday and it passed every time. So I'm wondering if somehow my system libraries were mismatched, or something else was specific to my system at the time I was seeing this. In any case, I can't reproduce it anymore, so I'm closing this. The good news is there's no bug; the bad news is it wasted a couple days of my time last month.

SteveMacenski commented 4 years ago

Didn’t other people see this too? That’s odd