hippware / wocky

Server side (Erlang/Elixir) software

Geofence mechanics algorithm shortcomings #1713

Closed bengtan closed 5 years ago

bengtan commented 6 years ago

I uncovered what I think is an instance of the geofence debounce algorithm not operating as expected. I may be wrong so it would be good to get a second opinion.

The user is 'don' (2344c2ce-a82b-11e6-b3a0-0e2ac49618c7) at Staging. The bot is 'The Stidham Residence' (5ac5fd44-4a69-11e8-a052-0a580a02017d) at Staging.

@mstidham reports (via #qa: https://hippware.slack.com/archives/C2V6L53TQ/p1532525126000194 ) that don entered and exited 'The Stidham Residence' but no entry or exit push notification was generated.

I checked push_logs and confirmed that no entry or exit push notification was generated (to user at-miranda). So, why not? I investigated the user's location.

Trawling through user_locations, I was able to piece together the user's movements relative to 'The Stidham Residence':

// SELECT * FROM user_locations WHERE user_id='2344c2ce-a82b-11e6-b3a0-0e2ac49618c7' ORDER BY created_at DESC
"id","user_id","resource","lat","lon","accuracy","created_at","updated_at","is_fetch","speed","heading","altitude","altitude_accuracy","captured_at","uuid","is_moving","odometer","activity","activity_confidence","battery_level","battery_charging"
"BC08ED9B-80EE-4726-9895-CC480E195D5B","2344C2CE-A82B-11E6-B3A0-0E2AC49618C7","392DB611-BB57-4056-AB25-C87D4B9CD31D",36.4360348089006,-99.2569723539978,10.0,"2018-07-25 12:58:43.06968","2018-07-25 12:58:43.069706",false,29.89,279.49,555.5,24.0,"2018-07-25 12:57:06.906","AC411F9A-844E-4E4F-BE21-A40EF9FFA4E6",true,73942.9,"unknown",100,0.889999985694885,false
"9F73A3F3-D949-41B1-98C1-4018AE494F5D","2344C2CE-A82B-11E6-B3A0-0E2AC49618C7","392DB611-BB57-4056-AB25-C87D4B9CD31D",36.4211082738232,-99.2187590898007,5.0,"2018-07-25 12:45:33.829169","2018-07-25 12:45:33.829177",false,0.0,154.69,569.5,3.0,"2018-07-25 12:45:24.988","7FFE124B-7F31-4B81-B2B2-B4E4AF804952",false,70136.5,"unknown",100,0.959999978542327,false
"EFFCB287-FB2A-43DE-A2A0-DB8F718D901C","2344C2CE-A82B-11E6-B3A0-0E2AC49618C7","392DB611-BB57-4056-AB25-C87D4B9CD31D",36.421481729519,-99.2186979019075,10.0,"2018-07-25 12:45:33.308763","2018-07-25 12:45:33.308771",false,2.73,43.95,558.5,3.0,"2018-07-25 12:43:42.991","0A106ABB-9392-4F4C-BA07-7D0C6F3FB045",true,70094.7,"unknown",100,0.959999978542327,false

According to the location data points:

  • A point captured at 12:43:42 (uploaded at 12:45:33) shows the user moving slowly (speed 2.73, is_moving true).
  • A point captured at 12:45:24 (uploaded at 12:45:33) shows the user stationary (speed 0.0, is_moving false); this is the point that placed the user inside the bot.
  • The next point was not captured until 12:57:06 (uploaded at 12:58:43), by which time the user was moving quickly (speed 29.89) and was outside the bot.

I think the following happened:

  • At 12:58:43, a data point arrives but it indicates the user is outside the bot. The debounce algorithm looks at it and concludes that the user is now outside, and hence was inside the bot for only a few seconds around 12:45:33.
  • A few seconds is not long enough to satisfy debounce, so the algorithm concludes that the user neither entered nor exited the bot.

If there was a data point soon after 12:55:33 which was inside the bot (and presuming the debounce period is 10 minutes), debounce would have been satisfied and the algorithm would have generated an entry notification.

However, because the next data point was outside, no entry notification was generated.

It's a bit like an aliasing or sampling error.

So I think we have a bug here but before raising the alarm, I'd like to get a second opinion.

(It could be argued that this is a case of 'garbage in, garbage out'. The client wasn't sending timely location data points to the algorithm and the algorithm was only doing its best based on its input. This is for discussion after confirmation of the suggested diagnosis.)

toland commented 6 years ago

Your analysis is essentially correct. The algorithm works slightly differently but comes to the same conclusion. That having been said, the algorithm is also behaving as designed. This is an incredibly difficult corner case to handle, and our choices here are about how to fail gracefully when the data is not sufficient to make a sound decision.

The first thing to note is that the algorithm only runs when it receives new location data points. While there is a notional idea of a debounce "timer," it is not implemented with an actual timer. What happens is, when a new data point arrives that indicates the user has changed state, and the user is in a "transition" state, we compare the current time with the time that the user entered the transition state. If the difference is larger than the debounce interval, and the change is in the same direction (i.e., in or out), then they have entered/exited the bot. If the change is in the opposite direction, then we delete the transition event and pretend it never happened.
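
As a rough illustration of that comparison, here is a minimal Elixir sketch; the module, field names, and interval value are invented for the example and are not the actual wocky code:

```elixir
# Illustrative sketch only: names and the interval value are placeholders,
# not the actual wocky modules.
defmodule DebounceSketch do
  @debounce_interval_s 30  # whatever the configured debounce interval is

  # `transition` is the user's pending transition event for this bot
  # (nil, or %{direction: :in | :out, at: %DateTime{}}).
  # `inside?` says whether the new data point falls within the bot's perimeter.
  def handle_point(transition, inside?, now) do
    direction = if inside?, do: :in, else: :out

    case transition do
      # No pending transition: assuming the caller has determined that the point
      # implies a state change, start one.
      nil ->
        {:start_transition, %{direction: direction, at: now}}

      # Same direction and the debounce interval has elapsed: confirm entry/exit
      # (this is where the push notification would be generated).
      %{direction: ^direction, at: at} ->
        if DateTime.diff(now, at) >= @debounce_interval_s do
          {:confirmed, direction}
        else
          {:still_waiting, transition}
        end

      # Opposite direction: the user bounced, so the transition event is deleted
      # and treated as if it never happened.
      %{direction: _opposite} ->
        :delete_transition
    end
  end
end
```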

That is what happened here. The user enters the bot perimeter and goes into a transition_in state. The next data point that comes in has the user outside of the bot, so the transition_in state is deleted, which leaves the user in the exited state that they were in before all this started.

  • At 12:58:43, a data point arrives but it indicates the user is outside the bot. The debounce algorithm looks at it and concludes that the user is now outside, and hence was inside the bot for only a few seconds around 12:45:33.
  • A few seconds is not long enough to satisfy debounce, so the algorithm concludes that the user neither entered nor exited the bot.

While it gets to the same place, the conclusions of the algorithm are slightly different. What actually happens is that the algorithm concludes that the user bounced, because there is no proof that the user was inside the bot for at least the debounce interval. The algorithm was designed to only send a push notification when we know for sure that the user is where we think they are. In this scenario, the user could have been inside the bot for the required amount of time (and, in fact, they apparently were), or they could have bounced off the edge of the bot's perimeter. We just don't know, and the algorithm defaults to not sending a message when the user's location is ambiguous.

And what else can it do? When implementing debouncing, I considered setting a timer for the debounce interval and notifying when that timer expired if there was no proof that the user exited the bot perimeter in the meantime. That presented some difficult technical challenges, but it also fails in exactly the way that we don't want. In this case, the algorithm would have notified correctly; but in a case where the user actually did bounce, then it would have sent a false positive. I have tried to err on the side of not sending false positives. If we know there are going to be failures, we probably shouldn't call attention to them 😄.

(It could be argued that this is a case of 'garbage in, garbage out'. The client wasn't sending timely location data points to the algorithm and the algorithm was only doing its best based on its input. This is for discussion after confirmation of the suggested diagnosis.)

It turns out that it is very difficult to implement any kind of sane debouncing strategy when the location data points are so far apart. We have to begin making assumptions in the absence of real data, and that is going to lead to either false positives (sending notifications that are unwarranted) or false negatives (not sending notifications when they are warranted). I have chosen to err on the side of not sending false positives. Sending false positives would effectively defeat the purpose of trying to do debouncing in the first place.

So I think we have a bug here but before raising the alarm, I'd like to get a second opinion.

I don't think it is a bug. First, the algorithm is behaving exactly as it was designed. Second, I don't see how we can handle the situation in an ideal manner. Sending a notification in this case would also mean sending a notification for a user who bounced in a different case.

The only real solution to this issue is more consistent location data. Short of that, we are just talking about which failure mode is less undesirable.

toland commented 6 years ago

It occurred to me while I was out to lunch that there are some optimizations to the algorithm that are possible now that we are collecting more information with each location update. I don't consider these to be solutions to the problem in the OP; they are more like optimizations that can help us make better decisions faster in the future. Most of these ideas require consistent user location data, a long history of user locations to analyze, or both.

For example, the location update that put the user within the bot had the is_moving flag set to false. Perhaps we can infer from that fact that the user is very unlikely to bounce and skip debouncing. We are also tracking speed and heading now, and that may tell us more about whether the user is going to bounce. A user that is traveling quickly may be more likely to bounce, and a user that is also on an oblique heading with respect to the bot's perimeter may be even more likely to bounce.

An even better idea is to feed the speed, is_moving, activity, and activity_confidence fields into a Bayesian algorithm that could give us a probability that the user is going to bounce. We can improve the accuracy of that prediction by also taking into account the user's behavior with respect to that bot. If we know, for example, that the user bounces 93% of the time that they enter a certain bot's perimeter, we can be fairly sure they will bounce again. Combining all of this information to get a single prediction would be fairly straightforward, and, I think, reasonably accurate.

With that kind of probabilistic prediction model, we could create a rule that would skip debouncing if the probability of the user bouncing on this bot was less than a certain threshold (say 15%). With a slightly higher, but still low, probability we might use a shorter debounce interval.
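
To make that concrete, here is a rough Elixir sketch of how such a probability could be combined and mapped onto a debounce decision. The weights, thresholds, and names are all made up for illustration; they are not a real model or real data.

```elixir
# Hypothetical sketch of a probabilistic bounce prediction; everything here
# (weights, thresholds, module/function names) is illustrative only.
defmodule BounceProbabilitySketch do
  # `prior` comes from the user's history with this bot, e.g. 0.93 if they have
  # bounced on 93% of previous perimeter crossings.
  def bounce_probability(prior, %{is_moving: moving?, speed: speed,
                                  activity: activity, activity_confidence: conf}) do
    log_odds =
      logit(prior) +
        movement_evidence(moving?) +
        speed_evidence(speed) +
        activity_evidence(activity, conf)

    1.0 / (1.0 + :math.exp(-log_odds))
  end

  # Map the probability onto a debounce decision (thresholds are placeholders).
  def debounce_interval(p) when p < 0.15, do: 0    # skip debouncing entirely
  def debounce_interval(p) when p < 0.40, do: 10   # shorter interval (seconds)
  def debounce_interval(_p), do: 30                # full interval

  defp logit(p), do: :math.log(p / (1.0 - p))

  defp movement_evidence(true), do: 1.0
  defp movement_evidence(false), do: -2.0

  defp speed_evidence(speed) when speed > 10.0, do: 1.5   # driving past: likely bounce
  defp speed_evidence(speed) when speed > 2.0, do: 0.5
  defp speed_evidence(_speed), do: -1.0                   # stationary or walking

  defp activity_evidence("in_vehicle", conf), do: 1.0 * conf / 100
  defp activity_evidence("still", conf), do: -1.0 * conf / 100
  defp activity_evidence(_other, _conf), do: 0.0
end
```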

This can help mitigate small or sporadic gaps in the data, but large and consistent gaps will still be problematic. Also, we would need a broad data set to train the model on. I am not sure that staging would be sufficient since the activity there mostly reflects QA.

The smart plan may be to start collecting the data and designing/training the algorithm now, even though it will probably be months before we can really turn it loose in production. I suspect this kind of thing will need a lot of time in the lab before it is ready for prime time.

bernardd commented 6 years ago

Phil's covered this more than adequately, but the super short version is this: How does the data there not look like exactly the kind of transient "bounce" (incorrect/brief location within/outside a bot) that the algorithm is meant to prevent? And if we do trust that one location enough to mark the user as having entered, why have a debounce algorithm at all?

I don't think it is a bug. First, the algorithm is behaving exactly as it was designed.

Exactly.

bengtan commented 6 years ago

Firstly, thanks for confirming my hypotheses. Understanding is the first step and it's good that we mostly agree.

It's become apparent to me that our debouncing strategy/algorithm has limitations due to the way rnbgl (the location upload library we're using) works. Or it may not even be due to rnbgl; rnbgl may just be trying to do its best within the confines of iOS.

When we started with 'Oh, let's do debouncing', we had some first-order assumptions about how location upload would work, ie. we assumed it would be periodic and timely. Now we are finding out that rnbgl is anything but regular.

In fact, it could be argued that our debouncing strategy directly contravenes rnbgl's philosophy. Our strategy requires periodic and timely updates. Rnbgl (or iOS) explicitly tries to send as few data points as possible (in order to conserve battery, or to obey iOS).

We didn't know that then. We know that now.

Based on what (little?) we knew at the time, I think the existing debounce algorithm is fine. You implemented the best that you could, based on the (unwritten) assumptions we had. It's only now that we are discovering that our assumptions are questionable.


For example, the location update that put the user within the bot had the is_moving flag set to false. Perhaps we can infer from that fact that the user is very unlikely to bounce and skip debouncing.

I had a similar thought overnight.

Based on my (admittedly limited) anecdotes so far, the is_moving flag appears to be very informative ie. we can give it a lot of weight as an input in deciding whether to generate an entry/exit notification.

Let's sit on this a bit. I want to get more experience chasing down location data points.

(Off-topic: I also want to observe whether/how rnbgl does batching ie. postponing uploads if unable to connect)


From a higher level, I'm going to talk to @thescurry about the limitations of our current debouncing strategy. Maybe we need to change our strategy to be something more aligned with how rnbgl (and iOS) works (... although for that, I need to experiment with it some more first).

toland commented 6 years ago

Beng and I discussed this in a call this morning. I have a lot to say on the subject, but my brain is mush today and so I am going to hold onto my thoughts until my thinking is clearer.


One thing that I thought of after talking to Beng is that we will have problems if the client is doing batch location uploads. When the location upload API was written, we took a decision to only process the first packet in a batch. This was acceptable at the time since the client wasn't actually intentionally doing any batch uploads. The problem is that if we were to start processing all of the updates in a batch upload, it opens a whole new can of worms. Do we send push notifications for entry/exit events in the batch? What if a later event effectively cancels it out? If the oldest entry in a batch is quite old and it would normally cause a notification to be sent, do we still want to send the late notification even though it may no longer be relevant?

In short, processing a batch location upload is difficult technically and introduces a lot of thorny business logic problems. Which isn't to say we shouldn't do it, just, you know, caveat emptor and all that.
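
For reference, the current behaviour described above (only the first element of a batched upload is processed) might look roughly like the following sketch; the module and function names are illustrative, not the real upload handler:

```elixir
# Sketch only: process just the first element of a batched location upload.
defmodule BatchUploadSketch do
  # `locations` is the JSON array from the upload payload.
  def handle_upload([first | _rest]), do: process_location(first)
  def handle_upload([]), do: :ok

  # Geofence/debounce handling for a single data point would happen here.
  defp process_location(_point), do: :ok
end
```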

thescurry commented 6 years ago

We can circle up on this tonight via a Slack call @bengtan

bengtan commented 6 years ago

I had a good talk with Phil. It raised a number of ideas and things to try.

However, my mind is also jumping all over the place regarding this, so I'm going to wait for my opinions to settle down before making more statements. Otherwise I'm going to sound silly when I contradict myself.

Also, I want to spend some more time examining the behaviour of the client side library we're using (https://github.com/transistorsoft/react-native-background-geolocation)

TL;DR: percolating ... percolating ...

bengtan commented 6 years ago

Converting into an epic because I've come across a couple of other issues ... which I'm going to post as separate tickets so they don't get conflated together ... but at the same time I want to keep them linked together.

bengtan commented 6 years ago

Did some testing on what happens to location data points uploading if the phone doesn't have connectivity. In other words, 'batching' ... but I'm not sure if batching is a good term for this, so I'll just describe what I saw and then we can decide on terminology (if needed) later. (FWIW, I think a better term for this is 'delayed' or 'buffered' uploading, not batching.)

This is what I did:

  1. At home H. Phone was set to wifi only. No mobile data.
  2. Drove (about 15 mins?) to location A. Phone disconnected from wifi soon after leaving H and had no connectivity.
  3. Stayed at location A for about 10-20 minutes. Turned on mobile data and opened the app for several seconds. Then turned off mobile data.
  4. Drove back home. After arriving home, phone auto-connected to wifi.

So there was connectivity at home, and for several seconds at location A. There wasn't connectivity whilst in transit between home and A (and vice versa).

This is what I found.

If there is no connectivity, the client (via the rnbgl library) continues to generate location data points but doesn't upload them (because it can't). They get stored somewhere (but I don't know how much it can store, or if it persists after an app kill).

Once connectivity is established, the library starts uploading the data points, in chronological order, at about one or two data points per second. So in my test run, the library, once connected, was uploading continuously (1 or 2 points per second) for about a minute or so until it eventually caught up.

The location data points have a timestamp embedded in them. So 'delayed' data points can be easily detected by comparing the 'created_at' timestamp with the 'captured_at' timestamp.

(We can talk about how to handle this, if at all, later. Just posting observations for now.)

toland commented 6 years ago

I'm not sure if batching is a good term for this, so I'll just describe what I saw and then we can decide on terminology (if needed) later.

When I used the term "batching", I was referring to a very specific behavior where the client uploads multiple data points in one packet. The JSON element that contains the actual location data is an array and it is possible, at least in theory, that the client might send multiple locations in one upload. That is the scenario that we don't support right now and that will be painful to support on the server side.

The location data points have a timestamp embedded in them. So 'delayed' data points can be easily detected by comparing the 'created_at' timestamp with the 'captured_at' timestamp.

We might want to ignore data points that are "stale." For example, if the delta between captured_at and created_at is beyond some threshold. It seems obvious that we should ignore data points that are 24 or 48 hours old (if that happens), but less obvious when the threshold is much lower. Probably not something we have to decide now, but worth discussing.
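
A staleness check along those lines could be as simple as the following sketch; the threshold and names are placeholders, not a proposal for specific values:

```elixir
# Sketch of the staleness check suggested above; the cut-off is a placeholder.
defmodule StaleLocationSketch do
  @max_staleness_s 15 * 60  # hypothetical cut-off: 15 minutes

  # `captured_at` is the device timestamp embedded in the data point;
  # `created_at` is when the server stored it.
  def stale?(%{captured_at: captured_at, created_at: created_at}) do
    DateTime.diff(created_at, captured_at) > @max_staleness_s
  end
end
```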

bengtan commented 6 years ago

When I used the term "batching", I was referring to a very specific behavior where the client uploads multiple data points in one packet. The JSON element that contains the actual location data is an array and it is possible, at least in theory, that the client might send multiple locations in one upload. That is the scenario that we don't support right now and that will be painful to support on the server side.

Ah, no wonder you seemed reluctant to handle 'batching'. Yeah, let's not worry about the client uploading an array of points. We'll worry about that if it ever happens (I've not seen it do that yet).

We might want to ignore data points that are "stale." For example, if the delta between captured_at and created_at is beyond some threshold. It seems obvious that we should ignore data points that are 24 or 48 hours old (if that happens), but less obvious when the threshold is much lower. Probably not something we have to decide now, but worth discussing.

Agree. Also agree it's a sideshow, not the main thing.

bengtan commented 6 years ago

Okay, I'm gonna try an updated stab at this.

IMHO, the major problem we have is that ... the client stops uploading once it's stationary. This breaks the assumptions of the debouncing strategy and is the main reason for inaccurate entry/exit events.

We have some other problems too, ie. low-accuracy data points, delayed (uploading of) data points, no more data points unless the device moves more than 200 metres, etc., but these are minor and pale in comparison. We can address these (ie. create some heuristics) afterwards.

I want to go back to the main problem.

From the perspective of battery consumption, the client should send as few data points as possible (let's call this 'normal' mode, for lack of a better term). And yes, when it is stationary, it makes sense not to send anything.

Unfortunately, this breaks debouncing. Not every time, but it's too often.

At the other extreme, we can make the client library upload locations periodically (ie. 'preventSuspend'), but that will kill the battery.

Reading through the client library's docs, I think there is a possible compromise between normal mode and preventSuspend mode.

The idea is this ... When the device becomes stationary, the client turns on preventSuspend mode for N (N=30?) seconds. After N seconds, or if the device starts moving again (whichever is sooner), the client turns off preventSuspend mode (and resumes normal operation).

When the device is moving, the client generally uploads location data points frequently enough.

I'm hoping this is going to result in much better input into the debouncing algorithm, and result in debouncing working much better.


What should N be? I'm not sure. Naturally, the choice of debounce interval (currently 30 seconds) would have an impact. N has to be more than (double?) the debounce interval.

Battery consumption would also have an impact. Larger N means more battery usage.

It's conceivable that a good balance means a shorter debounce interval (than the current 30 seconds) so that N can be smaller.

I dunno. I'd have to get this idea working first, and observe it, before making any more statements than that.


@toland's idea of training a probabilistic prediction model is something that I think we can do in the (very?) long term. In the short term, I think we'll have to get by with judiciously turning on preventSuspend mode and some better heuristics.


Thoughts, please, anyone?

thescurry commented 6 years ago

@bengtan - I'm open to the idea of turning on preventSuspend for 60 seconds. We're well ahead of a release, so now is a good time to observe/measure how our tweaks might affect battery consumption.

bengtan commented 6 years ago

This was discussed over voice. Here's my summary of it:

@thescurry is fine with trying out a short-lived preventSuspend. He's also fine with modifying the '30 second' debounce as needed.

However, @bernardd, @toland and I talked, and we think an easier first step (which we should try before preventSuspend on the client-side) is to make better use of the data that the client uploads ie. is_moving, speed, accuracy etc. Possibly we might omit debouncing for some situations when it is obvious that the user is stationary and inside a bot.

So, the new plan is ... we're going to come up with some heuristics to make better use of this data, deploy it to Staging, and see how it performs.

(We'll keep preventSuspend in reserve for later. It might be a step 2, depending on how step 1 goes.)

thescurry commented 6 years ago

Thanks for the summary @bengtan. I would appreciate it if you guys [backend team] would find a spot (here or on Slack) to continue this discussion and to keep the rest of us up-to-date on what changes, tweaks, etc. we try in our effort to improve things. ✌️

toland commented 6 years ago

I think a good place to start would be to skip debouncing if the is_moving flag is false or the speed is between 0 and 2 m/s. 2 m/s is jogging speed, so we should be able to catch most users who are stationary or just walking around and assume that they won't bounce.
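
As a sketch (a hypothetical helper, not existing wocky code), that rule might look like:

```elixir
# Sketch of the heuristic above; `skip_debounce?/1` is a hypothetical helper.
defmodule DebounceHeuristicSketch do
  # Stationary or walking-pace users are assumed not to bounce.
  def skip_debounce?(%{is_moving: false}), do: true
  def skip_debounce?(%{speed: speed}) when speed >= 0 and speed <= 2, do: true
  def skip_debounce?(_point), do: false
end
```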

thescurry commented 6 years ago

Thanks, Phil. We should probably consider throwing out data when accuracy is low as well (isn't this low-hanging fruit?). It's possible to be walking around downtown LA/SF (tall buildings) or inside a large building and the client might start sending location data marked with low accuracy.
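
Sketching that filter (the threshold here is a placeholder; the actual value is discussed later in this thread):

```elixir
# Sketch of an accuracy filter; 50 m is a placeholder value (the thread later
# settles on 50 m, after an earlier 90 m setting).
defmodule AccuracyFilterSketch do
  @max_accuracy_m 50.0

  # `accuracy` is the reported uncertainty radius in metres; larger means less reliable.
  def usable?(%{accuracy: accuracy}), do: accuracy <= @max_accuracy_m
end
```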

bengtan commented 6 years ago

@thescurry:

We'll continue to discuss on this ticket here (ie. at github). Sometimes we chat privately on Slack, but for important stuff, I'll endeavour to post summaries here.

Occasionally, if there is something big or major enough, I'll also post something on Slack in order to keep everyone else informed.

At this point, @toland has shown enough interest in this topic that I think he'll want to take it on ... but he won't be fully into it at the moment because he wants to finish (or mostly finish) the EKS migration. I'm happy to let him decide on, and implement, the actual tweaks and changes. @bernardd and I are providing him with commentary and counter-balancing viewpoints as we go along.

thescurry commented 6 years ago

@bengtan Why did we decide not to fiddle with preventSuspend (ticket #2598)?

toland commented 6 years ago

Why did we decide not to fiddle with preventSuspend (ticket #2598)?

I'm not Beng, but I can answer your question 😄.

We want to see what changes we can make on the backend first. If we make too many changes at once, it will be hard to know which changes are working and whether any changes are confounding the others. If it turns out that we can't get what we want from backend changes alone, then we can investigate twiddling preventSuspend.

thescurry commented 6 years ago

Thank you proxy Beng! 👍

toland commented 6 years ago

I am about to open a PR with some changes to the debounce algorithm. I have requested a technical review of the code from Bernard and a review of the algorithm and behavioral changes from Steve and Beng. There is a full write-up of the changes on the PR itself.

bengtan commented 6 years ago

PR #1726 for this ticket has been merged.

Normally, this ticket would move to 'Waiting for Deployment'; however, since we expect to have a few more iterations in the future (of which the PR is just the first), I'm moving this ticket to Discussion instead ... because, well, I expect there will be discussion after testing of the first iteration.

toland commented 5 years ago

@bengtan This has been quiet for 7 months. Can we go ahead and close it out?

bengtan commented 5 years ago

Okay, since this has been idle for a long time, let's move to close it. We can always re-open or post new tickets in the future.

bengtan commented 5 years ago

However, I just want to document one more anecdote for future historical reference.

Background:

In August 2018, the location accuracy threshold was configured to 90m (PR 1786, 1806). IIRC, this was as a result of thescurry's testing while he was driving along a circular path to the car garage of WeWork PDC. That particular test case tends to produce location points with large accuracy uncertainty.

However, in November 2018, it was discovered that thescurry's Staging app was set to accuracy VERY_LOW (https://hippware.slack.com/archives/C2V6L53TQ/p1541024070047400). It has since been configured back to the correct value of HIGH.

So ... possibly, the testing in August 2018 may have been flawed.

Recently (April 2019):

mstidham experienced two spurious sets of geofence entry/exit events on Prod. I investigated and posted at https://hippware.slack.com/archives/C2V6L53TQ/p1554347057082000?thread_ts=1554338449.079300&cid=C2V6L53TQ. It turns out there were two inaccurate location data points which randomly popped up. Their accuracy values? 67.8 and 88.4 metres.

Make of this what you will.

So ... we'll deal with this another time (and probably in a separate ticket).

yeah, collectively our focus is elsewhere atm. I think we should take note of this and store the issue for now... work on it after we get the android release out. Lots of bits we can polish up to make the experience much better.

https://hippware.slack.com/archives/C2V6L53TQ/p1554404387083300

toland commented 5 years ago

I have always felt that 90m was too high. I am happy to lower it, and it is a very easy change to make. I think 50m is a good value based on analysis of the location data in the database.

Just let me know.

bengtan commented 5 years ago

And we did at: Configure location accuracy uncertainty threshold to 50m #2555