theCrag / website

theCrag.com: Add your voice and help guide the development of the world's largest collaborative rock climbing & bouldering platform
https://www.thecrag.com/

Statistic for located-ness of a node #844

Closed brendanheywood closed 10 years ago

brendanheywood commented 11 years ago

Locatedness of a node is the average location accuracy of all of its routes. If a route is located, we can assume it is accurate to within some fixed value like 10m.

If a node is located and the distance from its furthest point on one side to the other is 100m, then all of its routes are accurate to within 100m unless they have their own locations.

The use case is to quickly scan through a list of crags, in list view or wherever, and see which ones are well located and which are not at all.

scd commented 11 years ago

If I understand this correctly, this is a two-pass algorithm:

  1. Assign locatedness of leaf nodes (e.g. routes, or areas without any routes) by traversing up the ancestors until you find the closest located ancestor.
  2. Roll the average up through the ancestors.

There are some tricky parts to this algorithm. If you locate an area, for example, then you have got to recalculate the locatedness of all the descendant areas/routes.
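For concreteness, here is a minimal sketch of that two-pass approach (illustrative Python only, not the actual codebase; the Node structure and field names are assumptions):

```python
# Illustrative two-pass locatedness calculation over an assumed Node structure.
class Node:
    def __init__(self, name, accuracy=None, children=()):
        self.name = name
        self.accuracy = accuracy        # own location accuracy in meters, or None
        self.children = list(children)

def assign_leaf_accuracy(node, inherited=None):
    """Pass 1: every leaf gets the accuracy of its closest located ancestor
    (equivalent to traversing up from each leaf, just done top-down here)."""
    effective = node.accuracy if node.accuracy is not None else inherited
    if not node.children:
        node.effective_accuracy = effective
    for child in node.children:
        assign_leaf_accuracy(child, effective)

def rollup_average(node):
    """Pass 2: roll the averages up the ancestors; returns (sum, route count)."""
    if not node.children:
        total, count = node.effective_accuracy or 0, 1
    else:
        parts = [rollup_average(child) for child in node.children]
        total, count = sum(p[0] for p in parts), sum(p[1] for p in parts)
    node.locatedness = total / count
    return total, count
```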

This will be a good metric to help us determine content quality thresholds for charging via the app.

brendanheywood commented 11 years ago

I'm pretty sure you can get by with a single pass by storing part of the integral of the accuracy at each node and only averaging it at the end on the fly.

TRUE: sum(parent) == sum(child1) + sum(child2) + ... + sum(childN), but:

FALSE: avg(parent) == avg(child1) + avg(child2) + ... + avg(childN)

What we want is:

TRUE: avg(parent) == sum(parent) / count(parent)

If each node stores:

  • the sum of accuracies of its located routes (CA)
  • the number of its routes without their own location (URC)
  • its total route count (TRC)

Then for a given node you should be able to calculate average locatedness:

  1. Sum all children with accuracies together: CA1 + CA2 + ... + CAn = CA (unit is route-meters)
  2. Find this node's accuracy; if it has none, work up the tree until you get one -> NA (unit is meters)
  3. Sum the route count of unlocated nodes -> URC
  4. Get the total route count for the node -> TRC

So the total sum of accuracies is:

Sum = CA + NA * URC

and the average is:

Avg = (CA + NA * URC) / TRC

So if a node's boundary is changed, the only stats to get updated are itself and its ancestors, so at most around 10 nodes. Calculating any node's average accuracy on the fly is at most a few dozen multiplications.

I'm about 90% confident the details above are exactly right, but I'm pretty sure the concept is sound.
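A tiny worked example of that formula (the 10m and 100m figures are just the illustrative values from the opening comment; the 10-routes/4-located crag is made up):

```python
def locatedness_avg(ca, na, urc, trc):
    """Avg = (CA + NA * URC) / TRC, in meters."""
    return (ca + na * urc) / trc

# A 100m-wide crag with 10 routes, 4 of them located to ~10m each:
#   CA = 4 * 10 = 40, NA = 100, URC = 6, TRC = 10
print(locatedness_avg(ca=40, na=100, urc=6, trc=10))   # -> 64.0m average accuracy
```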

scd commented 11 years ago

I think I totally misunderstand then. This is what I thought:

Is this what you meant by route-level accuracy?

Before I make any further arguments I just want to make sure I am on the same page. There may well be a nicer way of doing it than I first thought, but I think we might be assuming different functionality.

brendanheywood commented 11 years ago

Yup, we're on the same page.

What I described should, fingers crossed, give exactly the same result as the two-pass process you outlined, just with speed O(log(n)) instead of O(log(n) * n), where n = # of routes. The key issue is that the function 'average' is not associative, whereas 'sum' is, so you can safely pre-compute the sums and only turn them into averages just-in-time, on the fly.
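A two-line numeric illustration of that point (made-up numbers):

```python
# Two children: 3 routes located to 10m and 1 route located to 100m.
sums, counts = [3 * 10, 1 * 100], [3, 1]
print(sum(sums) / sum(counts))   # 32.5m - summing then dividing is correct
print((10 + 100) / 2)            # 55.0m - averaging the children's averages is not
```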

Strictly speaking what I'm describing is also two-pass, except the first pass is done once at migration time to set all the totals for each node, and the totals are then updated incrementally whenever a node or one of its descendants is located.
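A sketch of what that incremental update could look like (illustrative only; the stored fields ca/urc and the parent pointer are assumptions, and the ~10m route accuracy follows the opening comment):

```python
def on_route_located(route_parent, route_accuracy_m=10):
    """A route just gained its own location: walk the ancestor chain and adjust
    the stored sums, so the cost is O(tree depth) rather than a full recalculation."""
    node = route_parent
    while node is not None:
        node.ca += route_accuracy_m   # route now contributes a concrete accuracy
        node.urc -= 1                 # and no longer relies on an ancestor's boundary
        node = node.parent
```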

scd commented 11 years ago

I'm not convinced by your conclusion that "if a node's boundary is changed the only stats to get updated are itself and its ancestors, so at most around 10 nodes."

Consider the following scenario: an unlocated cliff with 10 routes, 4 of which are located, which is the child of a located sector.

I think your pseudo-code is all right, but you potentially have to recalculate at least a portion of the descendants when a boundary changes.

Interestingly, we could make on-the-fly calculations using the /api/map/summary endpoint by just adding the node's diameter and multiplying out.

What is the actual use case for this though? If it is putting together a management spreadsheet to review crags then I will do it as part of generating the spreadsheet.

Then can we get a visual on the maps where we can quickly see differences between lots of crags? Or are you thinking of list view?

BTW I think it is a very important parameter for us.

brendanheywood commented 11 years ago

Sorry, that's a bit vague. There is the internal stat, which is stored against each node but never shown, which is the sum. It is these sums that are updated when a node is relocated, and only the ancestors need to be touched.

Then there is the average stat, which is shown, and which is calculated on the fly from the sums. And yes, this will be different in your scenario, but these averages are only ever calculated as needed and never stored. You can't cache them safely because you don't know when they are invalid - at least not without doing as much processing as the first method, which we can avoid.


scd commented 11 years ago

I think we need to store three things on each node to make this work:

  • sum of located nodes (CA)
  • number of unlocated nodes (URC)
  • number of nodes (TRC)

If URC is not zero then the on-the-fly calculation would need to traverse the ancestors to find the closest ancestor boundary diameter (NA). This then gets you your formula:

Avg = (CA + NA * URC) / TRC

Which is only ever calculated on-the-fly.
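That traversal, sketched in the same illustrative style (the diameter/parent/ca/urc/trc field names are assumptions, not the real schema):

```python
def nearest_ancestor_diameter(node):
    """Walk up from the node (itself included) to the closest located ancestor
    and use its boundary diameter as NA."""
    while node is not None:
        if node.diameter is not None:
            return node.diameter
        node = node.parent
    return None   # nothing located anywhere up the chain

def locatedness(node):
    """Avg = (CA + NA * URC) / TRC, computed on the fly from the stored stats."""
    if node.urc == 0:
        return node.ca / node.trc
    return (node.ca + nearest_ancestor_diameter(node) * node.urc) / node.trc
```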

So I think we agree on the metric and its maths :)

The locatedness metric does make one big assumption, and that is that the boundary of the closest ancestor is accurate. In other words, if somebody draws a small boundary around a crag as a placemarker then the metric will be showing the wrong thing.

The reason why I point this out is that Campbell put boundaries around 2000-odd crags, but I don't know if he did any more than a placemarker. I will have to ask.

brendanheywood commented 11 years ago

Yeah, the stat is only as good as the data. It could just be located but in the wrong place.

One thing to mitigate this could be to use the bbox instead of the boundary. This way, if the crag is small but its children are bigger, then the bbox should auto-expand.

But I'm thinking this will just mask issues and make the stat less accurate when the data is actually good. If the data is wrong we should make it obviously wrong so people fix it.


scd commented 11 years ago

I was going to use the bounding box anyway, which is actually based on the boundary. I was just going to use the max(height, width).

I think you are suggesting we use the cascading bounding box, which is the one built from the children. I think it is actually a good idea to use the cascaded bounding box if it is bigger than the node's own bounding box. Of course we would only use this logic if there was a boundary in the first place.

We are not really masking the issue, because a placemarker boundary which is smaller than the crag could produce statistics which make the crag look well located. If we use the bigger (or combined) of the bounding and cascaded bounding boxes then this will make the crag look less well located, which is what we want.

As an aside, we probably want to know all areas where the cascaded bounding box is bigger than the bounding box, as this is an error condition in the location data.
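A small sketch of that combination rule (illustrative; the (width, height) bbox representation is an assumption):

```python
def effective_diameter(own_bbox, cascaded_bbox):
    """Size proxy for a node: max(height, width) of its bounding box, expanded by
    the cascaded bbox from its children when that is bigger. Returns None when the
    node has no boundary of its own, since the logic only applies if one exists.
    Both arguments are (width_m, height_m) tuples; cascaded_bbox may be None."""
    if own_bbox is None:
        return None
    diameter = max(own_bbox)
    if cascaded_bbox is not None and max(cascaded_bbox) > diameter:
        # Children spilling outside the drawn boundary is also worth flagging
        # as an error condition in the location data.
        diameter = max(cascaded_bbox)
    return diameter
```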

brendanheywood commented 11 years ago


Yep, max(height, width) of the bounding box is a good enough proxy.


scd commented 11 years ago

For my own notes: I also have to add a trigger to recalculate the location stats when we add and delete routes.

scd commented 11 years ago

This is partly implemented (the variables are stored). So I know where I am up to, I have still got to:

brendanheywood commented 11 years ago

Is a reparent catered for on both the old parent and the new parent?


scd commented 11 years ago

Good point - I have to add the triggers for Reparent and Merge

scd commented 11 years ago

Actually, I just had a look at the code and the triggers for recalculating all stats are already in the reparent and merge code automatically. So, unless there is a bug, this one comes for free.

scd commented 11 years ago

I have done some baseline testing for Arapiles and found one irritating bug which eluded me for a while (I forgot to query on active locations, so sometimes details for inactive locations were being returned - doh).

Anyway, I have output a spreadsheet of the locatedness of nodes in Arapiles for review. Because there are no gaps in the Arapiles locations, the spreadsheet is fairly orderly. I think the results are looking reasonable.

Note to self. Still to

brendanheywood commented 10 years ago

Looks like this one is mostly done and an easy one to close off; it just needs to be output somewhere. I've moved it from long term into next release.

I reckon we calculate it for each child node in the area/locate template.

brendanheywood commented 10 years ago

I think the most useful place to display this is on the locate page, so then you can quickly drill down into poorly located areas and locate them at the same time.

brendanheywood commented 10 years ago

Thought I'd have a crack at getting this one over the line. I've exposed the locatedness stat on the locate page, adding it to the geometry:

http://dev.thecrag.com/climbing/australia/blackmans-bay/locate

But there seems to be something wrong with one of the stats which throws it all out.

E.g. Blackmans Bay above has 3 child areas, each of which is fully located to an accuracy of 100-200m.


Internally their summed totals are: 530, 5260, 200.

This should give a summed total for Blackmans Bay of 5990. The area size of Blackmans is (I think) 5043, and there are 26 routes, so the final equation should look like:

(5990 + 5043 * 0) / 26 = 230m

Which is about what you'd expect.

However, internally that sum isn't 5990, it's 131118, which is 5043 * 26.

5043 looks like it's the area size of Blackmans Bay, and there are 26 total routes, so either this stat is being generated wrongly or it's the wrong stat.

Here are the calcs being done for Blackmans and its 3 children:

11744707: (131118 + 0 * 5043) / 26 = 5043
11850043: (530 + 0 * 106) / 5 = 106
11850067: (5260 + 0 * 263) / 20 = 263
11850091: (200 + 0 * 200) / 1 = 200

But they should all really look like this:

11744707: (5990 + 0 * 5043) / 26 = 230.384
11850043: (0 + 5 * 106) / 5 = 106
11850067: (0 + 20 * 263) / 20 = 263
11850091: (0 + 1 * 200) / 1 = 200

All three children are only located at the area level, not the route level, so all their weightings get put onto the unlocated side, and the located sum for these should all be zero on the left. Whereas at the Blackmans Bay node, the number of routes without inherited locations is 0.

So it looks to me like the stat for the sum of locations is not being calculated properly, and on top of that the equation is pulling the wrong stats. Between them the two errors actually cancel out and give the right answer for leaf nodes, but mess it up further up the tree.

I think the root cause is that the stats that cascade up are being summed and then stored into the wrong stat in their parent node.
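To pin down what the cascade should look like, here is an illustrative sketch reproducing the Blackmans Bay numbers above (SimpleNamespace stands in for whatever the real node records are; the field names are assumptions):

```python
from types import SimpleNamespace as Node

def contribution(child):
    """(resolved accuracy sum, still-unlocated route count) a child passes up.
    A located child prices its own unlocated routes at its boundary diameter,
    so they arrive at the parent already resolved."""
    if child.diameter is not None:
        return child.ca + child.diameter * child.urc, 0
    return child.ca, child.urc

def rollup(children):
    parts = [contribution(c) for c in children]
    return Node(ca=sum(p[0] for p in parts),
                urc=sum(p[1] for p in parts),
                trc=sum(c.trc for c in children))

# The three located children of Blackmans Bay, none located at the route level:
kids = [Node(ca=0, urc=5, trc=5, diameter=106),
        Node(ca=0, urc=20, trc=20, diameter=263),
        Node(ca=0, urc=1, trc=1, diameter=200)]
parent = rollup(kids)
print(parent.ca, parent.urc, parent.trc)              # 5990 0 26
print((parent.ca + 5043 * parent.urc) / parent.trc)   # ~230.4m, the expected figure
```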

brendanheywood commented 10 years ago

It would also help a lot, both for debugging this and for locating stuff on an ongoing basis, to have the area size for each node exposed.

scd commented 10 years ago

Please push what you have done to the repo. I had a quick look and it seems as though you are passing through the area size as the locatedness statistic.

I cannot see anywhere in the code where I had finished the on-the-fly calculation. This makes sense, because my todo notes suggest it is still to do.

What I have to do:

brendanheywood commented 10 years ago

Yes, that sounds about right. I've renamed that to 'area size' in the template data and HTML table and pushed it: e25b56f

And sorry, I should have been clearer - I was using /opt/CIDS/script/test-locatedness, which at least superficially looks to me like it mostly implements the algorithm, and I was trying to get it working and then move it across into the codebase proper.

Because the calc for this should be quite light, I have added a stub into atomGeometry. Although this may be flawed, as we want this to always be present even when the location isn't available.

The only template I want it for to start with is the area/locate page.

Also, to clarify, what units is areasize in - meters or meters squared?

scd commented 10 years ago

I have found the bug for this - it was a simple configuration thing in the way the rollup statistics were being calculated (it was actually a bug across 4 of the newer statistics). It will not work on dev until I recalculate all the rollup stats. I think we should not worry and just do this in prod when I release.

Other minor fixes:

Maybe I don't worry about checking statistics delta mode for now, and just see how it goes. It should work, and there are a few use-case tests to do this properly. I'm time-poor and want to close this issue.

BTW, after the release I will have to rebuild the rollup statistics so I have made a note in the release milestone.

brendanheywood commented 10 years ago

shweet

brendanheywood commented 10 years ago

Should this be working on my dev box? The numbers I see on my system don't look right even now that my box has been rebuilt. How do I rerun the stats process for a particular node or crag and see this in action? Every locatedness just looks the same as the areasize:


"on-the-fly calculation is efficiently made in the node cache"

This item from the comment above worries me. I can't see how it's possible to cache this, because we don't know when it is invalid. That's why it needs to be on the fly. Unless the cache is only in memory for a very short time, like a single request or a couple of seconds.

scd commented 10 years ago

The stats have not been rebuilt on dev. I was just planning on doing this on prod to save time.


scd commented 10 years ago

The stats are now rebuilt on your dev. I am currently rebuilding the cache for all nodes, which will be finished by lunch. I did a quick reset of Arapiles to confirm that the stat actually works, so you can see that now.

Now for the second part of this discussion, your worry about caching the result. I think it is OK, but there is a small edge case where there may be a problem.

If you locate a descendant then the location stats will be updated. The statistics package should invalidate the cache for all parent nodes as this statistic propagates up the tree. Unless there is a bug, this use case should be fine.

The edge case which may end up with something out of date is where an area does not have a location and has to get it from a parent area. If the parent changes, then the area size used by the unlocated area may be a little out of date.

The only place I see this as a problem is on the actual location screen. Maybe I just force a recalculation of the cache for all nodes on the location screen?

The advantage of caching the stat is that we can get it everywhere with no performance hit.
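A sketch of that caching scheme and its invalidation (illustrative only; the cache is a plain dict keyed by node id, and the field names are assumptions):

```python
def invalidate_up(node, cache):
    """When a node's location changes, drop the cached value for it and every
    ancestor, since the stat propagates up the tree."""
    while node is not None:
        cache.pop(node.id, None)
        node = node.parent

def cached_locatedness(node, cache):
    """Cheap to show everywhere because it is cached. The known edge case is a
    descendant that inherits NA from an ancestor whose boundary later changes,
    so the locate screen could bypass or refresh the cache to stay exact."""
    if node.id not in cache:
        anc = node
        while anc is not None and anc.diameter is None:   # nearest located ancestor
            anc = anc.parent
        na = anc.diameter if anc is not None else 0
        cache[node.id] = (node.ca + na * node.urc) / node.trc
    return cache[node.id]
```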

brendanheywood commented 10 years ago

Looks really good. I'm pretty sure the stat is a max width, not an area, so I think we should tweak the display to meters, not meters squared? And yeah, if you can reset it when you use the locate page that would be awesome.

Because the areasize stat is not actually an area, it's probably misleading to have it showing at all (although it would be nice to show a real area - it would be pretty useful/interesting - so something for later).

scd commented 10 years ago

If it is meters, not meters squared, there may in fact be a bug based on a conceptual misunderstanding when I first implemented it. I am going to have to look into this more. Just don't close it yet.

It would explain why, if all descendants are unlocated, it ends up the same as the area size. I will review my algorithm tomorrow. It has been a while since I implemented it, so I just cannot remember.

brendanheywood commented 10 years ago

Locatedness.pm line 70 looks like the crux of the matter. But we are simplifying and using the bbox, not the shape, so this areasize wouldn't be that useful to display if we just returned the bbox area.

I think if we really want area size then it's a separate new independent stat which properly calculates the area using the shape.

brendanheywood commented 10 years ago

And I think conceptually we discussed it being a distance. In my mind, 'locatedness' is the average error in the distance I could be from where I need to be, which is best measured as a distance in m.

scd commented 10 years ago

You definitely had just m in the dimensions. This may be my bad.


brendanheywood commented 10 years ago

Never mind, I'll make a new issue to add that later. I've removed the areasize display for now. Closing.

scd commented 10 years ago

I have just looked at my own code again, and locatedness is definitely meters (not meters squared). I think I had a false memory, in particular with us mislabelling the areasize variable.